gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

Chore: parse pdf as markdown via go-fitz (MuPDF) and remove vendor #21

Closed StrongMonkey closed 3 months ago

StrongMonkey commented 3 months ago

This PR adds the following things:

  1. Switch from unidoc to https://github.com/gen2brain/go-fitz. Although the underlying MuPDF library is still AGPL-licensed, we can use the latest code as long as this project stays open source.

  2. Parse PDF text as Markdown and store it in the vector DB. This seems to improve how accurately the LLM understands the content.

  3. Remove all the vendor files. With https://github.com/gen2brain/go-fitz we can't use vendoring, because the vendor directory doesn't include the C code by default. See https://github.com/gen2brain/go-fitz/issues/60 and https://github.com/golang/go/issues/27667.
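
For context on item 2: this description doesn't say how the Markdown is split before it goes into the vector DB. As a rough sketch only, one reason Markdown may help is that headings survive extraction, so text can be grouped under the heading it belongs to. `Chunk` and `SplitMarkdown` below are hypothetical names for illustration; they are not part of this PR or of go-fitz, and the sketch uses only the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical unit of text to embed and store in a vector DB.
type Chunk struct {
	Heading string // nearest Markdown heading above the body, "" if none
	Body    string // text under that heading, trimmed
}

// SplitMarkdown groups lines under their nearest ATX heading ("# ", "## ", ...).
// Illustrative only: this is an assumption, not the splitter used in this PR.
// It does not special-case fenced code blocks, so a "# comment" line inside
// a fence would be treated as a heading.
func SplitMarkdown(md string) []Chunk {
	var chunks []Chunk
	var heading string
	var body []string

	// flush emits the accumulated body as a chunk, skipping empty bodies.
	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		if text != "" {
			chunks = append(chunks, Chunk{Heading: heading, Body: text})
		}
		body = body[:0]
	}

	for _, line := range strings.Split(md, "\n") {
		trimmed := strings.TrimSpace(line)
		if isHeading(trimmed) {
			flush()
			heading = strings.TrimSpace(strings.TrimLeft(trimmed, "#"))
			continue
		}
		body = append(body, line)
	}
	flush()
	return chunks
}

// isHeading reports whether s looks like an ATX heading:
// one to six '#' characters followed by a space.
func isHeading(s string) bool {
	n := 0
	for n < len(s) && s[n] == '#' {
		n++
	}
	return n >= 1 && n <= 6 && n < len(s) && s[n] == ' '
}

func main() {
	md := "# Install\nRun the installer.\n\n## Windows\nUse the MSI package.\n"
	for _, c := range SplitMarkdown(md) {
		fmt.Printf("%q -> %q\n", c.Heading, c.Body)
	}
}
```

Each chunk keeps its heading next to its body, so the heading can be embedded or stored as metadata alongside the text. Plain extracted text has no such markers to split on.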

With this change, only three test cases are failing!

StrongMonkey commented 3 months ago

I have tested this PR on Linux and Windows; both work fine.

However, we now need to modify the build process to support cross-compiling, since the build involves a C library. This is a bit tricky, and I will open a PR later to fix the build and CI.