helixml / helix

Multi-node production GenAI stack. Run the best of open source AI easily on your own servers. Easily add knowledge from documents and scrape websites. Create your own AI by fine-tuning open source models. Integrate LLMs with APIs. Run gptscript securely on the server
https://tryhelix.ai

better pdf to markdown #174

Open lukemarsden opened 9 months ago

lukemarsden commented 9 months ago

The current PDF text extraction doesn't generate Markdown and includes a lot of cruft.

https://github.com/VikParuchuri/marker looks like it might do a better job; worth giving it a try.

lukemarsden commented 9 months ago

In particular, two-column layouts, which are common in academic papers, cause absolute mayhem, and I'm surprised the model can make sense of the output at all.
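
To illustrate why two-column pages come out scrambled, here is a minimal sketch. It is not how Helix or marker actually works; `Block`, `naive_order`, and `column_order` are hypothetical names, and the block coordinates are made up. It assumes the extractor yields text blocks with a left edge and a top edge, and that the page has exactly two columns split at the midpoint of the page.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """A hypothetical extracted text block (illustration only)."""
    x0: float  # left edge of the block
    y0: float  # top edge of the block (smaller means higher on the page)
    text: str


def naive_order(blocks):
    """Read top to bottom, then left to right, ignoring columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]


def column_order(blocks, page_width):
    """Read the whole left column first, then the whole right column.

    Assumes exactly two columns split at the page midpoint.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.y0)
    return [b.text for b in left + right]


# Made-up page: two lines in each of two columns.
blocks = [
    Block(50, 100, "L1"),
    Block(320, 100, "R1"),
    Block(50, 120, "L2"),
    Block(320, 120, "R2"),
]

naive = naive_order(blocks)                    # interleaves the two columns
fixed = column_order(blocks, page_width=600)   # keeps each column together
print(naive)
print(fixed)
```

The naive ordering interleaves lines from both columns, which is the kind of mayhem described above. A real layout-aware tool would need to detect columns rather than assume a midpoint split, and would also have to handle full-width titles, figures, and footnotes.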