pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Add show_progress option to to_markdown() #122

Closed zane-programs closed 2 months ago

zane-programs commented 2 months ago

Hello there!

I'm not sure if it would be useful to others, but for my project I wanted to add a show_progress option to to_markdown() so I could see progress as each page was converted to Markdown.

Please let me know what you think. Thank you so much!

~Zane
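
For readers following along, here is a minimal sketch of what a print-based `show_progress` flag could look like. It is not the code from this PR or from pymupdf4llm: `pages_to_markdown`, `convert`, and the message wording are hypothetical, and pages are stood in for by plain strings so the example runs without any PDF library.

```python
from typing import Callable, List, Sequence


def pages_to_markdown(
    pages: Sequence[str],
    convert: Callable[[str], str],
    show_progress: bool = False,
) -> str:
    """Convert each page with `convert`, optionally printing progress.

    Hypothetical helper for illustration only; `convert` stands in for
    whatever per-page conversion the real library performs.
    """
    chunks: List[str] = []
    total = len(pages)
    for number, page in enumerate(pages, start=1):
        if show_progress:
            # One line per page, emitted before the page is converted.
            print(f"Processing page {number} of {total}...")
        chunks.append(convert(page))
    # Join the per-page results with a blank line between pages.
    return "\n\n".join(chunks)


if __name__ == "__main__":
    print(pages_to_markdown(["first page", "second page"], str.upper, show_progress=True))
```

Keeping the flag as a plain boolean means callers who do not want terminal output pay nothing for it.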

zane-programs commented 2 months ago

Also willing to use or roll a progress bar instead of print, if useful. :)

jamie-lemon commented 2 months ago

I'm personally a fan of progress bars and would love to see what a roll or progress bar looks like in this context too. :) Maybe we add another parameter for progress type, with "print", "roll" or "progress" available as options?

zane-programs commented 2 months ago

Sorry, when I said roll I meant I could "roll" my own progress bar. I considered it, but due to inconsistencies in some terminal emulators, I decided to try out a more robust progress bar solution (tqdm). That said, I'd also be down to write a very short iteration-based progress spinner in addition or instead.

Here's a demo GIF in case you want to see how it looks before running it yourself.

[Demo GIF: pymupdf4llm_demo_show_progress]

Please let me know what you think. Thank you! :)

JorjMcKie commented 2 months ago

While progress information may be a desirable (albeit not vital) feature, we will not add more external dependencies to the package. We may consider offering progress information output via Python built-in features like print or logging.
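
As a rough illustration of the `logging` route mentioned above, the sketch below wraps any iterable and emits one log record per item using only the standard library. The names `log_progress` and `progress_sketch` are assumptions made for this example and are not part of pymupdf4llm.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Hypothetical logger name; a real package would likely use its own namespace.
logger = logging.getLogger("progress_sketch")


def log_progress(items: Iterable[T], total: int, label: str = "page") -> Iterator[T]:
    """Yield items unchanged, logging one INFO record before each item."""
    for index, item in enumerate(items, start=1):
        logger.info("Processing %s %d of %d", label, index, total)
        yield item


if __name__ == "__main__":
    # The application, not the library, decides whether records are shown.
    logging.basicConfig(level=logging.INFO)
    for page in log_progress(["p1", "p2", "p3"], total=3):
        pass  # per-page conversion would happen here
```

One possible advantage over `print` is that the caller can silence or redirect the output through the usual logging configuration, without a dedicated parameter.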

jamie-lemon commented 2 months ago

Agreed - I don't think we should have any more dependencies within the package. I would be in favour of just using print here.

zane-programs commented 2 months ago

Good point! To avoid the external dependency, I quickly threw together a progress bar helper that uses only the standard library. Please see the demo GIF below and the recent commit in my fork.

[Demo GIF: pymupdf4llm_demo_progress_2]

If that's still not a great choice, I'm happy to revert to a basic print-based approach. While I understand that progress display isn't a critical feature, it could still be useful in workflows that involve processing extremely large/long PDFs.
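
The helper in the fork is not reproduced in this thread, so the following is only a guess at the general shape of a dependency-free progress bar, not the actual commit. `render_bar` and `with_progress` are hypothetical names; the bar is redrawn in place with a carriage return, which assumes a terminal that honours `\r`.

```python
import sys
from typing import Iterable, Iterator, TextIO, TypeVar

T = TypeVar("T")


def render_bar(done: int, total: int, width: int = 30) -> str:
    """Return a one-line text bar such as '[=====     ] 5/10'."""
    safe_total = max(total, 1)  # avoid division by zero for empty inputs
    filled = width * min(done, safe_total) // safe_total
    return "[" + "=" * filled + " " * (width - filled) + f"] {done}/{total}"


def with_progress(
    items: Iterable[T], total: int, stream: TextIO = sys.stderr
) -> Iterator[T]:
    """Yield items, redrawing the bar after the consumer finishes each one."""
    for done, item in enumerate(items, start=1):
        yield item
        # Runs when the consumer asks for the next item, i.e. after processing.
        stream.write("\r" + render_bar(done, total))
        stream.flush()
    stream.write("\n")  # finish the line once every item has been consumed


if __name__ == "__main__":
    for page in with_progress(range(50), total=50):
        pass  # per-page conversion would happen here
```

Writing to `stderr` by default keeps the bar out of any Markdown that a caller pipes from `stdout`; that choice is a design assumption of this sketch.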

jamie-lemon commented 2 months ago

@zane-programs @JorjMcKie Just trying this out and I think it works really well. Tried it with the classic MuPDF Explored doc (see: https://mupdf.com/docs/mupdf_explored.pdf).

Please check out the DX (Developer Experience) video attached and compare the first minute to the last minute.

I would argue that show_progress should default to True; I can't see any good reason not to. So personally I think we should do that and then get this PR merged in :)

https://github.com/user-attachments/assets/c9af614a-d244-4b9e-88f3-b767855b41f2

JorjMcKie commented 2 months ago

I agree Jamie, thanks for the insight! @zane-programs - before we can proceed please make sure to have read and confirmed agreement to the Artifex Contributor License Agreement (CLA). It can be viewed / downloaded here. If you agree, please insert a comment in this thread with the wording "I herewith confirm that I have read the Artifex CLA and agree with it".

Thank you for contributing!

zane-programs commented 2 months ago

I herewith confirm that I have read the Artifex CLA and agree with it.

zane-programs commented 2 months ago

Great, all should be done! I also made show_progress default to True as @jamie-lemon had suggested.

Would you like me to bump the version number, or would you?

JorjMcKie commented 2 months ago

Thank you! I will change the version number. More changes will be added soon.