huridocs / pdf_paragraphs_extraction

MIT License

Slow for long PDFs #116

Closed by loganlebanoff 6 months ago

loganlebanoff commented 6 months ago

Hi, I have a PDF that is 700 pages long. It takes 30 seconds to extract paragraphs from it when I use the docker-compose setup.

Is there a way to speed this up? Perhaps by using GPU or parallelizing computation among multiple workers?
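To make the multi-worker idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration, not how this project works: `split_page_ranges`, `extract_range`, and `extract_parallel` are hypothetical names, and `extract_range` is a placeholder standing in for whatever call would extract paragraphs from a range of pages. It also assumes pages can be processed independently, which may not hold for paragraphs that span a page boundary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_page_ranges(total_pages, workers):
    """Split pages 1..total_pages into contiguous (first, last) ranges,
    one per worker, with sizes differing by at most one page."""
    workers = max(1, min(workers, total_pages))
    base, extra = divmod(total_pages, workers)
    ranges, start = [], 1
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def extract_range(page_range):
    # Hypothetical placeholder: a real version would send these pages to an
    # extraction worker. Here it just returns one dummy item per page.
    first, last = page_range
    return [f"page {n}" for n in range(first, last + 1)]


def extract_parallel(total_pages, workers=4):
    ranges = split_page_ranges(total_pages, workers)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # pool.map returns results in the same order as the input ranges.
        chunks = list(pool.map(extract_range, ranges))
    # Flatten the per-range results back into one list in page order.
    return [item for chunk in chunks for item in chunk]
```

For a 700-page PDF and 4 workers, `split_page_ranges(700, 4)` gives four ranges of 175 pages each. Whether this would actually help depends on which steps of the pipeline can run per page.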

gabriel-piles commented 6 months ago

Hello,

Thank you for your interest.

Unfortunately, it is not easy to speed up the process. Let me explain.

For 700 pages, the time for each step is as follows:

So, theoretically, the second and fourth steps can be sped up, but that requires some work that we plan to do in the future. Unfortunately, we cannot give an estimate right now of when this will happen.

If you are interested in contributing to the project, you are welcome to do so.

Thank you

loganlebanoff commented 6 months ago

I see. Thank you for the detailed explanation!