Closed by fareshan 5 months ago
Thank you for using the feature and for your feedback.
Processing speed is primarily determined by the presence of tables, as you noticed. We are always interested in improving this. Table recognition / identification is the major challenge; table extraction is comparatively cheap once we know where the tables and their cells are on a page.
This whole feature is currently pure Python code. While seeking to further improve table identification, we are also investigating porting the code down to our base C library, MuPDF. This will, however, take some time, and we have no estimates yet. We certainly understand that MuPDF's various language bindings would immediately profit from table recognition being available in C: beyond Python, MuPDF bindings exist for JavaScript and Java, and soon also for C#.
It would be a great help if you could provide us with this "long-runner" example PDF so we can check whether immediate speed-ups are within reach.
BTW have you noticed the new package pdf4llm on PyPI? It provides convenient access to this feature.
Thank you for explaining how speed correlates with table recognition.
Unfortunately, I couldn't find the files with which I had the speed issue yesterday. However, this is good news because it implies that the speed issue is a lot less significant than what I initially thought!
I agree; it will probably be faster once this is ported to the base C library, MuPDF.
Yes, I noticed the package availability on PyPI yesterday. 👍
I believe this library holds significant value in the context of LLMs. It would be even more valuable if its performance could be enhanced. Currently, it takes about 9 seconds to process 10 pages (with a long table) on an M1 processor.
Do you think incorporating Cython or Numba could help improve its speed? And are there any plans to implement such improvements?