jennis0 / burdoc

Advanced PDF parsing for python
MIT License

burdoc aborts while trying to parse a PDF file #6

Open MrUnknown789556 opened 1 year ago

MrUnknown789556 commented 1 year ago

I have some PDF files from the same publisher that I want to investigate using burdoc. Some of these files are parsed cleanly, while others are not parsed at all.

I use the cli version of burdoc.

Attached are one PDF file that is not being parsed and the log file that was generated.

The.pdf burdoc log.log

jennis0 commented 1 year ago

Thanks for highlighting. I'm currently travelling but will try and take a look at this and your other opened issue this weekend!

MrUnknown789556 commented 1 year ago

Dear Dennis,

Thanks for caring. Is it possible to extract the text without the tables, as an option? I use the CLI, so a parameter specifying that option would work well. It should be possible, especially because you wrote: "ML-Powered Table Extraction: Burdoc makes use of the latest machine learning models for identifying tables, alongside a rules-based approach to identify inline tables." That sounds impressive to me. It also suggests that you already have all the tables in hand, so leaving them out of the text when they are not wanted should not be a huge problem. By the way, what are "inline tables"?

Have a nice trip and a safe return. (I need extracted text from different PDF files, preferably without the tables included.)

Best regards,
Frank Nielsen
Denmark

(Sent by email in reply to jennis0's comment of Thursday, 25 May 2023, 00:52:41 CEST.)

jennis0 commented 1 year ago

Fixed the crash - should be available in v0.2.2.

On tables - unfortunately, the extraction algorithm currently in use doesn't seem to detect the tables in the PDF you posted. I'm hoping a branch I'm currently working on will greatly improve this, but it might be a little way off depending on how much time I can commit. In the meantime, you can use the '--no-ml-tables' option in the CLI to get a big speed boost, given that it won't find the tables anyway.

It might also be worth mentioning that Burdoc is currently extremely poor at handling maths. It's something I'd like to improve, but that will likely be quite a while in the future!

If you don't mind, I'd be quite keen to make use of your file for future testing, as it seems to hit a few edge cases I'd like to fix.
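For anyone scripting the CLI, here is a minimal sketch of how the '--no-ml-tables' option might be passed from Python. Only the flag name comes from this thread. The positional input and output arguments and the helper name `build_burdoc_command` are assumptions, so check `burdoc --help` before relying on them.

```python
from pathlib import Path


def build_burdoc_command(pdf_path, out_path=None, ml_tables=False):
    """Build an argument list for the burdoc CLI (hypothetical helper).

    Only '--no-ml-tables' is taken from this thread; passing the input and
    output files as positional arguments is an assumption about the CLI.
    """
    cmd = ["burdoc"]
    if not ml_tables:
        # Skip the ML table detector, which is slow and finds nothing here.
        cmd.append("--no-ml-tables")
    cmd.append(str(Path(pdf_path)))
    if out_path is not None:
        cmd.append(str(Path(out_path)))
    return cmd
```

With burdoc installed, the list can be handed to `subprocess.run(build_burdoc_command("The.pdf"))`, where `The.pdf` is the attachment from this issue.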

MrUnknown789556 commented 1 year ago

Thanks for the quick reply. I will supply you with further problematic PDF files in the near future. Don't worry.

Best regards,
Frank

(Sent by email in reply to jennis0's comment of Friday, 26 May 2023, 18:57:17 CEST.)

MrUnknown789556 commented 1 year ago

Jennis,

If you could supply me with the slightly modified and corrected version you made recently, I could run some tests of its ability to find the outlines of academic articles. I have collected about 20,000 scientific articles over the time I worked as an engineer. I could write a little script that calls the updated burdoc in a loop over these files and logs the outcome. I hope to hear from you.

Best regards,
Frank Nielsen

(Sent by email, forwarding the earlier reply of Friday, 26 May 2023, 19:21:04 CEST.)
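The batch test described above could look roughly like the sketch below. It is only an illustration: `run_burdoc` and `run_batch` are hypothetical names, and calling the CLI with the PDF path as a single positional argument is an assumption. The runner is passed in as a parameter so the loop and logging can be tried out without burdoc installed.

```python
import logging
import subprocess
from pathlib import Path

logger = logging.getLogger("burdoc_batch")


def run_burdoc(pdf_path):
    """Run the burdoc CLI on one file and return its exit code.

    Assumption: the CLI accepts the PDF path as a positional argument.
    """
    proc = subprocess.run(["burdoc", str(pdf_path)], capture_output=True, text=True)
    if proc.returncode != 0:
        # Keep only the tail of stderr so the log stays readable.
        logger.error("%s failed: %s", pdf_path, proc.stderr.strip()[-500:])
    return proc.returncode


def run_batch(pdf_dir, runner=run_burdoc):
    """Call `runner` on every *.pdf under pdf_dir.

    Returns a dict mapping each path to "ok" or "failed".
    """
    results = {}
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        try:
            code = runner(pdf)
        except Exception:
            # Keep going so that one bad file does not stop the whole run.
            logger.exception("runner raised on %s", pdf)
            code = -1
        results[str(pdf)] = "ok" if code == 0 else "failed"
        logger.info("%s: %s", pdf, results[str(pdf)])
    return results
```

Calling `logging.basicConfig(filename="burdoc_batch.log", level=logging.INFO)` before `run_batch(...)` would send the per-file outcomes to a log file. The file name is just an example.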

jennis0 commented 1 year ago

Hi Frank,

The fixed version should be available either by installing directly from the current repo or from pip as v0.2.2 (https://pypi.org/project/burdoc/0.2.2/).
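After upgrading (for example with `pip install burdoc==0.2.2`), a short check like the following can confirm that the installed release is at least the one carrying the fix. This is a sketch with hypothetical helper names. It assumes plain numeric version strings such as "0.2.2" and does not handle pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_crash_fix(installed, fixed_in="0.2.2"):
    """Return True if `installed` is at least the release with the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_burdoc_version():
    """Return the installed burdoc version, or None if it is not installed."""
    try:
        return metadata.version("burdoc")
    except metadata.PackageNotFoundError:
        return None
```

`has_crash_fix(installed_burdoc_version())` can then guard a test run, once the `None` case for a missing install has been handled.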

It would be great to hear about the performance you find - I've not done much of an assessment of academic articles, and it'd help me prioritise any future work (whenever I finally find some time to do some more dev!)

Joe