neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Handle PDF parsing exceptions #22

Closed nialov closed 3 years ago

nialov commented 3 years ago

I created an articles.sqlite from a set of older, varied academic article PDFs, which probably had faulty or missing metadata and were likely parsed incorrectly by GROBID. With the changes proposed in this pull request I was able to run almost all of the PDFs, and for cases where PDF parsing failed completely I added a try/except clause in src/python/paperetl/file/execute.py that caught the exception.
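The try/except wrapper described above might look roughly like this; `process_all` and `process_file` are hypothetical stand-ins for the real loop in src/python/paperetl/file/execute.py, sketched only to show the keep-going-on-failure pattern:

```python
def process_all(files):
    """Process every file, collecting failures instead of aborting the run."""
    articles, failures = [], []
    for path in files:
        try:
            articles.append(process_file(path))
        except Exception as exc:
            # Record the failure and continue with the remaining files
            failures.append((path, exc))
    return articles, failures

def process_file(path):
    # Placeholder parser; the real code would call GROBID/TEI parsing
    if path.endswith(".bad"):
        raise ValueError(f"unparseable: {path}")
    return {"source": path}
```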

Most exceptions are AttributeErrors caused by None values returned from BeautifulSoup find calls and title attributes. I made gists of the other errors:

The changes in src/python/paperetl/study/design.py might not be necessary (and I wasn't entirely sure what I was doing there). The same error is caught by the try/except clause in src/python/paperetl/file/execute.py. Alternatively, a ValueError could be raised when the length is zero.

In tei.py I used source as a fallback for title when no title was parsed. That might be acceptable if articles are uniquely named anyway?

coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 458832613


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/python/paperetl/analysis.py | 1 | 2 | 50.0% |
| src/python/paperetl/file/tei.py | 3 | 5 | 60.0% |
| src/python/paperetl/file/pdf.py | 0 | 7 | 0.0% |
| Total | 5 | 15 | 33.33% |

Totals Coverage Status
Change from base Build 429040998: -0.5%
Covered Lines: 546
Relevant Lines: 815

💛 - Coveralls
davidmezzetti commented 3 years ago

Thank you for this pull request! Looks like you've got a good grasp of the code and made good progress.

The errors have me a bit curious. Are the files you're working with in the public domain (e.g. on arXiv or somewhere similar) and findable on the web?

While the changes work around the issues, I think the parser is skipping over text it could capture. The two errors seem related.

1.) Referencing the original tei.py file: instead of skipping over elements without a text attribute, how about converting them to text? The error indicated the elements were NavigableString objects.

Change line 168 from:

text = " ".join([str(e.text) for e in children])

to:

text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])
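A minimal illustration of why the hasattr check works, simulated without bs4: tag-like objects expose a text attribute, while a NavigableString behaves like a plain string that has none. FakeTag is a stand-in, not a bs4 class:

```python
class FakeTag:
    """Stand-in for a bs4 Tag, which exposes a .text attribute."""
    def __init__(self, text):
        self.text = text

def join_text(children):
    # Plain strings (like NavigableString) have no .text, so fall back to str()
    return " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])

# A section whose first child is a bare header string, followed by paragraphs
children = ["INTRODUCTION", FakeTag("First paragraph."), FakeTag("Second paragraph.")]
print(join_text(children))  # INTRODUCTION First paragraph. Second paragraph.
```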

2.) Attribute errors - It looks like the sections being passed into the Study.parse method are empty. I think the best change to make on this would be to change Study.parse in analysis.py, adding the following to the beginning of the method (line 60):

# Return default values if sections is empty
if not sections:
    return (0, None, None, None, [])

If this change works, I think you would be able to revert the changes to execute.py and design.py.
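The guard can be sketched as a standalone function; everything after the early return is a placeholder for the real Study.parse logic in analysis.py, not the actual implementation:

```python
def parse(sections):
    """Sketch of Study.parse with the suggested empty-input guard."""
    # Return default values if sections is empty
    if not sections:
        return (0, None, None, None, [])
    # ... the real design/size/method extraction would follow here;
    # these return values are placeholders for illustration only ...
    return (1, "design", "size", "method", sections)
```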

Couple other things:

1.) Line 229 (your version) tei.py

sections = TEI.text(soup, title) if TEI is not None else "sections"

I don't think TEI could ever be None since it is a class, so I believe this check can be removed.

Line 244 (your version) tei.py

I would wrap that in None validation instead of catching an AttributeError:

if title or reference:
    uid = hashlib.sha1(title.encode("utf-8") if title else reference.encode("utf-8")).hexdigest()
else:
    uid = None
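As a standalone helper, the None-validated uid derivation above looks like the following; `build_uid` is a hypothetical name, not part of paperetl:

```python
import hashlib

def build_uid(title, reference):
    """Derive a sha1 uid from title, falling back to reference, else None."""
    if title or reference:
        return hashlib.sha1(
            (title if title else reference).encode("utf-8")
        ).hexdigest()
    return None
```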

Line 164 - 166 (your version) tei.py can be simplified to:

if len(children) > 0 and not children[0].name:

2.) I see one of the commits rolled back a few autoformatting changes, but I still see a few changed lines that appear to be formatting-only rather than code changes. For those lines, please match the original formatting.

nialov commented 3 years ago

Thanks for the input! I've implemented the changes as you described and tested them against the same database as before. All PDFs now parse successfully, and the try/except clause in execute.py was no longer required.

The only change I wonder might have side effects is in src/python/paperetl/file/tei.py, line 213:

title = soup.title.text if soup.title is not None else source

For instance, should the hashing here

if title or reference:
    uid = hashlib.sha1(title.encode("utf-8") if title else reference.encode("utf-8")).hexdigest()
else:
    uid = None

prioritize title even if the title was set to source?

I also had to add additional exception handling in src/python/paperetl/file/tei.py, lines 157-160 (I'm not entirely sure what caused this to pop up):

# Verify that BeautifulSoup object is not None and has find_all method
if not hasattr(soup.find("text"), "find_all"):
    return sections
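The reason this guard helps: soup.find("text") returns None when the TEI document lacks a text element, and None has no find_all method. A small simulation without bs4 (FakeBody is a stand-in for a bs4 element, and the div lookup is illustrative):

```python
def safe_sections(body):
    """Return parsed sections, or [] if the body element is missing."""
    sections = []
    # soup.find("text") yields None for a degenerate TEI document;
    # None has no find_all, so guard before calling it
    if not hasattr(body, "find_all"):
        return sections
    return body.find_all("div")

class FakeBody:
    """Stand-in for a bs4 element that does support find_all."""
    def find_all(self, name):
        return [name, name]  # pretend two matching sections were found
```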

I can privately send the PDFs that had the faults requiring the exception handling I've implemented (3 of the 4 were downloadable via links found through Google Scholar; the fourth I'm not sure where I got, possibly course material, etc.). These were probably not the only faulty PDFs in this collection of 172, since I added PDFs to the test set as errors occurred. I can't say how common any of the faults were, as I was only interested in fixing each fault as it came along.

davidmezzetti commented 3 years ago

Thank you for the additional work. This is very helpful, and I think I understand what is happening now. There must be PDF files that GROBID fails to parse. I've been able to reproduce this, and there should be error handling to catch it.

There is also a case where GROBID does parse a file but doesn't extract anything useful.

I think these two situations are the root causes of these errors.

I've attached a patch with a subset of your changes and a couple additional ones that I think would be the best path to fixing this. Can you try it out on your data?

pdfexceptions.txt

You may need to manually apply the changes. The changes are only to paperetl/file/execute.py, paperetl/file/pdf.py and paperetl/file/tei.py, so it shouldn't be too bad. You should be able to revert all other changes.

If these changes work, we should be able to merge this pull request in.

nialov commented 3 years ago

I implemented the changes in pdfexceptions.txt but had to add back some of the other exception handling to run my test files:

# Return default values if sections is empty
if not sections:
    return (0, None, None, None, [])

# Attempt to parse section header
if len(children) > 1 and not children[0].name:
    name = str(children[0]).upper()
    children = children[1:]
else:
    name = None

text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])

Also, the make test-suite is now failing with this message:

https://gist.github.com/nialov/ba486fa693d437adb15afd5f28b7ad1d

davidmezzetti commented 3 years ago

Thank you for the quick changes.

The change in analysis.py is good; it's a good check to have.

With tei.py if you change:

if len(children) > 1 and not children[0].name:

to

if children and not children[0].name:

That should resolve the test issue.
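The difference between the two conditions shows up on single-child sections. The sketch below simulates elements with SimpleNamespace; in bs4, a header parsed as a NavigableString has no tag name, while tagged elements have a truthy name:

```python
from types import SimpleNamespace

header_only = [SimpleNamespace(name=None)]  # a section that is just a header
tagged = [SimpleNamespace(name="p")]        # a section with a lone paragraph tag

def is_header_old(children):
    # Original condition: requires at least two children
    return len(children) > 1 and not children[0].name

def is_header_new(children):
    # Revised condition: any non-empty list qualifies
    return bool(children) and not children[0].name

print(is_header_old(header_only))  # False: single-child header is missed
print(is_header_new(header_only))  # True
print(is_header_new(tagged))       # False
print(is_header_new([]))           # False: empty list handled safely
```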

I don't think the following change is needed after making the change above.

text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])

bs4 parses section headers into NavigableStrings and the remaining div/p sections into elements with a text attribute. Sections don't always have a header, but the only time a child should be a NavigableString is when it is a header element.

nialov commented 3 years ago

I refactored the children check and the tests pass now, but

text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])

was still needed.

davidmezzetti commented 3 years ago

Thank you for your diligence on this. I've merged the pull request. I appreciate the changes!

Please feel free to submit additional PRs if you encounter more issues or if it turns out this change didn't fully parse the documents correctly.