Closed nialov closed 3 years ago
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
src/python/paperetl/analysis.py | 1 | 2 | 50.0% | ||
src/python/paperetl/file/tei.py | 3 | 5 | 60.0% | ||
src/python/paperetl/file/pdf.py | 0 | 7 | 0.0% | ||
<!-- | Total: | 5 | 15 | 33.33% | --> |
Totals | |
---|---|
Change from base Build 429040998: | -0.5% |
Covered Lines: | 546 |
Relevant Lines: | 815 |
Thank you for this pull request! Looks like you've got a good grasp of the code and made good progress.
The errors have me a bit curious. Are the files you're working with public domain (i.e. on arXiv or something similar), that can be found on the web?
While the changes work around the issues, I think the parser is skipping over text it could capture. The two errors seem related.
1.) Referencing the original tei.py file, instead of skipping over elements without a text element, how about converting them to text? The error said the elements were NavigableString elements.
Change line 168 from:
text = " ".join([str(e.text) for e in children])
to:
text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])
2.) Attribute errors - It looks like the sections being passed into the Study.parse method are empty. I think the best change to make on this would be to change Study.parse in analysis.py, adding the following to the beginning of the method (line 60):
# Return default values if sections is empty
if not sections:
return (0, None, None, None, [])
if this change works, I think you would be able to revert the changes to execute.py and design.py.
Couple other things:
1.) Line 229 (your version) tei.py
sections = TEI.text(soup, title) if TEI is not None else "sections"
I don't think TEI could ever be None since it is a class. I believe this could be removed.
Line 244 (your version) tei.py
I would wrap that in None validation vs catching an AttributeError
if title or reference:
uid = hashlib.sha1(title.encode("utf-8") if title else reference.encode("utf-8")).hexdigest()
else:
uid = None
Line 164 - 166 (your version) tei.py can be simplified to:
if len(children) > 0 and not children[0].name:
2.) I see one of the commits rolled back a few autoformatting changes. I still see a few lines changed that appear to be formatting in nature vs code changes. For the lines with changes, please match the original formatting.
Thanks for the input! I've implemented the changes as you described and tested the changes with the same database as used earlier. All PDFs were able to be parsed now and the try-except
-clause in execute.py
was not required.
Only change I'm wondering if it might have side-effects is in src/python/paperetl/file/tei.py
, line 213:
title = soup.title.text if soup.title is not None else source
For instance, should the hashing here
if title or reference:
uid = hashlib.sha1(title.encode("utf-8") if title else reference.encode("utf-8")).hexdigest()
else:
uid = None
prioritize title even if the title was set to source?
I also had to add an additional exception handling in src/python/paperetl/file/tei.py
, lines 157-160 (I am not entirely sure if what caused this to pop up?):
# Verify that BeautifulSoup object is not None and has find_all method
if not hasattr(soup.find("text"), "find_all"):
return sections
I can privately send the PDFs that had the faults requiring the exception-handling I've implemented (3/4 of them were downloadable from Google Scholar found links, the fourth I'm not sure where I've got. Might be course material, etc.). These were probably not the only faulty PDFs out of this collection of 172 PDFs as I added the PDFs into the test set as the errors occurred but I cannot say how common any of the faults were as I was just interested in fixing each fault as they came along.
Thank you for the additional work. This is very helpful and I think I understand what is happening now. There must be PDF files that are failing from GROBID. I have been able to reproduce this and there should be error handling to catch this.
There is also a case where GROBID does parse a file but it doesn't parse anything useful.
I think these two situations are the root causes of these errors.
I've attached a patch with a subset of your changes and a couple additional ones that I think would be the best path to fixing this. Can you try it out on your data?
You may need to manually apply the changes. The changes are only to paperetl/file/execute.py, paperetl/file/pdf.py and paperetl/file/tei.py, so it shouldn't be too bad. You should be able to revert all other changes.
If these changes work, we should be able to merge this pull request in.
I implemented changes in pdfexceptions.txt but had to add some of the other exception handles to run my test files:
# Return default values if sections is empty
if not sections:
return (0, None, None, None, [])
# Attempt to parse section header
if len(children) > 1 and not children[0].name:
name = str(children[0]).upper()
children = children[1:]
else:
name = None
text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])
Also the make
test-suite is now failing with this message:
https://gist.github.com/nialov/ba486fa693d437adb15afd5f28b7ad1d
Thank you for the quick changes.
The change in analysis.py is good, it's a good check to have.
With tei.py if you change:
if len(children) > 1 and not children[0].name:
to
if children and not children[0].name:
That should resolve the test issue.
I don't think the following change is needed after making the change above.
text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])
bs4 parses
sections into a NavigableString and the remaining div/p sections as elements with a text attribute. Sections don't always have a header. But the only time it should be a NavigableString is when it is a element.I refactored the children check and tests work now but
text = " ".join([str(e.text) if hasattr(e, "text") else str(e) for e in children])
was still needed.
Thank you your diligence on this. I've merged in the pull request. I appreciate the changes!
Please feel free to submit additional PRs if you encounter additional issues or if it's found this change didn't fully parse the documents correctly.
I created an
articles.sqlite
from some older/varying academic article PDFs which probably had faulty/lacking metadata and which were probably invalidly parsed by GROBID. I was able to run almost all PDFs with the changes proposed in this pull request and if the PDF parsing completely failed, I added atry-except
-clause insrc/python/paperetl/file/execute.py
which catched the exception.Most exceptions are caused by
None
AttributeErrors
returned from BeautifulSoup finds/title-attributes. I made gists of other errors:ValueError
catched inexecute.py
AttributeError
catched intei.py
The changes in
src/python/paperetl/study/design.py
might not be necessary (and I wasn't really sure of what I was doing.) The same error is caught by thetry-except
-clause insrc/python/paperetl/file/execute.py
. Could just alternatively raise aValueError
when length is zero.In
tei.py
I usedsource
as alternative fortitle
whentitle
was not parsed. Might be ok when articles are anyway uniquely named?