Closed jakelever closed 2 years ago
The above commit fixes the small file test error and gets us to the same error message as the large run. For the test file uploaded above:
$ python src/convert.py --i Molecules/PMC6259225.nxml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 378, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 330, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  [Previous line repeated 5 more times]
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 175, in merge_adjacent_xref_siblings
    prev_tail = siblings[-1].tail.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
@jakelever I've fixed the issue here and this file now runs without errors. Can you try again on your end and confirm?
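For anyone reading along, the AttributeError above is consistent with how element trees represent a missing tail: when no text follows an element, `.tail` is `None` rather than an empty string. The snippet below is only a minimal illustration using the standard library's `xml.etree.ElementTree` (which parser the project actually uses is an assumption here), and `safe_strip` is a hypothetical helper, not the fix that was committed.

```python
import xml.etree.ElementTree as ET

# Two adjacent <xref> elements with nothing between them, then a third
# one at the very end of the paragraph.
para = ET.fromstring("<p>See <xref>1</xref><xref>2</xref> and <xref>3</xref></p>")
first, second, third = list(para)

print(repr(first.tail))   # no text between the first and second xref
print(repr(second.tail))  # the text " and " follows the second xref
print(repr(third.tail))   # nothing follows the last xref


def safe_strip(value):
    """Hypothetical helper: strip a text/tail value, treating None as ''."""
    return (value or "").strip()


# Calling first.tail.strip() directly would raise AttributeError,
# whereas the guarded version returns an empty string.
print(repr(safe_strip(first.tail)))
```

Adjacent cross-references with no separating text are exactly the case where a sibling's tail is missing, which would explain why only some articles trigger the crash.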
Sorry for the slow progress. I started the full run again and hit another issue with PMC6766160.nxml.gz. I think I had updated to your newest code. The error is below.
$ python src/convert.py --iFormat pmcxml --i Int_J_Genomics/PMC6766160.nxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 378, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 330, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 182, in merge_adjacent_xref_siblings
    siblings[-1].text = siblings[-1].text + siblings[-1].tail + elem.text
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
No worries! Thanks for testing this for me! I'll get right on debugging this new one
Ok, I have fixed this one and I added tests for all combinations of str/None in that function
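To illustrate what "all combinations of str/None" can look like, here is a rough sketch of a None-safe merge plus an exhaustive check. It is not the code that was committed: `merge_into_previous` is a hypothetical stand-in for the concatenation on line 182 of `utils.py`, the carrying-over of `elem.tail` is an assumption about the intended behaviour, and `xml.etree.ElementTree` is used only for illustration.

```python
import itertools
import xml.etree.ElementTree as ET


def merge_into_previous(prev, elem):
    """Hypothetical helper: fold elem's text into prev, treating None as ''."""
    prev.text = (prev.text or "") + (prev.tail or "") + (elem.text or "")
    # Assumption: the merged element should inherit the tail of the
    # element it absorbed.
    prev.tail = elem.tail
    return prev


def check_all_combinations():
    """Merge every str/None combination of prev.text, prev.tail and elem.text."""
    results = []
    for prev_text, prev_tail, elem_text in itertools.product([None, "a"], repeat=3):
        prev = ET.Element("xref")
        prev.text, prev.tail = prev_text, prev_tail
        elem = ET.Element("xref")
        elem.text = elem_text
        results.append(merge_into_previous(prev, elem).text)
    return results


# Eight combinations, none of which should raise TypeError.
print(check_all_combinations())
```

Enumerating the combinations with `itertools.product` keeps the test short while still covering the case from the traceback, where at least one operand of the `+` was `None`.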
@jakelever I tried running the full command myself but it seems like the preparePMC script is failing now b/c PMC may have changed the way their ftp is structured?
Yes, crikey. It looks like they're now doing incremental updates. I'll have a think about how to change the workflow for that.
Looks like we may have a suitable short-term fix: just use the deprecated folders.
In September 2021 PMC released new bulk download directory structures and packages to our FTP service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset.
The old bulk download structure remained in place until December 2021; the week of December 5-11 the old bulk files were moved respectively to sub-directories of oa_bulk and manuscript both named "deprecated". These directories named "deprecated" are temporary directories and will be deleted in March 2022. Learn more: https://www.ncbi.nlm.nih.gov/pmc/about/new-in-pmc/#2021-09-21
It looks like the deprecated data doesn't contain the XML versions of the oa_bulk sets, for some reason, so unfortunately we can't really use them. I propose creating a separate issue to cover that problem and a proposed fix (which I think I've mostly got working).
For this issue, I used an old version of PMC data and have successfully run your new code across all of PMC with no other issues. So shall we merge the pull request and close the branch?
ok, sounds good!
Hey, I'm just running some tests on the re-write-parser branch as we discussed. I tried a full run and ran into the error below. I narrowed it down to a single file (I think) and had to fix an error there too.
Full Run Issue
Small File Test Case Error
I got it to dump out which file it was processing when it crashed and it seems to be the file Molecules/PMC6259225.nxml from the comm_use.I-N.xml.tar.gz archive. I have attached it.
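One way to get that kind of "which file was it on" output is to wrap the per-file call and report the path before re-raising. This is only a sketch under assumptions: `convert_each` and `convert_one` are hypothetical names and do not correspond to anything in the repository.

```python
import sys


def convert_each(paths, convert_one):
    """Run convert_one on each path, reporting the failing path before re-raising."""
    for path in paths:
        try:
            convert_one(path)
        except Exception:
            # Print the offending file so a long batch run can be narrowed
            # down to a single small test case.
            print(f"Failed while processing: {path}", file=sys.stderr)
            raise
```

Re-raising keeps the original traceback intact, so the printed path and the stack trace appear together in the log.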
PMC6259225.nxml.gz
I got a different error (due to my hacky fix for the invalid PMC XML files)