Testing re-write-parser branch

jakelever commented 2 years ago

Hey, I'm just running some tests on the re-write-parser branch as we discussed. I tried to do a full run and ran into the error below. I narrowed it down to one file (I think) and had to fix an error there too.

Full Run Issue

# Commands for a full run
snakemake --cores 1 downloaded.flag
snakemake --cores 8 converted.flag

Traceback (most recent call last):
  File "src/convertPMC.py", line 48, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 324, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  [Previous line repeated 5 more times]
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 175, in merge_adjacent_xref_siblings
    prev_tail = siblings[-1].tail.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

Small File Test Case Error

I got it to dump out which file it was processing when it crashed and it seems to be the file Molecules/PMC6259225.nxml from the comm_use.I-N.xml.tar.gz archive. I have attached it.

PMC6259225.nxml.gz

I got a different error (due to my hacky fix for the invalid PMC XML files):

(mypy3) [jlever@munin biotext]$ python src/convert.py --i Molecules/PMC6259225.nxml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 372, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 252, in process_pmc_file
    content = source.read()
AttributeError: 'str' object has no attribute 'read'
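
The `'str' object has no attribute 'read'` error suggests `process_pmc_file` was handed a path string where it expected an open file-like object. A minimal sketch of one way to tolerate both, using a hypothetical helper `read_source` (an illustration only, not a function from the biotext code base):

```python
import io
from typing import IO, Union


def read_source(source: Union[str, IO[str]]) -> str:
    """Return the full text of `source` (hypothetical helper, not from biotext).

    Accepts either a filesystem path or an already-open file-like object.
    """
    if hasattr(source, "read"):
        # Already file-like, e.g. io.StringIO or an open file handle.
        return source.read()
    # Otherwise assume it is a path and open it ourselves.
    with open(source, encoding="utf-8") as handle:
        return handle.read()


# Works with an in-memory buffer, as convertPMC.py passes via io.StringIO(data).
print(read_source(io.StringIO("<article/>")))
```

With something like this, both the `io.StringIO(data)` call path and the plain filename call path would reach the parser with the same string content.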

jakelever commented 2 years ago

The above commit fixes the small file test error and gets us to the same error message as the large run. For the test file uploaded above:

$ python src/convert.py --i Molecules/PMC6259225.nxml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 378, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 330, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  [Previous line repeated 5 more times]
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 175, in merge_adjacent_xref_siblings
    prev_tail = siblings[-1].tail.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
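
For context on this `AttributeError`: in the ElementTree data model, an element's `tail` is `None` whenever no text follows its closing tag, which is exactly the case for two directly adjacent `xref` elements. The sketch below reproduces that with the standard-library parser (an assumption; the branch may use a different parser) and shows one possible guard. `safe_strip` is a hypothetical name, not the fix that was actually committed.

```python
import xml.etree.ElementTree as ET

# Two xref siblings with nothing between them: the first has no tail text.
root = ET.fromstring("<p>See <xref>1</xref><xref>2</xref> for details.</p>")
first, second = root.findall("xref")

print(repr(first.tail))   # None
print(repr(second.tail))  # ' for details.'


def safe_strip(value):
    """Strip whitespace, treating None as an empty string (hypothetical helper)."""
    return (value or "").strip()


# Calling first.tail.strip() directly would raise AttributeError; this does not.
print(repr(safe_strip(first.tail)))   # ''
print(repr(safe_strip(second.tail)))  # 'for details.'
```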

creisle commented 2 years ago

@jakelever I've fixed the issue here and this file now runs without errors. Can you try again on your end and confirm?

jakelever commented 2 years ago

Sorry for the slow progress. I started the full run again and had another issue, this time with PMC6766160.nxml.gz. I think I updated to your newest code. The error is below.

$ python src/convert.py --iFormat pmcxml --i Int_J_Genomics/PMC6766160.nxml --o test.bioc.xml --oFormat biocxml
Converting 1 files to test.bioc.xml
Traceback (most recent call last):
  File "src/convert.py", line 26, in <module>
    convert(inFiles,inFormat,args.o,outFormat)
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/main.py", line 49, in convert
    for bioc_doc in docs2bioc(in_file, in_format, **kwargs):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 378, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 330, in process_pmc_file
    article_elem, tag_handlers=tag_handlers
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/pmcxml.py", line 134, in extract_article_content
    article_elem.findall("./body"), tag_handlers=tag_handlers, annotations_map=annotations_map
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 368, in extract_text_chunks
    raw_text_chunks.extend(tag_handler(elem, tag_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 226, in tag_handler
    child_passages.extend(tag_handler(child, custom_handlers=custom_handlers))
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 225, in tag_handler
    for child in merge_adjacent_xref_siblings(elem):
  File "/projects/jlever/github/biotext_testing/biotext/src/bioconverters/utils.py", line 182, in merge_adjacent_xref_siblings
    siblings[-1].text = siblings[-1].text + siblings[-1].tail + elem.text
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
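
This `TypeError` comes from the same root cause as the earlier `AttributeError`: any of `text` or `tail` can be `None`, so a plain `+` chain fails as soon as one part is missing. A rough sketch of a `None`-tolerant merge is below. `merge_text_into_previous` is a hypothetical name, and the real `merge_adjacent_xref_siblings` may well handle this differently.

```python
import xml.etree.ElementTree as ET


def merge_text_into_previous(prev: ET.Element, elem: ET.Element) -> None:
    """Append prev.tail and elem.text to prev.text, treating None as ''.

    Hypothetical helper for illustration; not the biotext implementation.
    """
    parts = (prev.text, prev.tail, elem.text)
    prev.text = "".join(part or "" for part in parts)


# The first xref is empty, so its text is None and the original `+` would fail.
root = ET.fromstring("<p><xref/>, <xref>2</xref> shows</p>")
prev, elem = root.findall("xref")
merge_text_into_previous(prev, elem)
print(repr(prev.text))  # ', 2'
```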

creisle commented 2 years ago

No worries! Thanks for testing this for me! I'll get right on debugging this new one.

creisle commented 2 years ago

Ok, I have fixed this one and I added tests for all combinations of str/None in that function
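
As an illustration of what covering every str/None combination can look like (a sketch only, not the tests actually added to the branch), `itertools.product` can enumerate all eight cases for the three values involved. `join_text` is a hypothetical stand-in for the merging logic.

```python
import itertools


def join_text(text, tail, next_text):
    """Concatenate three optional strings, treating None as '' (hypothetical)."""
    return "".join(part or "" for part in (text, tail, next_text))


def test_all_none_str_combinations():
    # Each of the three inputs is either missing (None) or a one-character string.
    for parts in itertools.product([None, "a"], repeat=3):
        result = join_text(*parts)
        present = sum(part is not None for part in parts)
        assert isinstance(result, str)
        # Every present part contributes exactly one character.
        assert len(result) == present


test_all_none_str_combinations()
```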

creisle commented 2 years ago

@jakelever I tried running the full command myself but it seems like the preparePMC script is failing now b/c PMC may have changed the way their ftp is structured?

jakelever commented 2 years ago

Yes, crikey. It looks like they're now doing incremental updates. I'll have a think about how to change the workflow for that.

creisle commented 2 years ago

Looks like we may have a suitable short-term fix just using the deprecated folders:

In September 2021 PMC released new bulk download directory structures and packages to our FTP service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset.

The old bulk download structure remained in place until December 2021; the week of December 5-11 the old bulk files were moved respectively to sub-directories of oa_bulk and manuscript both named "deprecated". These directories named "deprecated" are temporary directories and will be deleted in March 2022. Learn more: https://www.ncbi.nlm.nih.gov/pmc/about/new-in-pmc/#2021-09-21

see: https://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt

jakelever commented 2 years ago

It looks like the deprecated data doesn't contain the XML versions for the oa_bulk sets, for some reason, so unfortunately we can't really use them. I propose I create another issue to deal with that, along with a proposed fix (which I think I've mostly got working).

For this issue, I used an old version of PMC data and have successfully run your new code across all of PMC with no other issues. So shall we merge the pull request and close the branch?

creisle commented 2 years ago

ok, sounds good!

jakelever / biotext

Testing re-write-parser branch #5
