titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Abstract partially extracted #51

Closed deakkon closed 4 years ago

deakkon commented 6 years ago

For a (small?) subset of documents, only part of the abstract is extracted (e.g. PMID 24653627, 23357879, 27983391, 26762307, 28005260, 22351618, 23456555, 18006916, 25371446).
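One possible cause, not confirmed anywhere in this thread, is an abstract that is split across several `AbstractText` elements or that contains inline markup such as `<i>`: reading only the `.text` of the first element would then return a fragment. The sketch below uses the standard library only, and the sample XML is made up for illustration rather than taken from any of the PMIDs above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not copied from a real MEDLINE record.
SAMPLE = """<Abstract>
  <AbstractText Label="BACKGROUND">Gene <i>BRCA1</i> is studied.</AbstractText>
  <AbstractText Label="RESULTS">We found an effect.</AbstractText>
</Abstract>"""


def full_abstract(abstract_elem):
    """Join the text of every AbstractText child, including nested inline tags."""
    parts = []
    for node in abstract_elem.findall("AbstractText"):
        # itertext() walks nested tags such as <i> or <sup>;
        # .text alone stops at the first child element.
        text = "".join(node.itertext()).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


abstract = ET.fromstring(SAMPLE)
first_only = abstract.find("AbstractText").text  # truncated at the <i> tag
complete = full_abstract(abstract)
print(repr(first_only))
print(complete)
```

Comparing `first_only` with `complete` for a suspect record is a quick way to tell whether this is what is happening.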

titipata commented 6 years ago

Hi Jurica,

Can you point me to the location of the MEDLINE FTP files? I can load them and check from there.

Thanks for reporting :)

deakkon commented 6 years ago

Here you go:

['22351618'] /home/docClass/files/pubmed/medline17n1089.xml.gz
['24653627'] /home/docClass/files/pubmed/medline17n0789.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n1135.xml.gz
['26762307'] /home/docClass/files/pubmed/medline17n0855.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n0926.xml.gz
['22351618'] /home/docClass/files/pubmed/medline17n0718.xml.gz
['26762307'] /home/docClass/files/pubmed/medline17n0947.xml.gz
['18006916'] /home/docClass/files/pubmed/medline17n0584.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n0929.xml.gz
['27983391'] /home/docClass/files/pubmed/medline17n0908.xml.gz
['23456555'] /home/docClass/files/pubmed/medline17n0751.xml.gz
['25371446'] /home/docClass/files/pubmed/medline17n0811.xml.gz
['23357879'] /home/docClass/files/pubmed/medline17n0748.xml.gz

There are a bunch of others that fit the bill as well (I'm discovering them as I go through the documents). Some of the PMIDs repeat across multiple gz files; I assume that only one is the current copy and all the others are previous versions?
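If that assumption holds, and if higher-numbered files are the more recent ones (also an assumption here, worth checking against the NLM documentation), the repeated PMIDs can be reduced to one file each. This is a minimal sketch; `file_number` and `latest_file_per_pmid` are hypothetical helpers, not part of pubmed_parser, and the pairs are a subset copied from the list above.

```python
import re

# (PMID, file name) pairs copied from the list above.
hits = [
    ("22351618", "medline17n1089.xml.gz"),
    ("22351618", "medline17n0718.xml.gz"),
    ("28005260", "medline17n1135.xml.gz"),
    ("28005260", "medline17n0926.xml.gz"),
    ("28005260", "medline17n0929.xml.gz"),
]


def file_number(name):
    """Return the numeric suffix of a name like medline17n1089.xml.gz."""
    match = re.search(r"n(\d+)\.xml", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1))


def latest_file_per_pmid(pairs):
    """Keep, for each PMID, the file with the highest number."""
    latest = {}
    for pmid, name in pairs:
        if pmid not in latest or file_number(name) > file_number(latest[pmid]):
            latest[pmid] = name
    return latest


latest = latest_file_per_pmid(hits)
print(latest)
```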

titipata commented 6 years ago

@deakkon, can you give me the FTP link to these files? I didn't see them here: ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/. Basically, I cannot get the files from the list that you gave me.

Thanks!

deakkon commented 6 years ago

Hi,

Sorry, I completely forgot to answer... :/

I get my files from

ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ and ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

You can find the referenced gz files there. The values in between [ ] are the PMIDs for which an error has been reported.

Also, for several PMC documents, I get a TypeError: sequence item 0: expected string, NoneType found; e.g.

In [315]: pp.parse_pubmed_xml('/home/docClass/files/pmc/PMC2480501.nxml')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-315-5a8c63256ce3> in <module>()
----> 1 pp.parse_pubmed_xml('/home/docClass/files/pmc/PMC2480501.nxml')

/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser/pubmed_oa_parser.pyc in parse_pubmed_xml(path, include_path)
    119     issn_tmp = tree.findall('//issn[@pub-type="ppub"]')
    120     if issn_tmp is not None:
--> 121         issn_ppub = ' '.join([j.text for j in issn_tmp])
    122     else:
    123         issn_ppub = ''

TypeError: sequence item 0: expected string, NoneType found

More of the same:

/home/docClass/files/pmc/PMC4569628.nxml
/home/docClass/files/pmc/PMC2480498.nxml
/home/docClass/files/pmc/PMC2480500.nxml

I get these files from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ (either via the big gz files or via the links from the oa_file_list.txt).
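The traceback above suggests that the first `issn` element matched has no text, so `j.text` is `None` and `' '.join` fails. Note also that `findall` returns a list, never `None`, so the `is not None` check does not guard against this. A minimal sketch of one possible guard follows, using the standard library and a made-up sample; `join_text` is a hypothetical helper and this is not necessarily how the library fixed it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an empty ppub ISSN next to a populated epub ISSN.
SAMPLE = """<journal-meta>
  <issn pub-type="ppub"/>
  <issn pub-type="epub">1234-5678</issn>
</journal-meta>"""


def join_text(elements):
    """Join element texts with spaces, skipping elements whose text is None."""
    return " ".join(e.text for e in elements if e.text is not None)


tree = ET.fromstring(SAMPLE)
issn_ppub = join_text(tree.findall('.//issn[@pub-type="ppub"]'))
issn_epub = join_text(tree.findall('.//issn[@pub-type="epub"]'))
print(repr(issn_ppub), repr(issn_epub))
```

With this guard an empty tag yields an empty string instead of raising, which matches the fallback value in the `else` branch of the original code.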

Hope that helps! Sorry once again for the late reply.

titipata commented 6 years ago

@deakkon, no problem at all. Thanks for updating the file location. I'll check it out asap and get back to you.

deakkon commented 6 years ago

Additionally, here is the full list of PMC files for which I got "Error: it was not able to read a path, a file-like object, or a string as an XML":

/home/docClass/files/pmc/PMC2480501.nxml
/home/docClass/files/pmc/PMC4569614.nxml
/home/docClass/files/pmc/PMC5362956.nxml
/home/docClass/files/pmc/PMC4569628.nxml
/home/docClass/files/pmc/PMC4162892.nxml
/home/docClass/files/pmc/PMC2480500.nxml
/home/docClass/files/pmc/PMC2480499.nxml
/home/docClass/files/pmc/PMC2480498.nxml
/home/docClass/files/pmc/PMC2480496.nxml
/home/docClass/files/pmc/PMC5348996.nxml
/home/docClass/files/pmc/PMC5352161.nxml
/home/docClass/files/pmc/PMC5362810.nxml
/home/docClass/files/pmc/PMC2479409.nxml
/home/docClass/files/pmc/PMC5363022.nxml
/home/docClass/files/pmc/PMC5352154.nxml
/home/docClass/files/pmc/PMC2480502.nxml
/home/docClass/files/pmc/PMC4522714.nxml
/home/docClass/files/pmc/PMC2480497.nxml
/home/docClass/files/pmc/PMC5346358.nxml
/home/docClass/files/pmc/PMC4522719.nxml

titipata commented 4 years ago

Closing this issue due to inactivity. If the problem comes up again, please re-open the issue.