Closed deakkon closed 4 years ago
Hi Jurica,
Can you point me to the location of the Medline files on the FTP site? I can load and check from there.
Thanks for reporting :)
Here you go:
['22351618'] /home/docClass/files/pubmed/medline17n1089.xml.gz
['24653627'] /home/docClass/files/pubmed/medline17n0789.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n1135.xml.gz
['26762307'] /home/docClass/files/pubmed/medline17n0855.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n0926.xml.gz
['22351618'] /home/docClass/files/pubmed/medline17n0718.xml.gz
['26762307'] /home/docClass/files/pubmed/medline17n0947.xml.gz
['18006916'] /home/docClass/files/pubmed/medline17n0584.xml.gz
['28005260'] /home/docClass/files/pubmed/medline17n0929.xml.gz
['27983391'] /home/docClass/files/pubmed/medline17n0908.xml.gz
['23456555'] /home/docClass/files/pubmed/medline17n0751.xml.gz
['25371446'] /home/docClass/files/pubmed/medline17n0811.xml.gz
['23357879'] /home/docClass/files/pubmed/medline17n0748.xml.gz
There are a bunch of others which fit the bill as well (I'm discovering them as I go through the documents). Some of the PMIDs repeat across multiple gz files; I assume that only one is the actual copy and all the others are previous versions?
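If that assumption holds, one way to pick a single copy per PMID is to keep the file with the highest number in its name. This is only a sketch: the helper `latest_file_per_pmid` is hypothetical, and the rule "higher file number means newer version" is an assumption that this thread does not confirm.

```python
import re

# Assumption: the number after "n" in the file name (e.g. 1135 in
# medline17n1135.xml.gz) grows with each release, so the highest one is newest.
FILE_NUMBER = re.compile(r"n(\d+)\.xml(?:\.gz)?$")


def latest_file_per_pmid(records):
    """Map each PMID to the path with the highest file number.

    `records` is an iterable of (pmid, path) pairs.
    """
    latest = {}
    for pmid, path in records:
        match = FILE_NUMBER.search(path)
        if match is None:
            continue  # skip paths that do not follow the naming pattern
        number = int(match.group(1))
        if pmid not in latest or number > latest[pmid][0]:
            latest[pmid] = (number, path)
    return {pmid: path for pmid, (number, path) in latest.items()}


# A few (PMID, file) pairs taken from the list above.
records = [
    ("28005260", "/home/docClass/files/pubmed/medline17n1135.xml.gz"),
    ("28005260", "/home/docClass/files/pubmed/medline17n0926.xml.gz"),
    ("28005260", "/home/docClass/files/pubmed/medline17n0929.xml.gz"),
    ("22351618", "/home/docClass/files/pubmed/medline17n1089.xml.gz"),
    ("22351618", "/home/docClass/files/pubmed/medline17n0718.xml.gz"),
]
newest = latest_file_per_pmid(records)
```

Here `newest` holds one path per PMID, so each record would be parsed only once, from the file assumed to be the most recent.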
@deakkon, can you give me the link to these files on the FTP site? I didn't see them here: ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/. Basically, I cannot get the files from the list that you gave me.
Thanks!
Hi,
Sorry, I completely forgot to answer... :/
I get my files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ and ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
You can find the referenced gz files there. The values in between [ ] are the PMIDs for which an error has been reported.
Also, for several PMC documents, I get a TypeError: sequence item 0: expected string, NoneType found; e.g.
In [315]: pp.parse_pubmed_xml('/home/docClass/files/pmc/PMC2480501.nxml')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-315-5a8c63256ce3> in <module>()
----> 1 pp.parse_pubmed_xml('/home/docClass/files/pmc/PMC2480501.nxml')
/root/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pubmed_parser/pubmed_oa_parser.pyc in parse_pubmed_xml(path, include_path)
119 issn_tmp = tree.findall('//issn[@pub-type="ppub"]')
120 if issn_tmp is not None:
--> 121 issn_ppub = ' '.join([j.text for j in issn_tmp])
122 else:
123 issn_ppub = ''
TypeError: sequence item 0: expected string, NoneType found
More of the same:
/home/docClass/files/pmc/PMC4569628.nxml
/home/docClass/files/pmc/PMC2480498.nxml
/home/docClass/files/pmc/PMC2480500.nxml
I get these files from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ (either via the big gz files or via the links from the oa_file_list.txt).
Hope that helps! Sorry once again for the late reply.
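The traceback above suggests that at least one matched `issn` element has no text, so `j.text` is `None` and `' '.join(...)` fails. The check `issn_tmp is not None` does not guard against that case. A minimal sketch of a possible guard follows. It uses the standard library's ElementTree rather than the parser's own code, where `findall` returns a list (possibly empty). The helper `join_text` and the sample XML are hypothetical, and whether the failing files really contain an empty `issn` element is an assumption.

```python
import xml.etree.ElementTree as ET


def join_text(elements):
    """Join element texts with spaces, skipping elements whose text is None or empty."""
    return " ".join(el.text.strip() for el in elements if el.text)


# Made-up sample: the first <issn> is empty, so its .text is None.
sample = """<article><front>
  <issn pub-type="ppub"/>
  <issn pub-type="ppub">1234-5678</issn>
</front></article>"""

root = ET.fromstring(sample)
issn_tmp = root.findall('.//issn[@pub-type="ppub"]')
issn_ppub = join_text(issn_tmp)
```

With this guard an empty element is skipped instead of raising, and a document with no usable ISSN yields an empty string.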
@deakkon, no problem at all. Thanks for updating the file location. I'll check it out asap and get back to you.
Additionally, here is the full list of PMC files for which I got the error "it was not able to read a path, a file-like object, or a string as an XML":
/home/docClass/files/pmc/PMC2480501.nxml
/home/docClass/files/pmc/PMC4569614.nxml
/home/docClass/files/pmc/PMC5362956.nxml
/home/docClass/files/pmc/PMC4569628.nxml
/home/docClass/files/pmc/PMC4162892.nxml
/home/docClass/files/pmc/PMC2480500.nxml
/home/docClass/files/pmc/PMC2480499.nxml
/home/docClass/files/pmc/PMC2480498.nxml
/home/docClass/files/pmc/PMC2480496.nxml
/home/docClass/files/pmc/PMC5348996.nxml
/home/docClass/files/pmc/PMC5352161.nxml
/home/docClass/files/pmc/PMC5362810.nxml
/home/docClass/files/pmc/PMC2479409.nxml
/home/docClass/files/pmc/PMC5363022.nxml
/home/docClass/files/pmc/PMC5352154.nxml
/home/docClass/files/pmc/PMC2480502.nxml
/home/docClass/files/pmc/PMC4522714.nxml
/home/docClass/files/pmc/PMC2480497.nxml
/home/docClass/files/pmc/PMC5346358.nxml
/home/docClass/files/pmc/PMC4522719.nxml
Closing this issue due to inactivity. If the problem comes up again, please do re-open the issue.
For a (small?) subset of documents, only part of the abstract is extracted (e.g. PMIDs 24653627, 23357879, 27983391, 26762307, 28005260, 22351618, 23456555, 18006916, 25371446).
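One possible cause, which this thread does not confirm, is an abstract that is split over several `AbstractText` elements or contains inline markup: an element's `.text` only holds the text before its first child element. The sketch below shows the difference on a made-up sample using the standard library's ElementTree. The helper `full_abstract` is hypothetical and not part of the parser.

```python
import xml.etree.ElementTree as ET


def full_abstract(article):
    """Concatenate the complete text of every AbstractText element."""
    parts = []
    for node in article.iter("AbstractText"):
        # itertext() also yields text inside and after child elements.
        text = "".join(node.itertext()).strip()
        label = node.get("Label")
        parts.append("%s: %s" % (label, text) if label else text)
    return " ".join(parts)


# Made-up sample with two labelled sections and inline markup.
sample = """<Article><Abstract>
  <AbstractText Label="BACKGROUND">Levels of <i>IL-6</i> were measured.</AbstractText>
  <AbstractText Label="RESULTS">They rose.</AbstractText>
</Abstract></Article>"""

article = ET.fromstring(sample)
first_only = article.find(".//AbstractText").text  # stops at the <i> tag
abstract = full_abstract(article)
```

In this sample `first_only` stops at the inline tag, while `abstract` covers both sections. Comparing the two on one of the PMIDs above would show whether this explanation applies.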