Parse Pubmed OA Paragraph

titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

http://titipata.github.io/pubmed_parser/

MIT License


Parse Pubmed OA Paragraph #59

Closed soupstandstop closed 4 years ago

soupstandstop commented 5 years ago

Hi, why does pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False) return an empty list?

soupstandstop commented 5 years ago

I expected it to return the section text of this PMC article.

titipata commented 5 years ago

@soupstandstop, that particular file has no text content for the given PMC article. You can check the file in the data folder.

soupstandstop commented 5 years ago

Yes, but even for a file whose PMC article does contain text, I still get an empty list.

soupstandstop commented 5 years ago

For example, take the nxml file in PMC5640403.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/00/00/. You can check this file; the return value is still an empty list.
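
For anyone reproducing this, here is a minimal sketch of pulling the .nxml member out of such an archive with Python's standard tarfile module. The helper name extract_nxml and any paths passed to it are hypothetical and not part of pubmed_parser.

```python
import tarfile
from pathlib import Path


def extract_nxml(archive_path, out_dir):
    """Extract every .nxml member of a .tar.gz archive into out_dir.

    Hypothetical helper (not part of pubmed_parser).
    Returns the list of written file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories, images, PDFs and anything else that is not nxml.
            if not (member.isfile() and member.name.endswith(".nxml")):
                continue
            # Write under the bare file name so a member path cannot escape out_dir.
            target = out_dir / Path(member.name).name
            with archive.extractfile(member) as src:
                target.write_bytes(src.read())
            written.append(str(target))
    return written
```

Assuming the archive has been downloaded locally, extract_nxml('PMC5640403.tar.gz', 'data') would return the paths of the extracted nxml files, which can then be passed to pp.parse_pubmed_paragraph.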

titipata commented 5 years ago

@soupstandstop, sorry for the late reply, and thanks! I can check it later this week. If you find a way to resolve the issue, please feel free to open a pull request!

soupstandstop commented 5 years ago

I have opened a pull request. If there is any problem, please tell me. Thanks!

MananVyas24 commented 5 years ago

I observed the same behavior with pp.parse_pubmed_paragraph for all NXML files (even the ones from the data samples). I am debugging it now to find out why this is happening and will keep you updated.

titipata commented 5 years ago

Thanks so much @MananVyas24. Let me know if you figure out where the error comes from. I do not have much time to debug but will check out the PR as soon as possible!

titipata commented 4 years ago

@soupstandstop I checked, and commits b2ccfe7 and a769744 fix this issue. Let me know if you have any problems parsing the paragraph text with a recent version of pubmed_parser.

Below is a snippet that parses the nxml file from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/00/00/PMC5640403.tar.gz

import pubmed_parser as pp
import pandas as pd

# Parse the nxml file extracted from PMC5640403.tar.gz
paragraphs = pp.parse_pubmed_paragraph('ott-10-4895.nxml')

# paragraphs (truncated):
# [
#   {'pmc': '5640403',
#    'pmid': '29070952',
#    'reference_ids': [
#        'b1-ott-10-4895',
#        'b2-ott-10-4895',
#        'b3-ott-10-4895',
#        'b4-ott-10-4895',
#        'b5-ott-10-4895',
#        'b6-ott-10-4895'],
#    'section': 'Introduction',
#    'text': 'With an incidence rate ...
#   }, ...
# ]
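
Once the list is non-empty, a common next step is to collect the text per section. The following is only a sketch that relies on the 'section' and 'text' keys shown in the output above. The helper name text_by_section and the sample data are made up for illustration and are not part of pubmed_parser.

```python
from collections import defaultdict


def text_by_section(paragraphs):
    """Join paragraph texts per section, keeping the order of first appearance."""
    grouped = defaultdict(list)
    for paragraph in paragraphs:
        # Fall back to an empty string if a key is missing.
        grouped[paragraph.get("section", "")].append(paragraph.get("text", ""))
    return {section: "\n".join(texts) for section, texts in grouped.items()}


# Illustrative input shaped like the parser output; the values are invented.
sample = [
    {"section": "Introduction", "text": "First paragraph.", "reference_ids": ["b1"]},
    {"section": "Introduction", "text": "Second paragraph.", "reference_ids": []},
    {"section": "Methods", "text": "Third paragraph.", "reference_ids": ["b2"]},
]
sections = text_by_section(sample)
```

With real data, text_by_section(paragraphs) could be called directly on the list returned by pp.parse_pubmed_paragraph.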