suqingdong / pubmed_xml

PubMed XML Parser
https://suqingdong.github.io/pubmed_xml/

Error when a new abstract has no ArticleDate and PubDate has no month, only year #1

Open mjafin opened 11 months ago

mjafin commented 11 months ago

Hi @suqingdong, thank you for this fantastic package; I'm finding it super useful for my research. I'm going through some newly released abstracts and am hitting an error:

ParserError: Unknown string format: 2023-None-1

When I traced this back to the code, it's coming from

pdat = util.check_date(Article.find('ArticleDate') if Article.find('ArticleDate') is not None else Article.find('Journal/JournalIssue/PubDate'))

and further

def check_date(element):
    year = element.findtext('Year')
    month = element.findtext('Month')
    day = element.findtext('Day') or '1'

    return parse_date(f'{year}-{month}-{day}')

The issue here is that the article (PMID 36911757) currently has no ArticleDate, and its PubDate contains only a year. As a result, month is None and the string 2023-None-1 fails to parse. Any thoughts on how to address this?
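For illustration, here is a minimal sketch of one way a year-only PubDate could be tolerated. It is not the package's actual code: `check_date_safe` and `MONTHS` are hypothetical names, it returns a `datetime.date` rather than whatever `parse_date` returns, and it assumes that defaulting a missing month or day to 1 is acceptable.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical lookup for three-letter month names such as "Mar".
MONTHS = {name: i for i, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}


def check_date_safe(element):
    """Build a date from a date element, defaulting a missing month or day to 1.

    Hypothetical helper: returns None when the element or its Year is missing.
    """
    if element is None:
        return None
    year = element.findtext('Year')
    if not year:
        return None  # nothing usable without a year

    month_text = (element.findtext('Month') or '1').strip()
    day_text = (element.findtext('Day') or '1').strip()

    # Month may be numeric ("03") or a name ("Mar"); fall back to 1 otherwise.
    if month_text.isdigit():
        month = int(month_text)
    else:
        month = MONTHS.get(month_text[:3].lower(), 1)
    day = int(day_text) if day_text.isdigit() else 1

    return datetime.date(int(year), month, day)


# A PubDate with only a year, as in the failing record.
pub_date = ET.fromstring('<PubDate><Year>2023</Year></PubDate>')
print(check_date_safe(pub_date))  # 2023-01-01
```

Defaulting to January 1 keeps the date sortable but overstates its precision, so a caller that cares about the difference may prefer to keep the raw year alongside it.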

suqingdong commented 11 months ago

The bug has been fixed in version v1.0.1.

mjafin commented 11 months ago

@suqingdong cheers for the prompt fix, much appreciated. There is another issue I identified in pubmed_xml/core/parser.py: if the Journal/ISSN element is not present, Article.find('Journal/ISSN') returns None, and Article.find('Journal/ISSN').attrib['IssnType'] then makes the code error out. I made a dummy fix at: https://github.com/suqingdong/pubmed_xml/compare/master...mjafin:pubmed_xml:master#diff-a27d47d226e680c1e795927eefc42106838f5022f6fd56a814b6949f93547d07R62 using Article.find('Journal/ISSN').attrib['IssnType'] if Article.find('Journal/ISSN') else 'NA'
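One caveat with the truthiness check, assuming the parser uses the standard library's xml.etree.ElementTree: an element with no child elements evaluates as false, so `if Article.find('Journal/ISSN')` can take the 'NA' branch even when the ISSN element exists. A sketch of an explicit `is not None` guard follows; `issn_type` is a hypothetical helper, not part of the package.

```python
import xml.etree.ElementTree as ET


def issn_type(article, default='NA'):
    """Return the IssnType attribute of Journal/ISSN, or a default if absent.

    Hypothetical helper: compares against None instead of relying on
    element truthiness, which reflects child count rather than presence.
    """
    issn = article.find('Journal/ISSN')
    if issn is None:
        return default
    return issn.attrib.get('IssnType', default)


with_issn = ET.fromstring(
    '<Article><Journal><ISSN IssnType="Electronic">1234-5678</ISSN></Journal></Article>')
without_issn = ET.fromstring('<Article><Journal/></Article>')

print(issn_type(with_issn))     # Electronic
print(issn_type(without_issn))  # NA
```

Using `attrib.get` also covers the case where the ISSN element is present but carries no IssnType attribute.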