sciunto-org / python-bibtexparser

Bibtex parser for Python 3
https://bibtexparser.readthedocs.io
MIT License
468 stars, 130 forks

Wrong formatting leads entries to be silently ignored #230

Closed tomleung1996 closed 1 year ago

tomleung1996 commented 5 years ago

Hi, I found that some entries are missing when I load a BibTeX file with bibtexparser.load. My data is exported from Web of Science Core Collection and should contain 500 entries per file, but only 498 or 499 entries are parsed. I don't know why this happens. Can you help me?

My code is shown below:

import bibtexparser
import re

bibtex_filename = r'C:\Users\Tom\PycharmProjects\wos_crawler\output\advanced_query\2019-01-20-12.43.24\2001-2500.bib'
with open(bibtex_filename, 'r', encoding='utf-8') as bibtex_file:
    bib_db = bibtexparser.load(bibtex_file)

for i in bib_db.entries:
    print(i['unique-id'])

# with open(bibtex_filename, 'r', encoding='utf-8') as file:
#    pattern = re.compile(r'Unique-ID = {{(.+)}}')
#    result = pattern.findall(file.read())

# for i in result:
#    print('{'+i+'}')

My sample file is attached here: 2001-2500.zip. I found that the article with Unique-ID "ISI:000435037500001" is missing.

omangin commented 5 years ago

Thanks @tomleung1996 for reporting this.

The entry that does not work is the following:

@article{ ISI:000435037500001,
Author = {Maiello, Mark L. and Cole, Jessica and Vernetti, Elaine},
Title = {{BRIDGING THE EXPERTISE GAP: DEVELOPING AND UTILIZING A RADIOLOGICAL
   ADVISORY COMMITTEE FOR NEW YORK CITY}},
Journal = {{HEALTH SECURITY}},
Year = {{2018}},
Volume = {{16}},
Number = {{3}},
Pages = {{204-212}},
Month = {{JUN}},
Abstract = {{A significant radiological emergency response in New York City would
   require scientific expertise beyond the routine capability of the New
   York City Department of Health and Mental Hygiene (DOHMH) and its
   partner agencies. Health physicists (radiological safety specialists)
   are chronically in short supply in the United States, which translates
   into a limited supply available to local health departments facing a
   radiological crisis. These professionals support medicine, industry, and
   the military in routine, nonemergency situations. In order to prearrange
   the availability of this expertise, a radiological advisory committee
   (RAC) was formed. The committee engages leading experts in the fields of
   radiation medicine and environmental radiation science in anticipation
   of the technical questions that arise from the clinical aspects of
   internalized radioactivity and the mitigation of the urban environment
   following a terrorist attack using radioactive materials. The creation
   of the RAC and its application in a nonemergency public policy forum is
   described, as are the problems foreseen in operationalizing the RAC
   during an emergency. Some conclusions are drawn about the effort and
   cost of maintaining the RAC and the benefits obtained by maintaining it.
   This information may be useful for other jurisdictions seeking to form a
   similar expert committee.}},
Publisher = {{MARY ANN LIEBERT, INC}},
Address = {{140 HUGUENOT STREET, 3RD FL, NEW ROCHELLE, NY 10801 USA}},
Type = {{Article}},
Language = {{English}},
Affiliation = {{Maiello, ML (Reprint Author), Bur Agcy Preparedness \& Response, New York City Dept Hlth \& Mental Hyg, 42-09 28th St,6th Floor, Queens, NY 11101 USA.
   Maiello, Mark L.; Cole, Jessica; Vernetti, Elaine, New York City Dept Hlth \& Mental Hyg, Off Emergency Preparedness \& Response, Long Isl City, NY USA.}},
DOI = {{10.1089/hs.2018.0009}},
Early Access Date = {{JUN}},
Early Access Year = {{2018}},
ISSN = {{2326-5094}},
EISSN = {{2326-5108}},
Keywords = {{Radiological event; Dirty bomb; Advisory panel}},
Research-Areas = {{Public, Environmental \& Occupational Health}},
Web-of-Science-Categories  = {{Public, Environmental \& Occupational Health}},
Author-Email = {{mmaiello@health.nyc.gov}},
Cited-References = {{Blinder AS, 2000, 7909 NBER.
   Conference of Radiation Control Program Directors Inc, 2015, CRCPD PUBL, V15-1.
   Fitch K, 2001, RAND UCLA APPROPRIAT.
   Health Physics Society, 2004, HUM CAP CRIS TASK FO.
   Moss ML, 2012, DYNAMIC POPULATION M.
   MURPHY MK, 1998, HEALTH TECHNOL ASSES, V2, P1, DOI DOI 10.3310/HTA2030.
   Office of International Cooperation Radiation Medical Science Center Fukushima Medical University, HLTH MAN SURV.
   Shangraw R., 2003, EXPERT COMMITTE 0626.
   Tsuda T, 2016, EPIDEMIOLOGY, V27, P316, DOI 10.1097/EDE.0000000000000385.
   US Census Bureau, QUICK FACTS.
   US Centers for Disease Control and Prevention, 2012, NATL VITAL STAT REPO, V60.
   US Environmental Protection Agency, 2017, PROT ACT GUID.
   Wakeford R, 2016, EPIDEMIOLOGY, V27, pE20, DOI 10.1097/EDE.0000000000000466.}},
Number-of-Cited-References = {{13}},
Times-Cited = {{0}},
Usage-Count-Last-180-days = {{1}},
Usage-Count-Since-2013 = {{1}},
Journal-ISO = {{Health Secur.}},
Doc-Delivery-Number = {{GK0UG}},
Unique-ID = {{ISI:000435037500001}},
DA = {{2019-01-20}},
}

The issue is that it is not valid BibTeX (at least as far as what this project can parse) because of the following declarations:

Early Access Date = {{JUN}},
Early Access Year = {{2018}},

in which the field name contains spaces.

You should be able to fix that by replacing the spaces with '-'. However, the parser should probably raise an exception there, which it does not.

We will have to investigate further why it silently ignores the issue instead of raising.
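
The replacement could be automated before parsing. The following is only a minimal sketch, assuming a Web of Science style export in which every field starts at the beginning of a line as `Name = {` and continuation lines are indented. `hyphenate_field_names` is a hypothetical helper, not part of bibtexparser.

```python
import re

# Assumption: field names start at column 0 and are followed by "= {".
# Only names made of several space-separated words are matched, so
# names such as "Web-of-Science-Categories" are left untouched.
FIELD_NAME = re.compile(
    r"^([A-Za-z][A-Za-z0-9_-]*(?: +[A-Za-z0-9_-]+)+)(\s*=\s*\{)",
    re.MULTILINE,
)


def hyphenate_field_names(bibtex_text):
    """Replace spaces inside field names with '-' (hypothetical helper)."""

    def fix(match):
        # Collapse each run of spaces in the field name into one hyphen,
        # and keep the "= {" part exactly as it was.
        name = re.sub(r" +", "-", match.group(1))
        return name + match.group(2)

    return FIELD_NAME.sub(fix, bibtex_text)
```

The cleaned text could then be passed to bibtexparser.loads instead of loading the file directly. Indented continuation lines, such as those in the abstract, are not matched because the pattern is anchored at the start of a line.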

tomleung1996 commented 5 years ago

Thanks @omangin, I have already solved it by replacing the spaces with '-'. However, I also noticed that bibtexparser tends to suppress exceptions. For example, I wrote some customizations containing bugs that I didn't notice, and the program did not report any error. I only found the errors when I parsed the same content manually from the Web of Science plain-text format.

omangin commented 5 years ago

Yes, that is possible. Depending on which description of the BibTeX format you follow, the definition of a comment can be pretty wide. A lot of issues can fall into that category and go unnoticed, although I am not sure that this is what happened in your case.

Do you have any specific examples of wrong formatting that is ignored? That would help us improve the reporting.
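
One way to collect such examples is to compare the entry keys found in the raw file with the keys the parser returned. This is a rough sketch, assuming each entry header sits at the start of a line as `@type{key,`. `find_missing_keys` is a hypothetical helper, and the parsed IDs are assumed to come from the 'ID' value of each item in bib_db.entries.

```python
import re

# Assumption: every entry header starts a line and looks like "@type{ key,".
ENTRY_HEADER = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)

# Block types that are not bibliography entries.
NON_ENTRY_TYPES = {"comment", "string", "preamble"}


def find_missing_keys(bibtex_text, parsed_ids):
    """Return keys present in the raw text but absent from parsed_ids."""
    parsed = set(parsed_ids)
    missing = []
    for entry_type, key in ENTRY_HEADER.findall(bibtex_text):
        if entry_type.lower() in NON_ENTRY_TYPES:
            continue
        if key not in parsed:
            missing.append(key)
    return missing
```

Any key this returns points at an entry that was dropped without an error, which makes it easier to isolate the offending formatting.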

MiWeiss commented 1 year ago

Duplicate of #211