inspirehep / inspire

Official repo of the legacy INSPIRE-HEP overlay
http://projecthepinspire.net

bibcheck: updated rule: fix_ref_jhep_volume #489

Closed ksachs closed 4 years ago

ksachs commented 4 years ago

Signed-off-by: Kirsten Sachs <sachs@l00lnxkaos.desy.de>

ksachs commented 4 years ago

Sorry, this is even more complex than before. I think I found most of the bugs, but you never know with dirty metadata. The code does what it is supposed to do, but if you want to test it anyway ...

It will run once a week with --ticket-creation-policy=per-rule. Just replace the plugin; we keep the current rule [KAOS_fix_ref.JCAP_JHEP].

The question is which records to run it on, i.e. whether you want to spend the time on searching or on browsing through references. On test, a bibcheck for 999C5s:JHEP* took 15 minutes or so just to get the records. In principle we need:

999C5s:JCAP,20*
999C5s:JCAP,1313*
999C5s:JCAP,1414*
999C5s:JCAP,1515*
999C5s:JCAP,1616*
999C5s:JCAP,1717*
999C5s:JCAP,1818*
999C5s:JCAP,1919*
999C5s:JHEP,20*
999C5s:JHEP,1313*
999C5s:JHEP,1414*
999C5s:JHEP,1515*
999C5s:JHEP,1616*
999C5s:JHEP,1717*
999C5s:JHEP,1818*
999C5s:JHEP,1919*
999C5s:J.Stat.Mech.,20*
999C5s:J.Stat.Mech.,1313*
999C5s:J.Stat.Mech.,1414*
999C5s:J.Stat.Mech.,1515*
999C5s:J.Stat.Mech.,1616*
999C5s:J.Stat.Mech.,1717*
999C5s:J.Stat.Mech.,1818*
999C5s:J.Stat.Mech.,1919*
999C5s:/^PTEP.*,\d\d\d$/
999C5s:"MDPI Physics,*"
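
The list above is regular enough to be generated rather than typed by hand. A minimal sketch, assuming the patterns are only needed as plain strings (the names `JOURNALS`, `VOLUME_PREFIXES` and `build_patterns` are hypothetical and not part of the plugin; nothing here runs a search):

```python
# Hypothetical helper: rebuilds the 26 search patterns listed above as strings.
JOURNALS = ["JCAP", "JHEP", "J.Stat.Mech."]

# "20*" plus the doubled two-digit prefixes 1313* ... 1919* from the list.
VOLUME_PREFIXES = ["20"] + ["%d%d" % (yy, yy) for yy in range(13, 20)]


def build_patterns():
    """Return the search patterns in the same order as the list above."""
    patterns = [
        "999C5s:%s,%s*" % (journal, prefix)
        for journal in JOURNALS
        for prefix in VOLUME_PREFIXES
    ]
    # The two special cases are copied verbatim from the list.
    patterns.append(r"999C5s:/^PTEP.*,\d\d\d$/")
    patterns.append('999C5s:"MDPI Physics,*"')
    return patterns


if __name__ == "__main__":
    for pattern in build_patterns():
        print(pattern)
```

Keeping the journals and prefixes in two short lists would make it easy to add another year without editing 24 lines.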

tsgit commented 4 years ago

@ksachs there is a bug

get_citation_for_PTEP() does not define a default for year, so it can end up using undefined year

2020-02-14 13:42:06 --> Unexpected error occurred: local variable 'year' referenced before assignment.
2020-02-14 13:42:06 --> Traceback is:
2020-02-14 13:42:07 --> * 2020-02-14 13:42:06 -> UnboundLocalError: local variable 'year' referenced before assignment (fix_ref_jhep_volume.py:287:get_citation_for_PTEP)

2020-02-14 13:42:07 --> Frame get_citation_for_PTEP in /scratch/venvs/invenio-legacy/lib/python/invenio/bibcheck_plugins/fix_ref_jhep_volume.py at line 287
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->        284     elif ref_year:
2020-02-14 13:42:07 -->        285         year = ref_year
2020-02-14 13:42:07 -->        286 
2020-02-14 13:42:07 --> ---->  287     search_string = r'PTEP[^A-Za-z]*%s[^A-Za-z]*[,: ](\d{2,3}[A-Z]\d{2,3})' % year
2020-02-14 13:42:07 -->        288     search_res = re.search(search_string, text)
2020-02-14 13:42:07 -->        289     if search_res:
2020-02-14 13:42:07 -->        290         true_artid = search_res.group(1)
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->                     ref_year =  "''"
2020-02-14 13:42:07 -->               recid_citation =  'None'
2020-02-14 13:42:07 -->                         text =  "' $$HS  MIZOGUCHI AND M  YATA $$M01 (2013) $$O51'"
2020-02-14 13:42:07 -->                      journal =  "'PTEP'"
2020-02-14 13:42:07 -->                       volume =  "'5'"
2020-02-14 13:42:07 -->                      ref_pbn =  "{'y': '', 'p': 'PTEP', 'c': '053', 'v': '5'}"
2020-02-14 13:42:07 -->                        debug =  'False'
2020-02-14 13:42:07 -->                      pubnote =  "''"
2020-02-14 13:42:07 --> Frame get_citation_from_pubnote in /scratch/venvs/invenio-legacy/lib/python/invenio/bibcheck_plugins/fix_ref_jhep_volume.py at line 350
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->        347     """
2020-02-14 13:42:07 -->        348 
2020-02-14 13:42:07 -->        349     if bug_type == "PTEP":
2020-02-14 13:42:07 --> ---->  350         recid_citation, pubnote = get_citation_for_PTEP(ref_pbn, text, debug)
2020-02-14 13:42:07 -->        351     elif bug_type == 'JHEP':
2020-02-14 13:42:07 -->        352         recid_citation, pubnote = get_citation_for_JHEP(ref_pbn, text, debug)
2020-02-14 13:42:07 -->        353     else:
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->                        debug =  'False'
2020-02-14 13:42:07 -->                         text =  "' $$hS. Mizoguchi and M. Yata $$m01 (2013) $$o51'"
2020-02-14 13:42:07 -->                     bug_type =  "'PTEP'"
2020-02-14 13:42:07 -->                      ref_pbn =  "{'y': '', 'p': 'PTEP', 'c': '053', 'v': '5'}"
2020-02-14 13:42:07 --> Frame check_record in /scratch/venvs/invenio-legacy/lib/python/invenio/bibcheck_plugins/fix_ref_jhep_volume.py at line 539
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->        536                 confirmation_reason = 'RepNo'
2020-02-14 13:42:07 -->        537             if not recid_citation:
2020-02-14 13:42:07 -->        538                 recid_citation, pubnote_from_rawref = \
2020-02-14 13:42:07 --> ---->  539                     get_citation_from_pubnote(ref_pbn, bug_type, reference['subfields_text'])
2020-02-14 13:42:07 -->        540                 confirmation_reason = 'PubNote'
2020-02-14 13:42:07 -->        541             if not recid_citation:
2020-02-14 13:42:07 -->        542                 recid_citation, confirmation_reason = \
2020-02-14 13:42:07 --> -------------------------------------------------------------------------------
2020-02-14 13:42:07 -->                      tickets =  'True'
2020-02-14 13:42:07 -->               recid_citation =  'None'
2020-02-14 13:42:07 -->                    reference =  "{'subfield_0': '1204476', 'mark_line': '$$01204476$$9CURATOR$$hS. Mizoguchi and M. Yata$$m01 (2013)$$o51$$sPTEP,5,053', 'doi': '', 'subfields_text': ' $$hS. Mizoguchi and M. Yata $$m01 (2013) $$o51', 'repno': '', 'year': '', 'position_pbn': 5, 'subfield_pbn': 'PTEP,5,053', 'curator': 'C'}"
2020-02-14 13:42:07 -->          pubnote_from_rawref =  "''"
2020-02-14 13:42:07 -->                      ref_pbn =  "{'y': '', 'p': 'PTEP', 'c': '053', 'v': '5'}"
2020-02-14 13:42:07 -->                       record =  "{'595': [([('9', 'CERN'), ('a', 'CDS-1561297')], ' ', ' ', '', 20)], '773': [([('c', '055016'), ('n', '5'), ('p', 'Phys.Rev.'), ('v', 'D88'), ('y', '2013')], ' ', ' ', '', 37)], '300': [([('a', '15')], ' ', ' ', '', 14)], '999': [([('0', '140422'), ('h', 'P. Ramond'), ('m', '(Sanibel Symposia, 1979), reissued as'), ('o', '1'), ('r', 'hep-ph/9809459'), ('t', 'The Family Group in Grand Unified Theories')], 'C', '5', '', 45), ([('0', '140392'), ('h', 'H. Georgi'), ('o', '2'), ('s', 'Nucl.Phys.,B15 [...]
2020-02-14 13:42:07 -->                         m999 =  "([('0', '1204476'), ('9', 'CURATOR'), ('h', 'S. Mizoguchi and M. Yata'), ('m', '01 (2013)'), ('o', '51'), ('s', 'PTEP,5,053')], 'C', '5', '', 97)"
2020-02-14 13:42:07 -->                     bug_type =  "'PTEP'"
2020-02-14 13:42:07 -->                        fuzzy =  'False'
2020-02-14 13:42:07 -->                        recid =  "'1242133'"
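
In the traceback above, ref_year is '' so the `elif ref_year:` branch does not assign `year`, and line 287 then fails. A minimal sketch of one possible guard, not the actual plugin code: `find_ptep_artid` and `pubnote_year` are hypothetical names, and the branch before `elif ref_year:` is not visible in the traceback, so it is assumed here. Only the regular expression is copied from line 287.

```python
import re


def find_ptep_artid(text, ref_year="", pubnote_year=""):
    """Hypothetical stand-in for the year handling in get_citation_for_PTEP()."""
    year = None  # default, so the name is always bound
    if pubnote_year:  # assumed first branch; not shown in the traceback
        year = pubnote_year
    elif ref_year:
        year = ref_year

    if not year:
        # No year available: skip the search instead of raising UnboundLocalError.
        return None

    # Same pattern as line 287 of fix_ref_jhep_volume.py in the traceback.
    search_string = r'PTEP[^A-Za-z]*%s[^A-Za-z]*[,: ](\d{2,3}[A-Z]\d{2,3})' % year
    search_res = re.search(search_string, text)
    return search_res.group(1) if search_res else None
```

With the values from the failing reference (`text` as logged, `ref_year=''`) this sketch returns None rather than raising. Whether the real function should return early or fall back to some other year source is a design decision for the plugin author.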

ksachs commented 4 years ago

Oh, man!

tsgit commented 4 years ago

Even without the fuzzy option this will change 7930 references. Looks OK, though. Log attached:

kaos_nonfuz.log

tsgit commented 4 years ago

looks like the fuzzy option will change more than 15k records

tsgit commented 4 years ago

make that over 22k changes with fuzzy