Closed michamos closed 4 years ago
Why is this pull request against "common_ancestor"?
Because we have two branches now, one for Python 2 and one for Python 3. The easiest way to apply the same commit to both branches is to apply it to their latest common ancestor, then merge it into each branch. If you instead develop against, say, master, and then rebase (or cherry-pick, which is essentially the same) on top of the other branch, you are essentially applying a patch, so you lose the benefit of a 3-way merge and may have to resolve many more conflicts manually.
I didn't look at the code or the results. But:
splitting at `;` should already be done by refextract. If this is correct, the collected parts belong to the same citation (record).
If it is not correct, what is your algorithm to assign the authors? (Sorry, I'm too lazy to read the code, and I have a hard time understanding this dialect.)
> splitting at `;` should already be done by refextract.

I completely rewrote it and at first didn't add splitting on `;`. I later put it back.
> If it is not correct, what is your algorithm to assign the authors.

When splitting on repeated fields, the issue is that the split happens after the authors, so extra work is needed. My approach is as follows (conceptually; in the code it's a bit more complex in order to avoid backtracking):
Note that author detection is not super reliable, so there might be errors in the process.
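The individual steps are not preserved in this thread, so what follows is only a minimal sketch of the idea, not the code in this PR. It assumes the parser yields `(field, value)` pairs in text order; `NON_REPEATABLE`, `split_on_repeated_fields` and `to_record` are hypothetical names. When a non-repeatable field shows up a second time, the authors seen just before it are moved to the new reference; if there are none, the previous reference's authors are copied.

```python
# Hypothetical set of fields that may appear only once per reference.
NON_REPEATABLE = {"journal_reference", "doi", "reportnumber"}


def to_record(group):
    """Turn a list of (field, value) pairs into a dict of value lists."""
    record = {}
    for field, value in group:
        record.setdefault(field, []).append(value)
    return record


def split_on_repeated_fields(elements):
    """Split parsed elements into references at repeated non-repeatable fields."""
    groups = [[]]
    for field, value in elements:
        current = groups[-1]
        if field in NON_REPEATABLE and any(f == field for f, _ in current):
            # Authors directly before the repeated field belong to the
            # new reference, so take them back from the current one.
            trailing = []
            while current and current[-1][0] == "author":
                trailing.insert(0, current.pop())
            if not trailing:
                # No new authors: assume the new reference shares the old ones.
                trailing = [(f, v) for f, v in current if f == "author"]
            current = trailing
            groups.append(current)
        current.append((field, value))
    return [to_record(group) for group in groups if group]


if __name__ == "__main__":
    parsed = [
        ("author", "K. Hencken, D. Trautmann and G. Baur"),
        ("journal_reference", "Phys.Rev.A 49 (1994) 1584"),
        ("journal_reference", "Phys.Rev.A 51 (1995) 1874"),
    ]
    for record in split_on_repeated_fields(parsed):
        print(record)
```

Since author detection is not fully reliable, a misdetected author next to the split point would end up in the wrong reference with this approach too.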
I looked at differences in output of some 50 more or less randomly selected PDFs. This appears to perform as advertised and doesn't do worse than previously. Over 50% of records I analysed had substantial changes.
Because a large fraction of records has split references, the texkey handling is more or less sidelined (texkeys are not used when linemarker count and ref-count after splitting don't match).
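The guard described in the parentheses can be sketched roughly as below. This is an assumption about the shape of the data, not the actual implementation: `attach_texkeys` is a hypothetical helper, and texkeys are assumed to arrive as a list with one entry per linemarker, in the same order as the references.

```python
def attach_texkeys(references, texkeys):
    """Attach texkeys positionally, but only when the counts agree.

    `references` is a list of dicts (after splitting); `texkeys` is a
    list with one key per linemarker. Both shapes are assumptions.
    """
    if len(texkeys) != len(references):
        # A split produced extra references, so positional matching
        # would pair keys with the wrong records: skip texkeys entirely.
        return references
    return [dict(ref, texkey=[key]) for ref, key in zip(references, texkeys)]
```

With the split reference [43] shown further down, one linemarker yields two records, so the counts differ and no texkey would be attached.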
Since you rewrote splitting on `;`, I wonder about the trailing `;` in the DOI for https://arXiv.org/pdf/2006.06206.pdf:
{'linemarker': ['43'],
'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
'author': ['K. Hencken, D. Trautmann and G. Baur'],
'journal_title': ['Phys.Rev.A'],
'journal_volume': ['49'],
'journal_year': ['1994'],
'journal_page': ['1584'],
'journal_reference': ['Phys.Rev.A 49 (1994) 1584'],
'doi': ['doi:10.1103/PhysRevA.49.1584;'],
'year': ['1994']},
{'linemarker': ['43'],
'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
'author': ['K. Hencken, D. Trautmann and G. Baur'],
'journal_title': ['Phys.Rev.A'],
'journal_volume': ['51'],
'journal_year': ['1995'],
'journal_page': ['1874'],
'journal_reference': ['Phys.Rev.A 51 (1995) 1874'],
'doi': ['doi:10.1103/PhysRevA.51.1874'],
'reportnumber': ['nucl-th/9410014'],
'year': ['1995']},
It's not worse than previously, though.
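One possible way to handle the trailing `;` is a small post-processing step on the extracted DOI value. The sketch below is not part of this PR; `strip_trailing_separator` is a hypothetical helper, and it assumes that a `;`, `,` or `.` at the very end of a DOI is a separator rather than part of the identifier, which holds in practice but is not guaranteed by the DOI syntax.

```python
def strip_trailing_separator(doi):
    """Drop whitespace and separator punctuation from the end of a DOI string."""
    # Assumption: a DOI never legitimately ends in ';', ',' or '.'.
    return doi.rstrip(" \t;,.")


if __name__ == "__main__":
    print(strip_trailing_separator("doi:10.1103/PhysRevA.49.1584;"))
```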
Also, in
https://s3.cern.ch/inspire-prod-files-e/e8eb855831513772d7f17a14eeaeb001
there is a clean split at ';' in ref [5], yet the new code adds
'author': ['V.A. Belinskii, I.M. Khalatnikov, M.P. Ryan'],
to
'misc': ['published as Secs. 1 and 2 in M. P. Ryan]'
Those are minor gripes. Basically, some previous shortcomings lead to slightly different shortcomings with the new code, which makes comparing results tedious. I have not come across major issues, though.
So LGTM, ready to merge.
Besides splitting on `;` and on repeated DOIs, this will additionally split on any repeated non-repeatable field. It also goes to some effort to assign the correct authors to the reference.
About splitting on `;`: this is a common (but not universal) convention for separating multiple references. I left it out at first, but then encountered a case where the first reference had only an arXiv eprint and the second reference had only journal info, so no field was repeated (see tests). As it's better to split references when it's not needed than to not split when it's needed, I put `;` handling back. Note that there will be cases where the separator is not a `;` and no fields are repeated, which will lead to wrong results (a mixed reference citing several records), but I don't think there's much we can do with `refextract`-level technology.
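As a rough illustration of the `;` convention (not the code in this PR), the raw text can be cut at each `;` that is followed by whitespace before the parts are parsed separately. `split_on_semicolon` is a hypothetical name, and requiring whitespace after the `;` is an assumption made here to avoid cutting inside URLs or identifiers.

```python
import re


def split_on_semicolon(raw_ref):
    """Split a raw reference string at ';' followed by whitespace."""
    parts = re.split(r";\s+", raw_ref)
    # Drop empty pieces, e.g. from a ';' right before trailing text.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    raw = (
        "[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). "
        "doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) "
        "doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014]."
    )
    for part in split_on_semicolon(raw):
        print(part)
```

Note that the second part carries no authors of its own, which is why the author assignment described above is still needed after such a split.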