Improve reference splitting

michamos commented 4 years ago

In addition to splitting references on ; and on repeated DOIs, this splits on any repeated non-repeatable field. It also goes to some effort to assign the correct authors to each reference.
I'm not so happy about hardcoding ;. It is a common (but not universal) convention for separating multiple references. I left it out at first, but then encountered a case where the first reference had only an arXiv eprint and the second had only journal info, so no field was repeated (see tests). Since it's better to split references when it isn't needed than to fail to split when it is, I put the ; handling back. Note that there will be cases where the separator is not a ; and no fields are repeated, which will lead to wrong results (a mixed reference citing several records), but I don't think there's much we can do about that with refextract-level technology.
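A minimal sketch of the splitting rule described above, not the actual refextract code. It assumes the reference has already been tokenized into (field, value) pairs; the `NON_REPEATABLE` set, the `"separator"` field name and the `split_reference` function are hypothetical names chosen for illustration.

```python
# Hypothetical set of fields that may occur only once per reference.
NON_REPEATABLE = {"doi", "journal", "eprint"}


def split_reference(elements):
    """Split a flat list of (field, value) pairs into separate references.

    A new reference starts at a ";" separator, or as soon as a
    non-repeatable field shows up a second time.
    """
    references, current = [], {}
    for field, value in elements:
        if field == "separator":
            # Hardcoded ";" convention: close the current reference.
            if value == ";" and current:
                references.append(current)
                current = {}
            continue
        if field in NON_REPEATABLE and field in current:
            # Repeated non-repeatable field: close the current reference.
            references.append(current)
            current = {}
        current.setdefault(field, []).append(value)
    if current:
        references.append(current)
    return references
```

With this sketch, an eprint-only reference followed by a journal-only reference is split only when the `;` separator is present, which is the case that motivated keeping the hardcoded separator.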
INSPIR-3525

tsgit commented 4 years ago

why is this pull request against "common_ancestor"?

michamos commented 4 years ago

because we have two branches now, one for Python 2 and one for Python 3. The easiest way to apply the same commit to both branches is to apply it to their latest common ancestor, then merge it into each branch. If you instead develop against, say, master, then rebase (or cherry-pick, which is essentially the same) on top of the other branch, you're essentially applying a patch, so you lose the benefit of a 3-way merge and potentially have to resolve a lot more conflicts manually.

ksachs commented 4 years ago

I didn't look at the code or the results. But: splitting at ; should already be done by refextract. If this is correct, the collected parts belong to the same citation (record). If it is not correct, what is your algorithm to assign the authors? (Sorry, I'm too lazy to read the code, and I have a hard time understanding this dialect.)

michamos commented 4 years ago

splitting at ; should already be done by refextract.

I completely rewrote it and at first didn't add splitting on ;. I later put it back.

If it is not correct, what is your algorithm to assign the authors?

When splitting on repeated fields, the issue is that the split happens after the authors, so extra work is needed. My approach is as follows (conceptually; in the code it's a bit more complex in order to avoid backtracking):

For any citation, if it has no authors, or it has been split due to a repeated field (in which case the author probably belongs to the next citation, not the current one), try to get authors from the previous citation. Otherwise do nothing.
In the previous citation, if there is exactly one author, copy it to the current one; if there's more than one, move it instead.

Note that author detection is not super reliable, so there might be errors in the process.
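The two rules above can be sketched roughly as follows. This is a conceptual illustration only, not the code in this pull request (which avoids backtracking). It assumes each citation is a dict with an optional `author` list; the `split_on_repeat` flag and the `assign_authors` name are hypothetical, and "move it" is read here as moving the last author entry, which is an assumption.

```python
def assign_authors(citations):
    """Fill in authors following the two conceptual rules.

    Each citation is a dict with an optional "author" list and a
    hypothetical "split_on_repeat" flag, set when the citation exists
    because a non-repeatable field was repeated.
    """
    for previous, current in zip(citations, citations[1:]):
        needs_authors = (
            not current.get("author") or current.get("split_on_repeat")
        )
        prev_authors = previous.get("author", [])
        if not needs_authors or not prev_authors:
            continue  # rule 1: otherwise do nothing
        if len(prev_authors) == 1:
            taken = prev_authors[0]  # rule 2: exactly one author, copy it
        else:
            # rule 2: several authors, move one (assumed: the last one)
            taken = prev_authors.pop()
        current["author"] = [taken] + current.get("author", [])
    return citations
```

Because the citations are processed in order and updated in place, an author copied into one citation is available to the next citation as well.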

tsgit commented 4 years ago

I looked at differences in output of some 50 more or less randomly selected PDFs. This appears to perform as advertised and doesn't do worse than previously. Over 50% of records I analysed had substantial changes.

Because a large fraction of records have split references, the texkey handling is more or less sidelined (texkeys are not used when the linemarker count and the reference count after splitting don't match).

Since you rewrote splitting on ;, I wonder about the trailing ; in the first DOI for https://arXiv.org/pdf/2006.06206.pdf

{'linemarker': ['43'],
  'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
  'author': ['K. Hencken, D. Trautmann and G. Baur'],
  'journal_title': ['Phys.Rev.A'],
  'journal_volume': ['49'],
  'journal_year': ['1994'],
  'journal_page': ['1584'],
  'journal_reference': ['Phys.Rev.A 49 (1994) 1584'],
  'doi': ['doi:10.1103/PhysRevA.49.1584;'],
  'year': ['1994']},
 {'linemarker': ['43'],
  'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
  'author': ['K. Hencken, D. Trautmann and G. Baur'],
  'journal_title': ['Phys.Rev.A'],
  'journal_volume': ['51'],
  'journal_year': ['1995'],
  'journal_page': ['1874'],
  'journal_reference': ['Phys.Rev.A 51 (1995) 1874'],
  'doi': ['doi:10.1103/PhysRevA.51.1874'],
  'reportnumber': ['nucl-th/9410014'],
  'year': ['1995']},

it's not worse than previously, though.
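One possible way to deal with the trailing ; in the first DOI above would be a small post-processing step. The sketch below is only an illustration: `clean_doi` is a hypothetical helper, not something refextract provides, and stripping trailing `;` and `,` is a heuristic.

```python
def clean_doi(doi):
    """Strip surrounding whitespace and trailing ";" or "," from a DOI string."""
    return doi.strip().rstrip(";,")


# The DOI values from the output above.
print(clean_doi("doi:10.1103/PhysRevA.49.1584;"))
print(clean_doi("doi:10.1103/PhysRevA.51.1874"))
```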

also

https://s3.cern.ch/inspire-prod-files-e/e8eb855831513772d7f17a14eeaeb001

there is a clean split at ';' in ref [5], yet new code adds

'author': ['V.A. Belinskii, I.M. Khalatnikov, M.P. Ryan'],
to
'misc': ['published as Secs. 1 and 2 in M. P. Ryan]'

those are minor gripes. Basically some previous shortcomings lead to slightly different shortcomings with the new code, which makes comparing results tedious. I have not come across major issues, though.

So LGTM, ready to merge.

inspirehep / refextract

Improve reference splitting #78