inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

Improve reference splitting #78

Closed michamos closed 4 years ago

michamos commented 4 years ago
tsgit commented 4 years ago

Why is this pull request against "common_ancestor"?

michamos commented 4 years ago

Because we now have two branches, one for Python 2 and one for Python 3. The easiest way to apply the same commit to both branches is to apply it to their latest common ancestor, then merge it into each of them. If you instead develop against, say, master, and then rebase (or cherry-pick, which is essentially the same) on top of the other branch, you are effectively applying a patch, so you lose the benefit of a 3-way merge and may have to resolve many more conflicts manually.

ksachs commented 4 years ago

I didn't look at the code or the results. But: splitting at ; should already be done by refextract. If this is correct, the collected parts belong to the same citation (record). If it is not correct, what is your algorithm to assign the authors? (Sorry, I'm too lazy to read the code, and I have a hard time understanding this dialect.)

michamos commented 4 years ago

splitting at ; should already be done by refextract.

I completely rewrote it and at first didn't add splitting on ;. I later put it back.

If it is not correct, what is your algorithm to assign the authors.

When splitting on repeated fields, the issue is that the split happens after the authors, so extra work is needed. My approach is as follows (conceptually; in the code it's a bit more complex in order to avoid backtracking):

  1. For any citation, if there are no authors, or it has been split due to a repeated field (in which case the author probably belongs to the next citation, not the current one), try to get authors from the previous citation. Otherwise do nothing.
  2. In the previous citation, if there is exactly one author, copy it to the current one; if there is more than one, move it instead.

Note that author detection is not super reliable, so there might be errors in the process.
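
The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this pull request: the function name `propagate_authors`, the `split_on_repeated_field` flag, and the choice to move the last author of the previous citation are all assumptions made for the example.

```python
def propagate_authors(citations):
    """Fill in authors for citations that lack their own.

    `citations` is a list of dicts in document order. Each dict may have an
    'author' list and a hypothetical 'split_on_repeated_field' flag that is
    True when the citation was cut off by a repeated field.
    """
    for previous, current in zip(citations, citations[1:]):
        # Step 1: only act if the current citation has no authors, or if it
        # was split on a repeated field (its own author likely belongs to
        # the next citation).
        needs_authors = (
            not current.get("author")
            or current.get("split_on_repeated_field", False)
        )
        if not needs_authors:
            continue

        donors = previous.get("author", [])
        if len(donors) == 1:
            # Step 2a: exactly one author, so copy it.
            inherited = donors[0]
        elif len(donors) > 1:
            # Step 2b: more than one author, so move one (assumed here to
            # be the last one, which sits right before the split point).
            inherited = donors.pop()
        else:
            continue

        # Put the inherited author first so that any trailing author stays
        # last and can be moved on to the next citation in turn.
        current["author"] = [inherited] + current.get("author", [])
    return citations
```

Processing the list in order means a citation that has just received an author can itself hand one on to its successor, which is how a single pass avoids going back over earlier citations in this sketch.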

tsgit commented 4 years ago

I looked at differences in output of some 50 more or less randomly selected PDFs. This appears to perform as advertised and doesn't do worse than previously. Over 50% of records I analysed had substantial changes.

Because a large fraction of records has split references, the texkey handling is more or less sidelined (texkeys are not used when the linemarker count and the ref count after splitting don't match).

Since you rewrote splitting on ;, I wonder about the trailing ; in the DOI for https://arXiv.org/pdf/2006.06206.pdf

{'linemarker': ['43'],
  'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
  'author': ['K. Hencken, D. Trautmann and G. Baur'],
  'journal_title': ['Phys.Rev.A'],
  'journal_volume': ['49'],
  'journal_year': ['1994'],
  'journal_page': ['1584'],
  'journal_reference': ['Phys.Rev.A 49 (1994) 1584'],
  'doi': ['doi:10.1103/PhysRevA.49.1584;'],
  'year': ['1994']},
 {'linemarker': ['43'],
  'raw_ref': ['[43] K. Hencken, D. Trautmann and G. Baur, Phys. Rev. A 49, 1584 (1994). doi:10.1103/PhysRevA.49.1584; Phys. Rev. A 51, 1874 (1995) doi:10.1103/PhysRevA.51.1874 [nucl-th/9410014].'],
  'author': ['K. Hencken, D. Trautmann and G. Baur'],
  'journal_title': ['Phys.Rev.A'],
  'journal_volume': ['51'],
  'journal_year': ['1995'],
  'journal_page': ['1874'],
  'journal_reference': ['Phys.Rev.A 51 (1995) 1874'],
  'doi': ['doi:10.1103/PhysRevA.51.1874'],
  'reportnumber': ['nucl-th/9410014'],
  'year': ['1995']},

It's not worse than previously, though.
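
For illustration, a trailing separator like the ; in the first 'doi' value above could be dropped in a small post-processing step. This is a minimal sketch, not something refextract is known to do: the helper name `strip_trailing_separators` and the set of characters treated as separators are assumptions, and a DOI that legitimately ends in one of those characters would be truncated by it.

```python
# Characters assumed to be citation punctuation rather than part of the DOI.
TRAILING_SEPARATORS = ";,."


def strip_trailing_separators(doi):
    """Remove surrounding whitespace and trailing separator characters."""
    return doi.strip().rstrip(TRAILING_SEPARATORS)


# Usage with the two values from the output above.
first = strip_trailing_separators("doi:10.1103/PhysRevA.49.1584;")
second = strip_trailing_separators("doi:10.1103/PhysRevA.51.1874")
```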

Also:

https://s3.cern.ch/inspire-prod-files-e/e8eb855831513772d7f17a14eeaeb001

There is a clean split at ';' in ref [5], yet the new code adds

'author': ['V.A. Belinskii, I.M. Khalatnikov, M.P. Ryan'],
to
'misc': ['published as Secs. 1 and 2 in M. P. Ryan]'

Those are minor gripes. Basically, some previous shortcomings lead to slightly different shortcomings with the new code, which makes comparing results tedious. I have not come across major issues, though.

So LGTM, ready to merge.