Open ksachs opened 6 years ago
recids: 650205 and 1390184 - The erratum problem is well known and will be addressed in the coming days.
recids: 642540 and 652597 - Interesting case of Labs reference matcher failure, where Legacy works fine. A fix for this has already been implemented.
recids: 1223326 and 1495903 - Not a problem with either Labs or Legacy; people are citing the records wrongly. The two records are very similar, and people include the publication information for 1223326 but the arXiv number for 1495903. Since the Labs reference matcher runs the arXiv query first, it makes the wrong association. This is neither a matcher nor a metadata problem, just people citing wrongly in their papers.
recids: 495427 and 488774 - Seems like a metadata problem. For example, this is a reference from record 616883: Viana, P. T. P. & Liddle, A. R. 1999, MNRAS, 303, 535. In the metadata for this reference it wrongly gets the arxiv_eprint "astro-ph/9902245" (which points to 495427), and Labs matches on that, which is correct and expected behaviour. However, the actual reference above has the arXiv number astro-ph/9803244 (which points to 488774), and Legacy points to that. So this is a metadata problem.
recids: 592753 and 1359496 - I checked 4-5 conflicting records, and it seems that Labs gets it right while Legacy doesn't.
recids: 712900 and 728159 - Problem with metadata and most likely an issue with refextract. For example, for record 749213 the reference is: Warren, S. J., et al. 2007, MNRAS, 375, 213. The journal_volume is clearly 375 (which points to 728159), but the metadata contains journal_volume 372 (which points to 712900). The arXiv number, however, is correct, so Labs is able to identify and match it correctly, while Legacy doesn't.
Error in metadata. For example, 1085403 cites 394902 as: J. Duflo and A. P. Zuker, Phys. Rev. C 52, R23 (1995). The publication_info for this reference is correct if we look at the metadata for references in 1085403. However, the same reference has a wrong arXiv id: "nucl-th/9404019" (points to 37844). It should be nucl-th/9505011 (points to 394902). In this case Labs is doing what is expected of it, since it first tries to match using the arXiv id. The arXiv id just points to the wrong record.
recids: 578661 and 576703 - Same problem as above. Wrong associated arXiv id.
recids: 1362558 and 526861 - Same issue as above. Wrong arxiv in references.
recids: 577237 and 621602 - Labs gets it right from the publication info while Legacy doesn't.
recids: 658548 and 682442 - Metadata issues. Both articles have the same start page in the same journal, same volume, same year. That can't be right.
recids: 631452 and 628449 - The wrong arxiv associated with references issue.
recids: 47300 and 1591665 - Quite an interesting problem. Both records lie on the same page of the same journal, much like an erratum, so their publication info is the same. That's why Legacy usually gets it wrong, while Labs can match them correctly via the DOIs. This is an interesting problem in general: for such records we will always need some other information, such as the arXiv id or DOI, to distinguish them. It would be impossible to distinguish them using just the publication info with our current code, on both Legacy and Labs.
recids: 448419 and 1620629 - Same arXiv problem. I discussed it with Micha, and he mentions that some records, especially similarly named ones, may have this problem, as it arose during the migration from SPIRES to INSPIRE. On our end we can't do much about it, and I'm not sure changes in metadata would help. On the other hand, these records are also crazy and have lots of problems. For example, if I check 448419 on the web, ScienceDirect gives very different metadata for the article than we have (https://www.sciencedirect.com/science/article/pii/S0920563298001315): Volume 67, Issues 1–3, July 1998, Pages 225-250, while we have: "journal_title": "Nucl.Phys.B Proc.Suppl.", "journal_volume": "68", "page_end": "54", "page_start": "28", "parent_recid": 481913, "parent_record": { "$ref": "http://labs.inspirehep.net/api/literature/481913" }, "year": 1998. Note the journal_volume and page_start! Secondly, people have been citing these records like crazy. For example, 507980 cites it as: R. Dijkgraaf, E. Verlinde and H. Verlinde, Notes on Matrix and Micro Strings, Nucl. Phys. B (Proc. Suppl.) 62 (1998) 348, hep-th/9709107. The paper title, authors, and arXiv id correspond to each other, but the publication_info is for the other record! Similarly, in another reference: R. Dijkgraaf, E. Verlinde and H. Verlinde, Nucl. Phys. B500 (1997) 43 (hepth/9703030); Nucl. Phys. Proc. Suppl. 62 (1988) 348 (hep-th/9709107). These records are a bit frustrating, and I am not sure what we can do about them.
recids: 644038 and 630209 - Labs gets it right, while Legacy doesn't, for no apparent reason. Publication_info, DOI, and arXiv id all seem correct. A plausible explanation is that 630209 comes immediately after 644038 in the same journal. But again, the metadata seems correct, and I can't figure out why Legacy is getting it wrong. In any case, Labs handles it fine.
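Several of the pairs above (for instance 47300 and 1591665) share the same publication info, so no pubnote-only match can separate them. A minimal sketch of how such pairs could be flagged, assuming records are available as plain dicts: the function name `ambiguous_pubinfo` is hypothetical, and the journal, volume, and page values for 47300 and 1591665 are placeholders, not the real metadata. Only the 728159 values come from the MARC shown in this thread.

```python
from collections import defaultdict

def ambiguous_pubinfo(records):
    """Group recids by (journal, volume, start page) and keep groups with >1 record."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["journal_title"], rec["journal_volume"], rec["page_start"])
        groups[key].append(rec["recid"])
    return {key: recids for key, recids in groups.items() if len(recids) > 1}

# Placeholder pubinfo for 47300 / 1591665 (the real values are not quoted here).
records = [
    {"recid": 47300, "journal_title": "J.Placeholder", "journal_volume": "1", "page_start": "100"},
    {"recid": 1591665, "journal_title": "J.Placeholder", "journal_volume": "1", "page_start": "100"},
    {"recid": 728159, "journal_title": "Mon.Not.Roy.Astron.Soc.", "journal_volume": "375", "page_start": "213"},
]

clashes = ambiguous_pubinfo(records)
print(clashes)
```

Any key reported here would need a second identifier (DOI or arXiv id) on the reference before either system could match it reliably.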
Oh shit! Thanks for this analysis.
I had a look at some cases. Looks like remains of wrong merges. Undoing the wrong merge results in references with conflicting information in $$0 - $$r - $$s.
I have no idea how to clean it.
Legacy searches by journal first, Labs by eprint. We don't know a priori which is right, but this causes differences in citations. Should both systems do it equally wrong if neither can get it right?
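To make the ordering difference concrete, here is a toy sketch, not the actual matcher code of either system. The lookup tables and the `match` function are hypothetical stand-ins for the real index queries. The identifiers come from the 1223326 / 1495903 case: a reference carrying the arXiv number of 1495903 together with the pubnote of 1223326.

```python
# Hypothetical in-memory lookup tables; the real matchers query a search index.
BY_EPRINT = {"1207.4809": 1495903}
BY_PUBNOTE = {("Phys.Rev.Lett.", "110", "161801"): 1223326}

def match(reference, order):
    """Return the first recid found, trying identifier kinds in the given order."""
    for kind in order:
        if kind == "eprint":
            recid = BY_EPRINT.get(reference.get("eprint"))
        else:  # "pubnote"
            recid = BY_PUBNOTE.get(reference.get("pubnote"))
        if recid is not None:
            return recid
    return None

# A reference that mixes the eprint of one record with the pubnote of the other.
reference = {
    "eprint": "1207.4809",
    "pubnote": ("Phys.Rev.Lett.", "110", "161801"),
}

labs_like = match(reference, ("eprint", "pubnote"))    # eprint first
legacy_like = match(reference, ("pubnote", "eprint"))  # journal first
print(labs_like, legacy_like)  # 1495903 1223326
```

Both answers are "correct" for the identifier tried first, which is why the two systems end up with different citation counts for the same reference.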
3.) pdf: [20] A. A. Aguilar-Arevalo et.al. (MiniBooNE collaboration), arXiv:1207.4809. 001182207 999C5 $$01223326$$hA. A. Aguilar-Arevalo et al.$$m(MiniBooNE collaboration)$$o20$$rarXiv:1207.4809
$$0 disagrees with $$r 001223326 035 $$9arXiv$$aoai:arXiv.org:1303.2588 001223326 773 $$c161801$$pPhys.Rev.Lett.$$v110$$y2013
001495903 035__ $$9arXiv$$aoai:arXiv.org:1207.4809
================================
4) 000616883 999C5 $$rastro-ph/9902245$$sMon.Not.Roy.Astron.Soc.,303,535
wrong merge undone https://inspirehep.net/record/edit/compare_revisions?recid=495427&rev1=20161007230049&rev2=20151220183445 now 2 records
================================ 6) pdf: Dye, S., Warren, S. J., Hambly, N. C., et al., 2006, MNRAS, 372, 1227 000749213 999C5 $$rastro-ph/0610191$$sMon.Not.Roy.Astron.Soc.,372,1227
https://inspirehep.net/record/edit/compare_revisions?recid=728159&rev1=20170929133621&rev2=20160323223009 had a pubnote for this too, now separate record
000712900 037 $$9arXiv$$aastro-ph/0603608$$castro-ph 000712900 773 $$c1227-1252$$n3$$pMon.Not.Roy.Astron.Soc.$$v372$$y2006
000728159 037 $$9arXiv$$aastro-ph/0610191$$castro-ph 000728159 773 $$c213-226$$n1$$pMon.Not.Roy.Astron.Soc.$$v375$$y2007
712900 is correct, which you see only on the pdf. legacy is correct (by chance).
=============================== 7) legacy metadata: 001085403 999C5 $$rnucl-th/9404019$$sPhys.Rev.,C52,23
wrong merge undone: https://inspirehep.net/record/edit/compare_revisions?recid=37844&rev1=20150729125930&rev2=20150722141650
now 2 records
It's correct, they have different pubnote. One is a supplement: Eur.Phys.J. C39S2 (2005) 41-61 Eur.Phys.J. C39 (2005) 41-54
Proposal: search for SPIRES-style references ($$r and $$s only) with contradicting information. If $$s matches a record, delete the information in $$r (or move it to $$m). That way Labs and Legacy will use the same information.
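A rough sketch of what that cleanup could look like, assuming the 999C5 subfields are available as a `$$`-delimited string and that eprint and pubnote lookups exist as simple mappings. The names `parse_subfields`, `resolve_conflict`, `by_eprint`, and `by_pubnote` are all hypothetical. The example field is the one from record 616883.

```python
def parse_subfields(field):
    """Split a '$$x...' MARC-style string into (code, value) pairs."""
    return [(chunk[0], chunk[1:]) for chunk in field.split("$$") if chunk]

def resolve_conflict(field, by_eprint, by_pubnote):
    """For references with only $$r and $$s that resolve to different records,
    keep $$s and move the $$r value to $$m. Other fields are returned unchanged."""
    subfields = parse_subfields(field)
    if sorted(code for code, _ in subfields) != ["r", "s"]:
        return field  # not a SPIRES-style ($$r, $$s only) reference
    values = dict(subfields)
    eprint_recid = by_eprint.get(values["r"])
    pubnote_recid = by_pubnote.get(values["s"])
    if eprint_recid is None or pubnote_recid is None or eprint_recid == pubnote_recid:
        return field  # nothing to compare, or no contradiction
    return "".join("$$" + ("m" if code == "r" else code) + value
                   for code, value in subfields)

# Hypothetical lookup tables built from the records discussed in this issue.
by_eprint = {"astro-ph/9902245": 495427}
by_pubnote = {"Mon.Not.Roy.Astron.Soc.,303,535": 488774}

field = "$$rastro-ph/9902245$$sMon.Not.Roy.Astron.Soc.,303,535"
cleaned = resolve_conflict(field, by_eprint, by_pubnote)
print(cleaned)
```

Moving the value to $$m rather than deleting it keeps the original eprint visible to curators, at the cost of leaving possibly wrong text in the reference.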
For some of these I have no idea why labs finds a wrong record. Also listed in https://app.asana.com/0/3003451971699/620773521680386
Each example is given as