references: improve extracting links

inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.

GNU General Public License v2.0

references: improve extracting links #92

Closed MJedr closed 2 years ago

MJedr commented 2 years ago

Fix link extraction for two-column layouts
Deduplicate the URL list
Ref inspirehep/inspirehep#2241
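
For illustration, deduplicating the URL list could look like the sketch below. The function name `dedupe_urls` is hypothetical and is not taken from refextract; the sketch also assumes that the original order of the URLs should be preserved and that only exact string duplicates count as duplicates.

```python
def dedupe_urls(urls):
    """Return the URLs with exact duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; not the refextract implementation.
    """
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops repeats without reordering the remaining URLs.
    return list(dict.fromkeys(urls))


extracted = [
    "https://example.org/paper",
    "https://example.org/data",
    "https://example.org/paper",
]
unique_urls = dedupe_urls(extracted)
```

Using `dict.fromkeys` rather than `set` keeps the URLs in the order they appeared in the document, which may matter if later steps match them to references by position.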

michamos commented 2 years ago

Note that this doesn't completely solve inspirehep/inspirehep#2241: a duplicate can still appear when the URL gets added as a DOI to a reference whose text already contains a DOI (see the example in the issue).

MJedr commented 2 years ago

But how is that possible at this level? We add all the extracted references only to the reference field https://github.com/inspirehep/refextract/blob/8cdb6f1d37b140f3b9bd05b06b52aabaf1463e0c/refextract/references/record.py#L162-L164. And in the schemas, the builder doesn't add duplicate DOIs https://github.com/inspirehep/inspire-schemas/blob/ce8a2a6dc4d9a360aae5fe9a6fd5d8e0209fac48/inspire_schemas/builders/references.py#L290

michamos commented 2 years ago

That line in the builder is broken. We check whether the unnormalized value is present among the DOIs; we should normalize before checking.
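
A minimal sketch of the normalize-before-checking idea, not the actual inspire-schemas code: `normalize_doi` and `add_doi` are hypothetical names, and the normalization rules (strip a `doi:` or `doi.org` URL prefix, lowercase) are assumptions chosen for illustration.

```python
import re

# Assumed prefixes to strip: "doi:" and http(s)://(dx.)doi.org/ URLs.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Hypothetical normalizer: drop a known prefix and lowercase the DOI."""
    return _DOI_PREFIX.sub("", value.strip()).lower()


def add_doi(dois, value):
    """Append the DOI only if its normalized form is not already in the list."""
    normalized = normalize_doi(value)
    # Compare the normalized value, not the raw input, against stored DOIs.
    if normalized not in dois:
        dois.append(normalized)
    return dois


dois = ["10.1000/xyz123"]
add_doi(dois, "https://doi.org/10.1000/XYZ123")  # same DOI given as a URL
add_doi(dois, "doi:10.1000/abc")                 # a different DOI
```

With the check done on the normalized value, the URL form of an existing DOI is recognized as a duplicate, while a check on the raw string would have let it through.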

MJedr commented 2 years ago

OK then, I'll fix the builder too.