inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

parsers: remove duplicates from inspire categories #278

Closed miguelgrc closed 4 years ago

miguelgrc commented 4 years ago

Description

We now check for repeated values when deriving INSPIRE categories from arXiv categories, so each INSPIRE category is assigned at most once.

Related Issue

Motivation and Context

While deriving INSPIRE categories from arXiv categories, we were not checking for repeated values. This is a problem because the relationship between arXiv categories and INSPIRE categories is many-to-one (N:1). For example, both math.AP and math.DG belong to the Math and Math Physics category in INSPIRE, so a paper with both arXiv categories would have the Math and Math Physics category assigned twice. The tests also did not cover that scenario.
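A minimal sketch of the deduplication described above, not the actual hepcrawl code. The names `ARXIV_TO_INSPIRE` and `derive_inspire_categories` are hypothetical, and the mapping is an illustrative excerpt: only the two math entries come from this description, and the `hep-th` entry is an assumption added to show that distinct categories are kept.

```python
# Hypothetical excerpt of an arXiv -> INSPIRE category mapping.
# Only the two math entries come from the PR description; hep-th is assumed.
ARXIV_TO_INSPIRE = {
    "math.AP": "Math and Math Physics",
    "math.DG": "Math and Math Physics",
    "hep-th": "Theory-HEP",
}


def derive_inspire_categories(arxiv_categories):
    """Return INSPIRE categories in first-seen order, without repeats."""
    inspire_categories = []
    for arxiv_category in arxiv_categories:
        inspire_category = ARXIV_TO_INSPIRE.get(arxiv_category)
        if inspire_category is None:
            # Unknown arXiv category: skipped in this sketch.
            continue
        if inspire_category not in inspire_categories:
            inspire_categories.append(inspire_category)
    return inspire_categories


# A paper tagged with both math.AP and math.DG gets the category only once.
print(derive_inspire_categories(["math.AP", "math.DG", "hep-th"]))
```

A list with a membership check is used here instead of a plain `set` so that the order of the derived categories stays stable, which keeps test expectations deterministic.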

Checklist: