inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

loader: digest "all" possible date formats #169

Open fschwenn opened 7 years ago

fschwenn commented 7 years ago

Loader should include some normalization routine to handle dates in different formats.

Expected Behavior

Such a normalization routine would be called for each date field in the record, ensuring that the data fit the schema. For example: "2017 Sep 1" -> "2017-09-01", "2017-Sep-1" -> "2017-09-01", "2017 Sep-Oct" -> "2017", "01.09.2017" -> "2017-09-01".
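
To make the four mappings above concrete, here is a minimal sketch of such a routine using only the standard library. The name `normalize_loose_date` and the `MONTHS` table are hypothetical and not part of hepcrawl or inspire_utils. The sketch assumes English month names and that dotted numeric dates are day-first (DD.MM.YYYY).

```python
import re
from datetime import date

# Hypothetical helper for illustration only; not part of hepcrawl or inspire_utils.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def _month_number(name):
    """Map an English month name or abbreviation to 1..12, or None."""
    return MONTHS.get(name[:3].lower())


def normalize_loose_date(raw):
    """Return 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' for a few loose formats."""
    text = raw.strip()

    # Dotted numeric form, e.g. "01.09.2017" (ASSUMED to be DD.MM.YYYY).
    match = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4})", text)
    if match:
        day, month, year = (int(part) for part in match.groups())
        # date() raises ValueError for impossible dates such as 31.02.2017.
        return date(year, month, day).isoformat()

    # Year, month name, optional day: "2017 Sep 1", "2017-Sep-1", "2017 Sep".
    match = re.fullmatch(r"(\d{4})[ -]([A-Za-z]{3,9})(?:[ -](\d{1,2}))?", text)
    if match and _month_number(match.group(2)):
        year, month = int(match.group(1)), _month_number(match.group(2))
        if match.group(3) is None:
            return "%04d-%02d" % (year, month)
        return date(year, month, int(match.group(3))).isoformat()

    # Month range within one year, e.g. "2017 Sep-Oct": keep only the year.
    match = re.fullmatch(r"(\d{4})[ -]([A-Za-z]{3,9})-([A-Za-z]{3,9})", text)
    if match and _month_number(match.group(2)) and _month_number(match.group(3)):
        return match.group(1)

    raise ValueError("Unrecognized date format: %r" % raw)


for sample in ("2017 Sep 1", "2017-Sep-1", "2017 Sep-Oct", "01.09.2017"):
    print(sample, "->", normalize_loose_date(sample))
```

Raising `ValueError` on anything unrecognized, rather than guessing, is a deliberate choice in this sketch: a spider can then decide whether to drop the field or log the raw value.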

Current Behavior

I have to admit, I do not know to what extent this is already implemented in hepcrawl. In the harvesting-kit, each publisher program has its own normalization code. At DESY we have a hand-written function which tries to catch most of the cases.

Context

We will have to write a lot of spiders. It would save time if we could just map the date fields without thinking about the format.

michamos commented 7 years ago

There is now a date util, in particular normalize_date, that can be used to normalize any (incomplete) date:

In [1]: from inspire_utils.date import normalize_date

In [2]: normalize_date("2017 Sep 1")
Out[2]: '2017-09-01'

In [3]: normalize_date("2017-Sep-1")
Out[3]: '2017-09-01'

In [4]: normalize_date("2017 Sep-Oct")
[...]
ValueError: Unknown string format

In [5]: normalize_date("01.09.2017")
Out[5]: '2017-01-09'

Date ranges are not supported yet. Are they a common occurrence? If so, we need to extend the utils to understand them. Also, the last case is not interpreted the way you expect, but it is ambiguous, so we would need to make a choice here. Do you think your interpretation is more common?
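
To illustrate why "01.09.2017" is ambiguous, the following sketch parses a dotted date under both readings with the standard library's `datetime.strptime`. The function name `interpretations` and the labels `day_first` / `month_first` are hypothetical, chosen only for this example; this is not how normalize_date is implemented.

```python
from datetime import datetime

# Hypothetical helper for illustration; not part of inspire_utils.
READINGS = (
    ("day_first", "%d.%m.%Y"),    # 01.09.2017 read as 1 September 2017
    ("month_first", "%m.%d.%Y"),  # 01.09.2017 read as 9 January 2017
)


def interpretations(raw):
    """Return the ISO date for each reading, or None where it is invalid."""
    results = {}
    for label, fmt in READINGS:
        try:
            results[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            # The string is not a valid date under this reading.
            results[label] = None
    return results


print(interpretations("01.09.2017"))
print(interpretations("25.09.2017"))
```

When both readings are valid and differ, as for "01.09.2017", the input alone cannot settle the question and a default has to be chosen. When only one reading is valid, as for "25.09.2017", the string disambiguates itself.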

michamos commented 6 years ago

@fschwenn did you see my question about date ranges?