inspirehep / inspire-dojson

INSPIRE-specific rules to transform from MARCXML to JSON and back.
GNU General Public License v3.0

multiple legacy_creation_dates crash marcxml2record #210

Closed tsgit closed 6 years ago

tsgit commented 6 years ago

https://sentry.inspirehep.net/inspire-sentry/prod/issues/71500/

In HepNames some records preserve the creation dates of their ancestors that were merged, so subfield 961__x may be repeated.

000982164 961__ $$x1996-09-01$$x2006-04-21
000982164 961__ $$c2011-06-30$$c2013-03-09
000982182 961__ $$x2000-05-08$$x2008-06-30
000982182 961__ $$c2011-09-06$$c2009-06-07
000982514 961__ $$x2000-04-10$$x2008-02-14
000982514 961__ $$c2009-06-07$$c2013-04-08
000982535 961__ $$x1996-07-15$$x2008-07-25
000982535 961__ $$c2009-06-07
001005647 961__ $$x1992-06-25$$x1996-07-15
001005647 961__ $$c2009-06-07
001013833 961__ $$x1988-05-22$$x1990-05-28
001013833 961__ $$c2009-06-07

Conversion currently fails at:

https://github.com/inspirehep/inspire-dojson/blob/master/inspire_dojson/common/rules.py#L873

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_dojson/common/rules.pyc in legacy_creation_date(self, key, value)
    871         return self['legacy_creation_date']
    872 
--> 873     return normalize_date(value.get('x'))
    874 
    875 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in normalize_date(date, **kwargs)
    230         return
    231 
--> 232     return PartialDate.parse(date, **kwargs).dumps()
    233 
    234 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in parse(cls, date, **kwargs)
    147         default_date2 = datetime.datetime(2, 2, 2)
    148 
--> 149         parsed_date1 = parse_date(date, default=default_date1, **kwargs)
    150         parsed_date2 = parse_date(date, default=default_date2, **kwargs)
    151 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(timestr, parserinfo, **kwargs)
   1310         return parser(parserinfo).parse(timestr, **kwargs)
   1311     else:
-> 1312         return DEFAULTPARSER.parse(timestr, **kwargs)
   1313 
   1314 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    602                                                       second=0, microsecond=0)
    603 
--> 604         res, skipped_tokens = self._parse(timestr, **kwargs)
    605 
    606         if res is None:

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in _parse(self, timestr, dayfirst, yearfirst, fuzzy, fuzzy_with_tokens)
    678 
    679         res = self._result()
--> 680         l = _timelex.split(timestr)         # Splits the timestr into tokens
    681 
    682         skipped_idxs = []

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in split(cls, s)
    205     @classmethod
    206     def split(cls, s):
--> 207         return list(cls(s))
    208 
    209     @classmethod

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in __init__(self, instream)
     74         elif getattr(instream, 'read', None) is None:
     75             raise TypeError('Parser must be a string or character stream, not '
---> 76                             '{itype}'.format(itype=instream.__class__.__name__))
     77 
     78         self.instream = instream

TypeError: Parser must be a string or character stream, not tuple
tsgit commented 6 years ago

A solution could be to split the repeated subfield x into separate 961__ MARC tags. There are about 30 HepNames records with at least two $$x in the same tag.
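
To make the splitting idea concrete, here is a rough sketch. It assumes a field is represented as a list of `(code, value)` pairs, which is only an illustration and not how the legacy system or dojson actually stores MARC fields; `split_repeated_subfield` is a hypothetical helper name.

```python
def split_repeated_subfield(subfields, code='x'):
    """Split one field into several so that `code` occurs at most once per field.

    `subfields` is a list of (code, value) pairs (an assumed representation).
    """
    repeated = [(c, v) for c, v in subfields if c == code]
    others = [(c, v) for c, v in subfields if c != code]
    if len(repeated) <= 1:
        # Nothing to split: return the field unchanged.
        return [list(subfields)]
    # Keep any other subfields with the first occurrence only.
    fields = [[repeated[0]] + others]
    fields.extend([pair] for pair in repeated[1:])
    return fields


# 000982164 961__ $$x1996-09-01$$x2006-04-21
for field in split_repeated_subfield([('x', '1996-09-01'), ('x', '2006-04-21')]):
    print(field)
```

Each resulting field then carries a single `$$x`, so the rule would receive a string rather than a tuple.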

jacquerie commented 6 years ago

This is a data error, generated by an incorrect merge: it doesn't make sense for a record to have two creation dates. We can work around it by amending the rule to select the earliest date if more than one is present.
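
A minimal sketch of the "select the earliest date" workaround, not the actual patch. It assumes the repeated subfield arrives as a tuple of strings (as the `TypeError` above suggests) and that the dates are full ISO `YYYY-MM-DD` strings, so that plain string comparison orders them chronologically; `pick_earliest` is a hypothetical helper name.

```python
def pick_earliest(dates):
    """Return the earliest of one or more ISO 8601 date strings, or None."""
    if dates is None:
        return None
    if isinstance(dates, str):
        # A non-repeated subfield arrives as a single string.
        dates = (dates,)
    candidates = [date for date in dates if date]
    # ISO 8601 dates of equal precision sort chronologically as strings.
    return min(candidates) if candidates else None


print(pick_earliest(('1996-09-01', '2006-04-21')))
print(pick_earliest('2009-06-07'))
```

In the rule, the result of something like `pick_earliest(value.get('x'))` could then be passed to `normalize_date`, which would again receive a single string.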