usc-isi-i2 / etk

Extraction Toolkit
https://etk.readthedocs.io/en/development/
MIT License

Date extractor issue for a particular case #391

Open elmaestro08 opened 5 years ago

elmaestro08 commented 5 years ago

For dates of the form '11/2018' and '12/2018' (the year can vary), the following date-extraction code returns no extractions:

    event_date = doc.select_segments(jsonpath='$.Date')
    extractions = doc.extract(extractor=self.date_extractor, extractable=event_date[0])

In the snippet above, 'Date' is a field of the document. The code works for every other month, from '1/2018' to '10/2018', but fails for November and December.

elmaestro08 commented 5 years ago

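The workaround I used: read the raw 'Date' value, reverse it from month/year to year-month order, and pass the resulting string directly to the extractor: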

    event_date = doc.cdr_document.get('Date')
    event_date = '-'.join(event_date.split('/')[::-1])
    extractions = self.date_extractor.extract(event_date)
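For anyone who wants to see what the reordering step produces without installing etk, here is a minimal standalone sketch of just the string transformation. The helper name `month_year_to_year_month` is hypothetical (it is not part of etk), and the sketch assumes the input is always a 'M/YYYY' string. It does not call the date extractor at all.

```python
def month_year_to_year_month(value: str) -> str:
    """Reverse the '/'-separated parts and join them with '-'.

    Hypothetical helper, not an etk API. Assumes input like '11/2018'.
    """
    parts = value.strip().split('/')
    return '-'.join(parts[::-1])


if __name__ == "__main__":
    # Show the rewritten form for every month of 2018,
    # including the two months that failed in the original report.
    for month in range(1, 13):
        raw = f"{month}/2018"
        print(raw, "->", month_year_to_year_month(raw))
```

The rewritten string (for example '2018-11') is what gets passed to `self.date_extractor.extract(...)` in the workaround above. Note that the month is not zero-padded, matching the original one-liner.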