scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

parse() throws exception when it receives a datetime #177

Closed adamn closed 8 years ago

adamn commented 8 years ago

If a datetime object is passed to parse(), it should simply return it. Currently it throws an exception.

>>> import datetime
>>> import dateparser
>>> dateparser.parse(datetime.datetime(2016, 4, 28, 0, 0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/adamn1/.venvs/django/lib/python3.5/site-packages/dateparser/conf.py", line 80, in wrapper
    return f(*args, **kwargs)
  File "/Users/adamn1/.venvs/django/lib/python3.5/site-packages/dateparser/__init__.py", line 40, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/Users/adamn1/.venvs/django/lib/python3.5/site-packages/dateparser/date.py", line 339, in get_date_data
    date_string = date_string.strip()
AttributeError: 'datetime.datetime' object has no attribute 'strip'

waqasshabbir commented 8 years ago

Hello again @adamn, we appreciate your input. The parse function is clearly intended to accept only strings (str or unicode) as input, so returning a datetime object unchanged when one is passed in would go against that intent. Although this is well documented, it's not as obvious on the code side. Hence PR #178. Thank you!

adamn commented 8 years ago

I have to disagree. You're forcing the user to put all sorts of logic into their code just to handle what should be a non-event. I can't see any reason why dateparser couldn't handle the datetime gracefully - it's trivial.

waqasshabbir commented 8 years ago

That's an interesting point of view, Adam, though many won't agree. We always use recipes. Here's one that applies to your case:

>>> from dateparser import parse
>>> from datetime import datetime
>>> parse_date_or_datetime = lambda x: parse(x) if not isinstance(x, datetime) else x
>>> parse_date_or_datetime(datetime(2016, 1, 2, 10, 20))
datetime.datetime(2016, 1, 2, 10, 20)

Also, take a look here for a similar use case.

P.S. Zen of Python.

adamn commented 8 years ago

I can't imagine anybody would disagree with the request. I see no disadvantage to putting that lambda in the method itself.
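For illustration, here is a minimal sketch of that guard, written as a wrapper rather than as a change to dateparser itself. The name `parse_or_passthrough` is hypothetical, and the `parser` argument stands in for `dateparser.parse`; `datetime.fromisoformat` is the default only so that the snippet runs without third-party packages.

```python
from datetime import datetime


def parse_or_passthrough(value, parser=datetime.fromisoformat):
    """Return datetimes unchanged; hand anything else to `parser`.

    `parser` is an assumption of this sketch: pass `dateparser.parse`
    to get the behaviour requested in this issue.
    """
    if isinstance(value, datetime):
        return value
    return parser(value)
```

With `parser=dateparser.parse`, this is the recipe above expressed as one `if` statement and a `return`.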

asadurski commented 8 years ago

Well, I dare to disagree.

If you have a device that turns cockroaches into zebras and you give it a zebra, would you rather get your shiny old zebra back and pretend like nothing happened, patting the device on its back for good work, or be notified that the device prefers cockroaches to work on? If you already have the zebra, why use the device?

redapple commented 8 years ago

IMO, @adamn's request is not unreasonable. pandas.to_datetime, for example, accepts several input types:

>>> import pandas as pd
>>> from datetime import datetime
>>> pd.to_datetime('2016-05-12')
Timestamp('2016-05-12 00:00:00')
>>> pd.to_datetime(pd.to_datetime('2016-05-12'))
Timestamp('2016-05-12 00:00:00')
>>> pd.to_datetime(datetime.now())
Timestamp('2016-05-12 11:52:31.850405')

On the other hand, python-dateutil also does not accept non-strings:

>>> from dateutil.parser import parse
>>> parse('2016-05-12')
datetime.datetime(2016, 5, 12, 0, 0)
>>> parse(parse('2016-05-12'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 1164, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 552, in parse
    res, skipped_tokens = self._parse(timestr, **kwargs)
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 671, in _parse
    l = _timelex.split(timestr)         # Splits the timestr into tokens
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 188, in split
    return list(cls(s))
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 184, in next
    return self.__next__()  # Python 2.x support
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 177, in __next__
    token = self.get_token()
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 93, in get_token
    nextchar = self.instream.read(1)
AttributeError: 'datetime.datetime' object has no attribute 'read'

Maybe, @adamn, you could tell us in what context you pass datetimes to dateparser (for example, throwing a bunch of data from different sources at dateparser and normalizing it), and how having this in dateparser would save you the trouble of checking input types?

adamn commented 8 years ago

I'm all about doing one thing and doing it well ... but the one thing that I feel this library could do well is take an arbitrary input and turn it into a datetime. To accept arbitrary str and unicode strings but not a datetime seems abstruse. "Simple is better than complex." ... and the current situation is complex because it requires every user of the library to write boilerplate code that could easily be put into the library with no performance penalty and fewer lines of code than #178 (i.e. one if statement and a return).

Anyway, I'm taking feeds via feedreader and using dates from there (strings). If the date is not available, I use the publish_date from newspaper3k (datetime). If those aren't available, I start parsing other datetimes I find (strings) or simply now() (datetime).
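That fallback chain could be sketched roughly as follows. This is an illustration under stated assumptions, not code from my project: `first_datetime` and `iso_or_none` are hypothetical names, and `iso_or_none` is a stdlib stand-in for `dateparser.parse`, which likewise returns None when it cannot parse a string.

```python
from datetime import datetime


def iso_or_none(text):
    # Stand-in for dateparser.parse (assumption): None means "could not parse".
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def first_datetime(candidates, parser=iso_or_none, fallback=datetime.now):
    """Return the first candidate that is, or parses to, a datetime."""
    for value in candidates:
        if value is None:
            continue  # this source had no date at all
        if isinstance(value, datetime):
            return value  # already a datetime, nothing to parse
        parsed = parser(value)
        if parsed is not None:
            return parsed
    return fallback()  # last resort: the current time
```

The `isinstance` check in the loop is the boilerplate in question: it exists only because the parser rejects datetimes.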

I've made a corresponding ticket on dateutil - https://github.com/dateutil/dateutil/issues/269

pganssle commented 8 years ago

@adamn FYI, you may be interested in arrow. It is a higher level date and time library that tries to abstract away this sort of thing. arrow.get() accepts anything date-like and returns an Arrow object:

>>> import arrow
>>> from datetime import datetime

>>> arrow.get(datetime(2015, 1, 1))
<Arrow [2015-01-01T00:00:00+00:00]>

>>> arrow.get('2015-01-01')
<Arrow [2015-01-01T00:00:00+00:00]>

If you are looking for a general abstract date interface, that might be the right way to go.

adamn commented 8 years ago

A compromise might be to raise a UserWarning.
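A rough sketch of what I mean, again as a wrapper and not as dateparser's actual behaviour. The name `parse_with_warning` is hypothetical, and `parser` stands in for `dateparser.parse` (the `datetime.fromisoformat` default just keeps the snippet stdlib-only).

```python
import warnings
from datetime import datetime


def parse_with_warning(value, parser=datetime.fromisoformat):
    """Pass datetimes through, but tell the caller it happened."""
    if isinstance(value, datetime):
        warnings.warn(
            "expected a string but received a datetime; returning it unchanged",
            UserWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return value
    return parser(value)
```

Callers who consider a datetime input to be a bug can turn the warning into an exception with `warnings.simplefilter("error", UserWarning)`; everyone else gets the value back.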

I'll check out arrow in the meantime.

@asadurski If I have a device that turns cockroaches into zebras and I give it a zebra, I would prefer to get back the zebra, not for the device to kill the zebra :-)