adamn closed this issue 8 years ago
Hello again @adamn, we appreciate your input. The parse function has a very clear intent: to accept only strings (str or unicode) as input. Returning a datetime object unchanged when one is passed in would contradict that intent. Although this is well documented, it's not as obvious on the code side. Hence PR #178. Thank you!
I have to disagree. You're forcing the user to put all sorts of logic into their code just to handle what should be a non-event. I can't see any reason why dateparser couldn't handle the datetime gracefully - it's trivial.
That's an interesting point of view, Adam, though many won't agree. We always use recipes. Here's one that applies to your case:
>>> from dateparser import parse
>>> from datetime import datetime
>>> parse_date_or_datetime = lambda x: parse(x) if not isinstance(x, datetime) else x
>>> parse_date_or_datetime(datetime(2016, 1, 2, 10, 20))
datetime.datetime(2016, 1, 2, 10, 20)
Also, take a look here for a similar use case.
P.S. Zen of Python.
I can't imagine anybody would disagree with the request. I see no disadvantage to putting that lambda in the method itself.
Well, I dare to disagree.
If you have a device that turns cockroaches into zebras and you give it a zebra, would you rather get your shiny old zebra back and pretend like nothing happened, patting the device on its back for good work, or be notified that the device prefers cockroaches to work on? If you already have the zebra, why use the device?
IMO, @adamn's request is not unreasonable. pandas.to_datetime, for example, accepts several input types:
>>> import pandas as pd
>>> from datetime import datetime
>>> pd.to_datetime('2016-05-12')
Timestamp('2016-05-12 00:00:00')
>>> pd.to_datetime(pd.to_datetime('2016-05-12'))
Timestamp('2016-05-12 00:00:00')
>>> pd.to_datetime(datetime.now())
Timestamp('2016-05-12 11:52:31.850405')
On the other hand, python-dateutil also does not accept non-strings:
>>> from dateutil.parser import parse
>>> parse('2016-05-12')
datetime.datetime(2016, 5, 12, 0, 0)
>>> parse(parse('2016-05-12'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 1164, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 552, in parse
    res, skipped_tokens = self._parse(timestr, **kwargs)
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 671, in _parse
    l = _timelex.split(timestr) # Splits the timestr into tokens
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 188, in split
    return list(cls(s))
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 184, in next
    return self.__next__() # Python 2.x support
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 177, in __next__
    token = self.get_token()
  File "/home/paul/.virtualenvs/pandas/local/lib/python2.7/site-packages/dateutil/parser.py", line 93, in get_token
    nextchar = self.instream.read(1)
AttributeError: 'datetime.datetime' object has no attribute 'read'
Maybe @adamn you could tell us in what context you pass datetimes to dateparser (like throwing a bunch of data from different sources at dateparser and normalizing it), and how having this in dateparser would save you the trouble of checking input types?
I'm all about doing one thing and doing it well ... but the one thing that I feel this library could do well is take an arbitrary input and turn it into a datetime. To accept arbitrary str and unicode values but not a datetime seems abstruse. "Simple is better than complex." ... and the current situation is complex because it requires every user of the library to write boilerplate code that could easily be put into the library with no performance cost and fewer lines of code than #178 (i.e. one if statement and a return).
Anyway, I'm taking feeds via feedreader and using dates from there (strings). If the date is not available, I use the publish_date from newspaper3k (datetime). If those aren't available, I start parsing other datetimes I find (strings) or simply now() (datetime).
I've made a corresponding ticket on dateutil - https://github.com/dateutil/dateutil/issues/269
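The fallback chain described above (feed date strings first, then a publish_date datetime, then other strings, then now()) could be handled on the caller's side roughly as follows. This is only a sketch, not anything dateparser ships: `first_datetime` and `iso_only` are hypothetical names, and `iso_only` is a stand-in parser so the example runs with the standard library alone. In practice you would presumably pass `dateparser.parse`, which returns None when it cannot parse a string.

```python
from datetime import datetime


def first_datetime(candidates, parse_string):
    """Return the first candidate that yields a datetime, else the current time.

    `candidates` may mix strings, datetime objects, and None.
    `parse_string` is any callable mapping a string to a datetime or None
    (hypothetical parameter; dateparser.parse is assumed to fit here).
    """
    for value in candidates:
        if value is None:
            continue  # source had no date at all
        if isinstance(value, datetime):
            return value  # already a datetime: nothing to parse
        parsed = parse_string(value)
        if parsed is not None:
            return parsed
    return datetime.now()  # last resort when nothing else worked


def iso_only(text):
    # Hypothetical stand-in parser that only understands YYYY-MM-DD.
    try:
        return datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return None
```

With this shape the type check lives in one place in the calling code, which is the boilerplate the request would move into the library.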
@adamn FYI, you may be interested in arrow. It is a higher-level date and time library that tries to abstract away this sort of thing. arrow.get() accepts anything date-like and returns an Arrow object:
>>> import arrow
>>> from datetime import datetime
>>> arrow.get(datetime(2015, 1, 1))
<Arrow [2015-01-01T00:00:00+00:00]>
>>> arrow.get('2015-01-01')
<Arrow [2015-01-01T00:00:00+00:00]>
If you are looking for a general abstract date interface, that might be the right way to go.
A compromise might be to raise a UserWarning.
I'll check out arrow in the meantime.
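For illustration, the UserWarning compromise could look something like the wrapper below. This is a sketch under assumptions, not dateparser's actual behaviour: `parse_with_warning` is a hypothetical name, and `parse_string` stands for whatever string parser is in use (for example `dateparser.parse`).

```python
import warnings
from datetime import datetime


def parse_with_warning(value, parse_string):
    """Pass datetimes through unchanged, but warn that no parsing happened.

    `parse_string` is a hypothetical parameter: any callable that turns a
    string into a datetime.
    """
    if isinstance(value, datetime):
        warnings.warn(
            "parse() received a datetime; returning it unchanged",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this wrapper
        )
        return value
    return parse_string(value)
```

The caller still gets the datetime back, while the warning signals that a non-string reached the parser, which addresses both sides of the argument above.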
@asadurski If I have a device that turns cockroaches into zebras and I give it a zebra, I would prefer to get back the zebra, not for the device to kill the zebra :-)
If a datetime object is passed to parse(), it should simply return it. Currently it throws an exception.