pydata / pandas-datareader

Extract data from a wide range of Internet sources into a pandas DataFrame.
https://pydata.github.io/pandas-datareader/stable/index.html
Other
2.93k stars 681 forks source link

Propose more robust datetime to unix time conversion in yahoo/daily.py #378

Open shipmints opened 7 years ago

shipmints commented 7 years ago

Apologies as I'm new to git and github. I can try to make a fork/patch/pull request but I have to figure out how. In the meantime, I thought I'd offer the following to the local experts who surely can do this an order of magnitude faster than I can...

    def _get_params(self, symbol):
# I propose changing
        unix_start = int(time.mktime(self.start.timetuple()))
# to
        unix_start = (start - datetime.datetime(1970, 1, 1)).total_seconds()
# as it avoids mktime overflows
# e.g., try running the code with 1900/01/01 as the start range
shipmints commented 7 years ago

Corrected snippet to include self...

unix_start = (self.start - datetime.datetime(1970, 1, 1)).total_seconds()
stoffprof commented 7 years ago

This would resolve the error one gets when trying to pull data pre-1970 (on a Windows machine):

stdt = datetime(1950,1,1)
df = web.DataReader('^GSPC', 'yahoo', stdt)

     98     def _get_params(self, symbol):
---> 99         unix_start = int(time.mktime(self.start.timetuple()))
    100         day_end = self.end.replace(hour=23, minute=59, second=59)
    101         unix_end = int(time.mktime(day_end.timetuple()))

OverflowError: mktime argument out of range 

I think this would also work:

unix_start = int(calendar.timegm((self.start.timetuple()))
femtotrader commented 7 years ago

@shipmints Be aware that you can't substract timezone aware datetime and timezone naive datetime So it may fail if self.start is a timezone aware datetime (because your epoch is datetime.datetime(1970, 1, 1) which is a timezone naive datetime)

EPOCH should in fact be datetime.datetime(1970, 1, 1, tzinfo=pytz.UTC)

so we may wonder what will happen if self.start is timezone aware but not UTC?

What ever change in pandas-datareader code will be... we should have _sanitize_dates function in mind https://github.com/pydata/pandas-datareader/blob/331e3895c51326e87e69325e1a14f1e350cb832e/pandas_datareader/_utils.py#L18 Calculating difference between 2 datetime with differents timezone (according DST or not ...) may be a bit tricky (but some Python library can do this)

With Python 3.6 and Pandas 0.20 there is a very convenient method:

pd.Timestamp('1970-01-01', tz='UTC').timestamp()

unfortunately not with Python 2.7 and same Pandas version

femtotrader commented 7 years ago

A workaround may be to define (probably in _utils.py) this function

def _timestamp(ts):
    """
    Return unix timestamp for Pandas Timestamp ts
    See https://github.com/pandas-dev/pandas/issues/17329
    """
    try:
        return ts.timestamp()
    except AttributeError:
        return ts.value // 1000 / 1000000.0

With Python 3.6 (same result with Python 2.7)

In [161]: _timestamp(pd.Timestamp("1970-1-1", tz="UTC"))
Out[161]: 0.0

In [162]: _timestamp(pd.Timestamp("1970-1-1", tz="US/Central"))
Out[162]: 21600.0

In [163]: _timestamp(pd.Timestamp("1970-1-1", tz="Europe/Paris"))
Out[163]: -3600.0
femtotrader commented 7 years ago

but the problem of this code is that it returns a different value when no timezone is given

With Python 3.6 (my system timezone is Europe/Paris)

In [167]: _timestamp(pd.Timestamp("1970-1-1"))
Out[167]: -3600.0

with Python 2.7

In [43]: _timestamp(pd.Timestamp("1970-1-1"))
Out[43]: 0.0

Pinging @chris-b1 @gfyoung

femtotrader commented 7 years ago

Maybe

def _timestamp(ts):
    """
    Return unix timestamp for Timestamp ts
    See https://github.com/pandas-dev/pandas/issues/17329
    """
    try:
        return ts.timestamp()
    except AttributeError:
        timestamp = ts.value // 1000 / 1000000.0
        if ts.tz is None:
             timestamp += time.timezone
        return timestamp

In [36]: _timestamp(pd.Timestamp("1970-1-1", tz="UTC"))
Out[36]: 0.0

In [37]: _timestamp(pd.Timestamp("1970-1-1", tz="US/Central"))
Out[37]: 21600.0

In [38]: _timestamp(pd.Timestamp("1970-1-1", tz="Europe/Paris"))
Out[38]: -3600.0

In [69]: _timestamp(pd.Timestamp("1970-1-1"))  # use local timezone
Out[69]: -3600.0
stoffprof commented 7 years ago

@jreback Should this be considered a bug? Can't download pre-1970 data from Yahoo even when it is available (on Windows machines).

jreback commented 7 years ago

the patch by @femtotrader looks fine if u want to submit a PR with tests

bashtage commented 6 years ago

Yahoo has been deprecated.

stoffprof commented 5 years ago

I don't think this was ever implemented in yahoo, which is no longer deprecated. I can do a PR if needed. @femtotrader - were you suggesting that that _timestamp() be called within _sanitize_dates() on start and end and then using @shipmints solution of

unix_start = (start - datetime.datetime(1970, 1, 1)).total_seconds()

@bashtage - thoughts?

stoffprof commented 5 years ago

To be clear,

 pdr.get_data_yahoo('^GSPC', '1965')

raises a mktime argument out of range exception on Windows but not Mac.

Also, since yahoo only returns daily data, how important is the time zone issue? Seems like tz could safely be ignored.

bashtage commented 5 years ago

Doesn't seem like a high priority issue, but if you have some time and want to fix...

gliptak commented 5 years ago

Was this improved with https://github.com/pydata/pandas-datareader/pull/703 ?