Open elmotec opened 9 years ago
@elmotec one high level comment is that it seems odd to have a function called get_dataframe()
the reason get_series()
is called get_series
is not because the returned type is a pandas Series, it is because it is getting data for an economic series
it also seems like it can potentially be quite tricky to try to fit different series data into the same index of a DataFrame. Maybe it's best for the user to decide on how to deal with different frequencies
Fair enough. I guess we could call it get_multi_series()
then to reflect the fact that one fetches multiple series in one call to Python (there are still multiple requests to the web API).
With respect to the join of multiple series, I have not pushed the concept much to be honest (it fits my smallish purpose) but pandas can handle quite a bit of data even if the join were to create a lot of rows (memory is usually the bottleneck when I had problems). I've added a frequency argument to force say a quarterly observation for series with a lot of data points.
Also, note that get_series()
does not go away, so users can still aggregate series the way they want. The idea is really to have an easy way to retrieve multiple series in a pandas.DataFrame
with just one line of code. I know I find value in that.
One last thing: I can see a challenge to backward compatibility as 9999-12-31 is now interpreted as None rather than literally. That was mostly to play nice with pandas.Timestamp
format (which does not go that far in the future) so the data could be used in an index (datetime.datetime
cannot), and also because no one really expects the series to be revised on 9999-12-31. None
in that context is semantically equivalent to NULL
for databases, so I thought it was a better choice than the max value of pandas.Timestamp
.
pandas has a datetime null value called pd.NaT
, would it make sense to do that instead of None?
Right, right. Somehow I missed it even though pandas actually turns None into NaT. Thank you for pointing it.
What's more, checking it out made me realize that all the comparisons to NaT return false and that actually may not be what I want because things like df.query('rt_start < "2015-02-01" and "2015-02-01" < rt_end')
don't return the expected result. Maybe I should change rt_end to pd.Timestamp.max for anything beyond pd.Timestamp.max.
@mortada, So I made the replacement of 9999-12-31 a class variable. For now, I think it should default to pd.Timestamp.max to allow selection on the resulting dataframe like df[(df.rt_start < my_date) & (my_date < df.rt_end)]
. I also renamed the method get_multi_series.
Hey @mortada,
This is a cumulative PR for the fixes to test_search, get_series changes, and the new get_dataframe. Don't know if you are interested and if so whether you'd like all in one commit (that's quite a bit) or in 3 commits as it is. The normal order of the pull requests are: 1- fix-test_search 2- get_series_with_realtime 3- get_dataframe
Cheers