mortada / fredapi

Python API for FRED (Federal Reserve Economic Data) and ALFRED (Archival FRED)
Apache License 2.0
930 stars 160 forks

Get dataframe #15

Open elmotec opened 9 years ago

elmotec commented 9 years ago

Hey @mortada,

This is a cumulative PR for the fixes to test_search, the get_series changes, and the new get_dataframe. I don't know if you are interested, and if so whether you'd like it all in one commit (that's quite a bit) or in 3 commits as it is now. The normal order of the pull requests is: 1- fix-test_search, 2- get_series_with_realtime, 3- get_dataframe.

Cheers

mortada commented 9 years ago

@elmotec one high-level comment is that it seems odd to have a function called get_dataframe()

the reason get_series() is called get_series is not that the returned type is a pandas Series; it is that it fetches data for an economic series

it also seems like it could be quite tricky to fit data from different series into the same DataFrame index. Maybe it's best to let the user decide how to deal with different frequencies

elmotec commented 9 years ago

Fair enough. I guess we could call it get_multi_series() then, to reflect the fact that one fetches multiple series in a single Python call (there are still multiple requests to the web API).

With respect to joining multiple series, I have not pushed the concept very far, to be honest (it fits my smallish purpose), but pandas can handle quite a bit of data even if the join creates a lot of rows (memory is usually the bottleneck when I've had problems). I've also added a frequency argument to force, say, quarterly observations for series with many data points.

Also, note that get_series() does not go away, so users can still aggregate series the way they want. The idea is really to have an easy way to retrieve multiple series in a pandas.DataFrame with just one line of code. I know I find value in that.
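For illustration, here is a minimal sketch of the idea using only the existing get_series() (this is not the code in this PR; the fetch_as_dataframe helper and the series IDs are just examples):

```python
import pandas as pd
from fredapi import Fred

fred = Fred(api_key='insert_your_api_key_here')

def fetch_as_dataframe(series_ids):
    """Fetch each series separately and outer-join them on their dates."""
    columns = {sid: fred.get_series(sid) for sid in series_ids}
    # Series with different frequencies end up with NaN where dates don't align.
    return pd.DataFrame(columns)

df = fetch_as_dataframe(['GDP', 'CPIAUCSL'])  # e.g. a quarterly and a monthly series
```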

One last thing: I can see a backward-compatibility challenge, as 9999-12-31 is now interpreted as None rather than literally. That was mostly to play nice with the pandas.Timestamp format (which does not go that far into the future) so the data could be used in an index (which datetime.datetime values cannot), and also because no one really expects a series to be revised on 9999-12-31. None in that context is semantically equivalent to NULL in a database, so I thought it was a better choice than the max value of pandas.Timestamp.
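For context, a quick check of the constraint being discussed (the exact bounds and the exception name depend on the pandas version; recent versions expose the overflow error as pd.errors.OutOfBoundsDatetime):

```python
import datetime
import pandas as pd

datetime.datetime(9999, 12, 31)   # fine as a plain datetime
print(pd.Timestamp.max)           # roughly 2262-04-11

try:
    pd.Timestamp('9999-12-31')
except pd.errors.OutOfBoundsDatetime:
    print('9999-12-31 cannot be represented as a pandas Timestamp')
```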

mortada commented 9 years ago

pandas has a datetime null value called pd.NaT; would it make sense to use that instead of None?

elmotec commented 8 years ago

Right, right. Somehow I missed that, even though pandas actually turns None into NaT. Thank you for pointing it out.

What's more, checking it out made me realize that all comparisons to NaT return False, which may not be what I want, because things like df.query('rt_start < "2015-02-01" and "2015-02-01" < rt_end') don't return the expected result. Maybe I should set rt_end to pd.Timestamp.max for any date beyond pd.Timestamp.max instead.
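To make the failure mode concrete, here is a small made-up example (the column names mirror the ones above; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    'rt_start': [pd.Timestamp('2014-01-01')],
    'rt_end': [pd.NaT],   # "no end date yet" encoded as NaT
})

# Every comparison against NaT evaluates to False, so the row is filtered
# out even though its realtime window is semantically still open.
print(df.query('rt_start < "2015-02-01" and "2015-02-01" < rt_end'))  # empty
```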

elmotec commented 8 years ago

@mortada, So I made the replacement value for 9999-12-31 a class variable. For now, I think it should default to pd.Timestamp.max to allow selection on the resulting dataframe like df[(df.rt_start < my_date) & (my_date < df.rt_end)]. I also renamed the method to get_multi_series.
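Roughly, this is the kind of selection that the pd.Timestamp.max default enables (again a made-up one-row example, not the PR's test code):

```python
import pandas as pd

df = pd.DataFrame({
    'rt_start': [pd.Timestamp('2014-01-01')],
    'rt_end': [pd.Timestamp.max],  # 9999-12-31 mapped to the sentinel value
})

my_date = pd.Timestamp('2015-02-01')
# The open-ended row now survives an ordinary "between dates" selection.
print(df[(df.rt_start < my_date) & (my_date < df.rt_end)])
```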