micahjsmith / FredData.jl

Pull data from Federal Reserve Economic Data (FRED) directly into Julia
https://micahjsmith.github.io/FredData.jl/dev

Real-time datasets #20

Open fipelle opened 3 years ago

fipelle commented 3 years ago

Hi,

I have written a small piece of code that generates multivariate real-time vintages by merging FredData.jl output with unrevised data (stored in an Excel file). I am unsure whether I should register a new package or open a pull request. Would you be open to the latter?

micahjsmith commented 3 years ago

Hi Filippo, thanks for the idea! I welcome contributions, so yes, please do open a pull request and I can work with you to see how we can best make this functionality available.

So if I understand correctly, a user would need to provide their own unrevised data from an external source (i.e. their own spreadsheet)? Could this data instead be constructed from one of the FRED API endpoints? One of the open issues in FredData.jl is to provide support for other endpoints (#13).

There is also a longstanding open issue to provide support for what my colleague has called "pseudo-vintages" (#11) and for which there is a linked MATLAB implementation. How close is this to what you are thinking of?

fipelle commented 3 years ago

Hi Micah,

> So if I understand correctly, a user would need to provide their own unrevised data from an external source (i.e. their own spreadsheet)?

Yes, that's correct. Currently, the user must provide:

  1. an Excel file with external unrevised data;
  2. the respective release calendar (real-time or pseudo real-time).

I reckon this might not be the best way to implement it within a registered package. Indeed:

  1. external unrevised data might not be needed for specific applications;
  2. linking the package to some pre-specified Excel design is not ideal.

I think it might be best to implement it in such a way that:

  1. external unrevised data is optional;
  2. external unrevised data and the release calendar are passed as arguments to some function and are, thus, represented directly as Julia data types.
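
A minimal sketch of what such an interface could look like, using only the Julia standard library. The type `ExternalSeries`, the function `build_vintage_sources`, and the `external` keyword are hypothetical names chosen for illustration; they are not part of FredData.jl, and `"GDPC1"` and `"PMI"` are example inputs only.

```julia
using Dates

# Hypothetical container for user-supplied unrevised data and its release calendar.
struct ExternalSeries
    name::String
    release_dates::Vector{Date}
    values::Vector{Float64}
end

# Hypothetical entry point: `external` defaults to `nothing`,
# so FRED-only use needs no spreadsheet or extra input.
function build_vintage_sources(fred_series::Vector{String};
                               external::Union{Nothing,Vector{ExternalSeries}} = nothing)
    sources = copy(fred_series)
    external === nothing || append!(sources, [s.name for s in external])
    return sources
end

pmi = ExternalSeries("PMI", [Date(2021, 2, 1)], [58.7])
fred_only = build_vintage_sources(["GDPC1"])
with_external = build_vintage_sources(["GDPC1"]; external = [pmi])
```

Because the external data arrives as a plain Julia value rather than a file path, the package would not be tied to any particular Excel layout.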

> Could this data instead be constructed from one of the FRED API endpoints? One of the open issues in FredData.jl is to provide support for other endpoints (#13).

While this would work for data available on FRED, it would be limiting for users. For instance, quite a few interesting unrevised surveys and indices are not available on FRED.

> There is also a longstanding open issue to provide support for what my colleague has called "pseudo-vintages" (#11) and for which there is a linked MATLAB implementation. How close is this to what you are thinking of?

It is not far off, even though the code does not currently support it. At the moment the code creates two DataFrames (one from FRED and one from the external source described above), transforms the data when needed (for instance, to remove the effect of a change in the base year) and merges them with an outer join on the release dates.
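
The merge step can be illustrated with a minimal sketch, assuming each source is reduced to a mapping from release date to value. All names and numbers below are made up for illustration, and plain `Dict`s stand in for the DataFrames so that the example needs only the standard library. With DataFrames.jl the equivalent step would be `outerjoin(fred_df, external_df; on = :release_date)`, given a shared `release_date` column.

```julia
using Dates

# Hypothetical inputs: release date => value published on that date.
fred = Dict(Date(2021, 1, 29) => 101.2, Date(2021, 2, 26) => 101.9)
external = Dict(Date(2021, 2, 1) => 58.7, Date(2021, 2, 26) => 59.1)

# Outer join: keep every release date that appears in either source and
# fill the source with no release on that date with `missing`.
function outer_join_on_release_dates(a::Dict{Date,Float64}, b::Dict{Date,Float64})
    dates = sort(collect(union(keys(a), keys(b))))
    return [(release_date = d,
             fred_value = get(a, d, missing),
             external_value = get(b, d, missing)) for d in dates]
end

merged = outer_join_on_release_dates(fred, external)
```

Each row of `merged` corresponds to one release date, so a date on which only one source published still appears in the result.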

To allow for pseudo-vintages, I suspect we would need to update the release dates column of the FRED DataFrame using some external calendar. This should involve an additional keyword argument in the relevant function.
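
A minimal sketch of that idea, assuming release dates are stored per reference period. The function `assign_release_dates` and its `calendar` keyword are hypothetical, and the dates are invented for illustration.

```julia
using Dates

# Hypothetical helper: `release_dates` maps reference period => actual release date.
# The optional `calendar` keyword overrides those dates wherever it has an entry,
# which turns real-time release dates into pseudo real-time ones.
function assign_release_dates(release_dates::Dict{Date,Date};
                              calendar::Union{Nothing,Dict{Date,Date}} = nothing)
    calendar === nothing && return copy(release_dates)
    return Dict(period => get(calendar, period, released)
                for (period, released) in release_dates)
end

actual = Dict(Date(2021, 1, 1) => Date(2021, 2, 12),
              Date(2021, 2, 1) => Date(2021, 3, 15))
pseudo_calendar = Dict(Date(2021, 1, 1) => Date(2021, 2, 15))

real_time = assign_release_dates(actual)
pseudo = assign_release_dates(actual; calendar = pseudo_calendar)
```

Leaving `calendar` at its default keeps the actual release dates, so real-time use is unaffected by the extra keyword.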


Ideally, I should be able to re-write what I have in the form of a small package in a few days and we can start from there. Given personal time constraints -- I am finishing my PhD thesis -- we could release a first version without the pseudo-vintages support soonish (in 1-2 weeks?) and work on the pseudo-vintages support at some point after the summer break.

fipelle commented 3 years ago

I forgot to ask: which branch should I fork?

micahjsmith commented 3 years ago

For development, please see a few notes here: https://micahjsmith.github.io/FredData.jl/dev/contributing/

Forking happens at the level of the entire repository; once you have created a fork, you can create a branch in your own copy of the repository with a short descriptive name.

micahjsmith commented 3 years ago

Okay, I think I now have a better understanding of the scope of what you propose. But perhaps before or as you are getting started, you could share some sample real-time datasets, with inputs and outputs, that you have created using this method? You can email me, attach files directly to an issue comment, or paste a subset of the rows into a code block in an issue comment.

I think the functionality of merging the FRED output with unrevised data and a list of release dates sounds super useful. But I'm thinking that it might actually be too general-purpose of a routine for this package? The goal of FredData.jl is pretty narrowly to expose the functionality provided by the FRED API within Julia. So I'm thinking that what you propose may be best as (1) an example committed under /docs/src and shown in the FredData.jl documentation site or (2) a separate package. But perhaps I'd have a better understanding after seeing some sample inputs.

fipelle commented 3 years ago

Thanks. Will do! However, I first need to write a simplified version of what I currently have. I am using it for a series of specialised projects and it might be confusing as it is. It shouldn't take long though, just a few days.

> But I'm thinking that it might actually be too general-purpose of a routine for this package?

While I agree in principle, I am not entirely convinced. At the end of the day, if you are working with real-time economic data in Julia, there is a high chance that you will look at the FredData.jl routines first. Having the option of including external unrevised data (e.g., PMIs, stock price indices) in a real-time dataset would certainly be handy for researchers.

However, if you feel strongly that it should not be included in FredData.jl, creating a separate package might be best. We could name it in a way that recalls FredData.jl and consider it part of the FRED Data environment.

fipelle commented 3 years ago

I am sending you, via email, a JLD output with an array of data vintages and the release dates for each vintage. I have structured the data vintages as a DataFrame at the end, so that it should be easier to understand what's inside.