sboysel / fredr

An R client for the Federal Reserve Economic Data (FRED) API
https://sboysel.github.io/fredr/

Capture multiple series simultaneously? #99

Closed jsolson4 closed 3 years ago

jsolson4 commented 3 years ago

Greetings Sam,

Thank you for creating this package, it's been very helpful. I appreciate the vast selection of functions allowing one to search from categories to series, or get the tags for a series - it's very useful!

I am going to describe my current approach and process, then pose my question. I'm going to include a fair amount of detail, on the chance that it's helpful to others. I am new to API use and the fredr package, so it may be that there is a much more succinct way to accomplish this. If that's the case, I'd be happy to learn how to navigate my interactions with the FRED API more efficiently.

So here we go, I've been looking to capture all macroeconomic variables related to the USA, with some additional filters (monthly reporting, at least 8 years of observations, and not discontinued). Ultimately I'm looking to capture as many variables as are relevant and then select the relevant ones using my modeling approach.
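As a rough illustration of the filters described above, here is a minimal sketch that applies them locally to a data frame of series metadata. It assumes the metadata has `title`, `frequency_short`, `observation_start` and `observation_end` columns (as I understand fredr's series results to have) and that discontinued series carry "DISCONTINUED" in their title. `keep_candidate_series` is a hypothetical helper, not part of fredr.

```r
# Hypothetical helper: keep monthly, non-discontinued series with at least
# `min_years` of observations. Column names are assumptions about the
# metadata returned by the series search functions.
keep_candidate_series <- function(series, min_years = 8) {
  start <- as.Date(series$observation_start)
  end <- as.Date(series$observation_end)
  span_years <- as.numeric(end - start) / 365.25

  monthly <- series$frequency_short == "M"
  # Assumption: discontinued series are flagged in the title text
  active <- !grepl("DISCONTINUED", series$title, fixed = TRUE)

  series[monthly & active & span_years >= min_years, , drop = FALSE]
}
```

Filtering locally like this costs no extra API requests, since it only uses metadata that has already been downloaded.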

Most of the series selection processes I've observed are great for selecting one or two indicators. If one knows exactly what they're looking for, the interface is perfect. However, it was less clear to me what one should do to cast a wide net and get all variables of interest.

So, I learned that the structure of the FRED data starts with category IDs. Category IDs may have child category IDs. For example 'Business' could be the parent category to 'Small Business'. Category IDs will either have child category IDs or series underneath them. In our business example, a series under 'small business' could be 'number of employees'.

Wanting to capture as many relevant series as possible, I started at the root category ID (0) and iterated down each chain of category IDs until I had a master list of all category IDs. Then I iterated through the category IDs and captured the series IDs for all series meeting my search criteria. Now I am looking to iterate across the series and build a data frame with all of the results.
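The category walk described above could be sketched roughly as follows. This is not necessarily the code used in this thread. `collect_category_ids` and its `get_children` argument are hypothetical names: the lookup is passed in as a function so the traversal itself can be run without touching the API.

```r
# Collect every category ID reachable from `root_id` (breadth-first).
# `get_children` is a hypothetical argument: any function that takes one
# category ID and returns a data frame with an `id` column of child IDs
# (zero rows when the category has no children).
collect_category_ids <- function(root_id, get_children) {
  seen <- c()
  queue <- root_id
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% seen) next      # guard against visiting an ID twice
    seen <- c(seen, current)
    children <- get_children(current)
    queue <- c(queue, children$id)
  }
  seen
}

# With fredr (needs an API key and makes one request per category):
# all_ids <- collect_category_ids(0L, function(id) {
#   Sys.sleep(0.53)  # stay under 120 requests per minute
#   fredr::fredr_category_children(category_id = as.integer(id))
# })
```

Because the walk makes one request per category, a full traversal from the root is itself subject to the rate limit, which is why the commented usage sleeps between calls.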

My question: is there a way to query all (or multiple) series in one request so I'm not pinging the FRED API many many times?

I'd also like to note one piece of functionality that I think would be really useful. From my understanding, 'tag_names' in the fredr_category_series() function requires that the category search return only the series meeting all the tag criteria. So if I wanted to find results for 'farm', 'house', and 'usa' and I wrote a search using 'farm;house;usa', then I would only get results which included all of those tags (probably almost no results). It'd be great to have the option to find series meeting any of those criteria.

Sincerely, Justin

jsolson4 commented 3 years ago

Looks like this may have been what I was looking for re capturing multiple series at once: https://github.com/sboysel/fredr/issues/16

DavisVaughan commented 3 years ago

Yea for the multiple series question I would just purrr::map_dfr(series_ids, fredr)

There is no way to bulk get series from FRED in a single API hit

DavisVaughan commented 3 years ago

similarly i think you can purrr::map_dfr(my_tags, ~fredr_category_series(my_category, tag_names = .x))
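One caveat with mapping over tags is that a series carrying more than one of the tags comes back once per matching tag. A minimal sketch of an "any of these tags" search that removes those repeats, assuming each result has an `id` column; `series_matching_any` and `fetch_for_tag` are hypothetical names:

```r
# Union of per-tag results, one row per series.
# `fetch_for_tag` is a hypothetical argument: a function that takes a single
# tag name and returns a data frame of series with an `id` column.
series_matching_any <- function(tags, fetch_for_tag) {
  per_tag <- lapply(tags, fetch_for_tag)
  combined <- do.call(rbind, per_tag)
  # Drop repeats of a series that matched more than one tag
  combined[!duplicated(combined$id), , drop = FALSE]
}

# With fredr (needs an API key, one request per tag):
# hits <- series_matching_any(
#   c("farm", "house", "usa"),
#   function(tag) fredr::fredr_category_series(category_id = my_category,
#                                              tag_names = tag)
# )
```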

jsolson4 commented 3 years ago

Thank you, I was most curious about the bulk series download from a single API hit - so that answers my question.

I'd like to note that when running a bulk capture of data I would often hit the rate limit of 120 hits/min, which would prevent me from pinging the API again for 20 seconds. This would often occur even at runs to capture 10 series, where I am not sending a total of >=120 requests, so I assumed that the rate of requests itself was being detected. I could only assume that the loop I was using was iterating through at a rate faster than 120/min and spending a significant amount of time in delay.

To avoid this, I simply added a 'pause' function that ensures the loop will not ping the API at a rate faster than 120 hits/min. It seems to have increased the speed of iteration. This does appear to improve my runtime when pulling many series, but I didn't do extensive profiling, so correct me if this doesn't make sense.

Define a Function to pause the for loop:

    pause <- function(x){
      p1 <- proc.time()
      Sys.sleep(x)
      proc.time() - p1 # The cpu usage should be negligible
    }
    # use ~ value slightly greater than 0.5 seconds

Here is the function I have been using to iterate through the series:

    # Define function to iterate through the series ID variables and return one long data.frame.
    library(data.table)
    library(fredr)

    IDs.toDF = function(ids){
      bigdata = data.table()
      for (i in ids){
        singleObs = fredr(i)
        bigdata = rbind(bigdata, singleObs)
        pause(0.53) # avoid hitting rate limit and incurring 20 sec. pause
      }
      # value.var names the observation column explicitly so dcast does not have to guess it
      bigdata = dcast(bigdata, date~series_id, value.var = "value", fill = NA)
      return(bigdata)
    }

    # Run list of series and get results
    df = IDs.toDF(seriesIDs)

Above is the function I was using to iterate through the list of series and capture observations. I hadn't seen the purrr method before building this, but I like your purrr solution a lot. From a purely computational perspective it's obviously superior. I'm curious how you see the purrr solution playing out with the API rate limit. Does the vectorization increase the frequency at which the API is pinged and thus lead to delays?

sboysel commented 3 years ago

Thanks for taking the time to document this issue, @jsolson4.

You can pass a custom function to most of the purrr methods to control the timing of the requests so that they comply with the API limits. Just wrap the logic inside your for loop in a function and pass that function to the purrr method.
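One way this might look is sketched below, under the assumption that spacing calls at least 0.53 seconds apart is enough to stay under the 120 requests per minute limit mentioned earlier in the thread. `throttle` is a hypothetical helper, not part of fredr or purrr. It sleeps only for whatever part of the interval has not already elapsed, so time spent waiting on the request itself counts toward the spacing.

```r
# Wrap `f` so that consecutive calls start at least `min_interval` seconds apart.
# `throttle` is a hypothetical helper written for illustration.
throttle <- function(f, min_interval = 0.53) {
  last_call <- -Inf  # elapsed-time stamp of the previous call
  function(...) {
    wait <- min_interval - (proc.time()[["elapsed"]] - last_call)
    if (wait > 0) Sys.sleep(wait)
    last_call <<- proc.time()[["elapsed"]]
    f(...)
  }
}

# Usage with fredr and purrr (needs an API key):
# throttled_fredr <- throttle(fredr::fredr)
# observations <- purrr::map_dfr(series_ids, throttled_fredr)
```

Note that purrr's mapping functions still call the wrapped function one series at a time, so the request rate is set by the wrapper rather than by purrr.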