Closed by isaacOnline 3 months ago
Thanks for all this.
My concern with the USA Spending disambiguation (good catch on the error!) is that two different searches could produce a grant with the same ID and different funding amounts: that seems problematic.
The problem with using `generated_internal_id` is that two different searches could then return the same grant with different IDs, which is also bad (unless you mean to always use the internal ID, which also doesn't seem great -- it's good to capture the visible ID).
I guess I'm leaning towards going with your current solution, unless you have another idea.
(Michael: Isaac is working with Nic at UW)
Oh hmm, that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out? The issue with this (beyond being a bit over-engineered) would be that we'd be including parts of awards that don't meet the search criteria in the first place, e.g. if they're outside the date range of the original search. But I do think it's better to treat them as a single award. The instances I've seen so far have all been repeated FAINs, which are supposed to be unique within an agency. So my thinking is that if part of the award meets the search criteria, the rest of the award should be treated as such too.
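To make the second-request idea concrete, here's a rough Python sketch. The helper names (`unique_award_ids`, `fetch_by_id`) are hypothetical -- the real request logic lives in `get_usaspend` -- but the shape of the two-pass approach is: collect the distinct FAINs from the first search, then re-query by ID so no portion of an award is dropped.

```python
# Hypothetical sketch of the proposed two-pass lookup; the actual
# USASpending request code is in get_usaspend, not shown here.
def unique_award_ids(records):
    """Collect the distinct FAINs seen in a first-pass search."""
    return sorted({r["Award ID"] for r in records})

def fetch_full_awards(records, fetch_by_id):
    """Second pass: re-query each ID so no portion of an award is left out.

    fetch_by_id is assumed to wrap a request to the awards API and return
    every record sharing that Award ID, including ones outside the
    original search's date range.
    """
    full = []
    for award_id in unique_award_ids(records):
        full.extend(fetch_by_id(award_id))
    return full
```

The trade-off discussed above shows up in `fetch_full_awards`: records pulled in by the second pass are kept even if they would not have matched the original search criteria on their own.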
The Mellon website has changed, and the prior `get_mellon` function no longer returns results. The most recent commit to this PR updates the function to use Mellon's new GraphQL API. It has been adapted from the Python implementation written by @evamaxfield. Note that grant URLs have also changed (e.g. `/grants/grants-database/grants/center-for-research-libraries/31600667/` under the old implementation vs. `/grant-details/11533` under the new one), so these won't match due to the website change. For reference, that example grant's description reads: "to support an international initiative to preserve and provide access to non-Western, non-English language archival and library collections."
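For anyone unfamiliar with the GraphQL pattern, a minimal Python sketch of how a search payload for an API like Mellon's gets built is below. The query text and field names here are illustrative placeholders, not Mellon's actual schema -- the real query is in the adapted implementation.

```python
import json

def build_grant_search_payload(term, limit=25, offset=0):
    # NOTE: "GrantSearch", "grantSearch", and the selected fields are
    # hypothetical placeholders, not the real Mellon GraphQL schema.
    query = """
    query GrantSearch($term: String!, $limit: Int!, $offset: Int!) {
      grantSearch(term: $term, limit: $limit, offset: $offset) {
        entities { title description amount }
      }
    }
    """
    # A GraphQL request is a single JSON document with the query text
    # plus a "variables" object, POSTed to one endpoint.
    return json.dumps({
        "query": query,
        "variables": {"term": term, "limit": limit, "offset": offset},
    })
```

The design point worth noting is that pagination (`limit`/`offset`) and the search term travel in `variables`, so the query document itself never changes between pages.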
> Speed: Most of the information can be retrieved in batches, but one field (`amount`) requires individual requests, which makes it fairly slow if there are a number of hits.

> Oh hmm, that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out?

That's one additional API call per grant, yes? At what scale do you think that will get us into the doghouse (API-wise)? Do they say anything?
:wave: Hey! Quick chime in / question regarding my implementation of:
> Speed: Most of the information can be retrieved in batches, but one field (`amount`) requires individual requests, which makes it fairly slow if there are a number of hits.
We definitely can thread this to make it much faster. It's easy enough in Python (I don't know about R), but the only reason I didn't was that I didn't want to hit API limits. From prior work, do we know if they have API limits at all? Do we have a contact at their org to ask?
> From prior work, do we know if they have API limits at all? Do we have a contact at their org to ask?
Unfortunately not -- I'd be inclined to agree not to hit APIs with multi-threaded requests, i.e. prioritize politeness over speed.
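"Politeness over speed" amounts to something like the sketch below: requests go out one at a time with a fixed pause, rather than in parallel threads. The delay value is an assumption, and `polite_map` is a made-up name, not anything in the package.

```python
import time

def polite_map(fetch, items, delay=1.0):
    """Call fetch() on each item sequentially, pausing between calls.

    Deliberately single-threaded: the pause is a courtesy back-off so we
    never hammer an API whose rate limits we don't know. delay=1.0 is an
    assumed default, not a documented limit.
    """
    results = []
    for item in items:
        results.append(fetch(item))
        time.sleep(delay)  # back off between calls
    return results
```

If the provider ever documents a rate limit, `delay` can be tuned to sit just under it; until then a conservative pause is the safe default.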
> That's one additional API call per grant, yes? At what scale do you think that will get us into the doghouse (API-wise)? Do they say anything?
For USASpend we can submit a bulk request for multiple grants at once, so it would roughly double the number of API calls. I haven't been able to find any documentation from them on limits, but that doesn't seem unreasonable to me.
While the `get_nsf` function was looping through results, there would be column mismatches in the results returned on different iterations. These would error out when the loop tried to join the results together, so I added handling for this case. You can recreate the issue with this call: `awardFindR::search_awards("semi-structured interviews", "nsf", "1998-04-14", "1998-04-15")`
.@isaacOnline -- you'll let me know when you want me to review & merge here?
Hi @adam3smith! Yes, now ready for review.
The most recent changes are below.
@adam3smith I've reviewed -- can you merge?
Awesome, thanks!
@isaacOnline could you check why the CI tests are failing? Somehow the vignette isn't building correctly -- I didn't look beyond that.
Contributions

- Updated the `get_nih` function to use the V2 API. V2 is mostly backward compatible, but required a slight change to the payload and response processing.
- The `get_usaspend` function was returning multiple awards with the same ID, which was causing an error when `search_awards` tried to deduplicate records. These awards seemed to be cases where one agency was awarding multiple grants to the same project, but where the funds came from multiple sub-agencies. I put the info needed to recreate this issue below. The PR adds deduplication of award IDs within the `get_usaspend` function to handle the problem. Alternatively, if it would be better to treat awards of this type as separate rather than merging them, I'm happy to amend this to use the `generated_internal_id` field from USASpending as the unique ID, rather than `Award ID`, which is not unique.
- The `link` field in `get_ophil` (which is being used as an ID by `.standardize_ophil`) appears to have been changed on their website. The PR updates the selector.

USASpend Problem Example
Calling `get_usaspend(c("case study"), "2012-01-01", "2013-01-01", TRUE)` returns a data frame with two records for `IDN000141110398`. When `search_awards(c("case study"), "usaspend", "2012-01-01", "2013-01-01", TRUE)` is called, it raises the following while deduplicating: `Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) (main.R#87): no rows to aggregate`. The two awards, here and here, were awarded for projects with the same title and other information (although with different award amounts, different "Unique Award Key" values, and different awarding sub-agencies).