Closed by isaacOnline 3 months ago
Thanks for all this.
My concern with the USA Spending disambiguation (good catch on the error!) is that two different searches could produce a grant with the same ID and different funding amounts: that seems problematic.
The problem with using `generated_internal_id` is that two different searches could then return the same grant with different IDs, which is also bad (unless you mean to always use the internal ID, which also doesn't seem great -- it's good to capture the visible ID).
I guess I'm leaning towards going with your current solution, unless you have another idea.
(Michael: Isaac is working with Nic at UW)
Oh hmm, that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out? The issue with this (beyond being a bit over-engineered) would be that we'd be including parts of awards that don't meet the search criteria in the first place, e.g. if they're outside the date range of the original search. But I do think it's better to treat them as a single award. The instances I've seen so far have all been repeated FAINs, which are supposed to be unique within an agency. So my thinking is that if part of the award meets the search criteria, the rest of the award should be treated as such too.
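To make the second-request idea concrete, here's a rough Python sketch. The helper names (`unique_award_ids`, `fetch_by_id`) are hypothetical -- the real request logic lives in `get_usaspend` -- but the shape of the two-pass approach is: collect the distinct FAINs from the first search, then re-query by ID so no portion of an award is dropped.

```python
# Hypothetical sketch of the proposed two-pass lookup; the actual
# USASpending request code is in get_usaspend, not shown here.
def unique_award_ids(records):
    """Collect the distinct FAINs seen in a first-pass search."""
    return sorted({r["Award ID"] for r in records})

def fetch_full_awards(records, fetch_by_id):
    """Second pass: re-query each ID so no portion of an award is left out.

    fetch_by_id is assumed to wrap a request to the awards API and return
    every record sharing that Award ID, including ones outside the
    original search's date range.
    """
    full = []
    for award_id in unique_award_ids(records):
        full.extend(fetch_by_id(award_id))
    return full
```

The trade-off discussed above shows up in `fetch_full_awards`: records pulled in by the second pass are kept even if they would not have matched the original search criteria on their own.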
The Mellon website has changed, and the prior `get_mellon` function no longer returns results. The most recent commit to this PR updates the function to use Mellon's new GraphQL API. It has been adapted from the Python implementation written by @evamaxfield. Note that grant URLs have also changed (e.g. `/grants/grants-database/grants/center-for-research-libraries/31600667/` under the old implementation vs. `/grant-details/11533` under the new one), so these won't match due to the website change. For reference, that example grant's description reads: "to support an international initiative to preserve and provide access to non-Western, non-English language archival and library collections."
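For anyone unfamiliar with the GraphQL pattern, a minimal Python sketch of how a search payload for an API like Mellon's gets built is below. The query text and field names here are illustrative placeholders, not Mellon's actual schema -- the real query is in the adapted implementation.

```python
import json

def build_grant_search_payload(term, limit=25, offset=0):
    # NOTE: "GrantSearch", "grantSearch", and the selected fields are
    # hypothetical placeholders, not the real Mellon GraphQL schema.
    query = """
    query GrantSearch($term: String!, $limit: Int!, $offset: Int!) {
      grantSearch(term: $term, limit: $limit, offset: $offset) {
        entities { title description amount }
      }
    }
    """
    # A GraphQL request is a single JSON document with the query text
    # plus a "variables" object, POSTed to one endpoint.
    return json.dumps({
        "query": query,
        "variables": {"term": term, "limit": limit, "offset": offset},
    })
```

The design point worth noting is that pagination (`limit`/`offset`) and the search term travel in `variables`, so the query document itself never changes between pages.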
> Speed: Most of the information can be retrieved in batches, but one field (`amount`) requires individual requests, which makes it fairly slow if there are a number of hits.

> Oh hmm, that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out?

That's one additional API call per grant, yes? At what scale do you think that will get us into the doghouse (API-wise)? Do they say anything?
:wave: Hey! Quick chime in / question regarding my implementation of:
> Speed: Most of the information can be retrieved in batches, but one field (`amount`) requires individual requests, which makes it fairly slow if there are a number of hits.
We definitely can thread this to make it much faster. It's easy enough in Python (I don't know about R), but the only reason I didn't was that I didn't want to hit API limits. From prior work, do we know if they have API limits at all? Do we have a contact at their org to ask?
> From prior work, do we know if they have API limits at all? Do we have a contact at their org to ask?
Unfortunately not -- I'd be inclined to agree not to hit APIs with multi-threaded requests, i.e. prioritize politeness over speed.
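"Politeness over speed" amounts to something like the sketch below: requests go out one at a time with a fixed pause, rather than in parallel threads. The delay value is an assumption, and `polite_map` is a made-up name, not anything in the package.

```python
import time

def polite_map(fetch, items, delay=1.0):
    """Call fetch() on each item sequentially, pausing between calls.

    Deliberately single-threaded: the pause is a courtesy back-off so we
    never hammer an API whose rate limits we don't know. delay=1.0 is an
    assumed default, not a documented limit.
    """
    results = []
    for item in items:
        results.append(fetch(item))
        time.sleep(delay)  # back off between calls
    return results
```

If the provider ever documents a rate limit, `delay` can be tuned to sit just under it; until then a conservative pause is the safe default.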
> That's one additional API call per grant, yes? At what scale do you think that will get us into the doghouse (API-wise)? Do they say anything?
For USASpend we can submit a bulk request for multiple grants at once, so it would roughly double the number of API calls. I haven't been able to find any documentation from them on limits, but that doesn't seem unreasonable to me.
While the `get_nsf` function was looping through results, there would be column mismatches in the results returned on different iterations. These would error out when the loop tried to join the results together, so I added handling for this case. You can recreate the issue with this call: `awardFindR::search_awards("semi-structured interviews", "nsf", "1998-04-14", "1998-04-15")`
.@isaacOnline -- you'll let me know when you want me to review & merge here?
Hi @adam3smith! Yes, now ready for review.
The most recent changes are below.
@adam3smith I've reviewed -- can you merge?
Awesome, thanks!
@isaacOnline could you check why the CI tests are failing? Somehow the vignette isn't building correctly -- I didn't look beyond that.
Contributions

- Updated the `get_nih` function to use the V2 API. V2 is mostly backward compatible, but required a slight change to the payload and response processing.
- The `get_usaspend` function was returning multiple awards with the same ID, which was causing an error when `search_awards` tried to deduplicate records. These awards seemed to be cases where one agency was awarding multiple grants to the same project, but where the funds came from multiple sub-agencies. I put the info needed to recreate this issue below. The PR adds deduplication of award IDs within the `get_usaspend` function to handle the problem. Alternatively, if it would be better to treat awards of this type as separate rather than merging them, I'm happy to amend this to use the `generated_internal_id` field from USASpending as the unique ID, rather than `Award ID`, which is not unique.
- The `link` field in `get_ophil` (which is being used as an ID by `.standardize_ophil`) appears to have been changed on their website. The PR updates the selector.

USASpend Problem Example
Calling `get_usaspend(c("case study"), "2012-01-01", "2013-01-01", TRUE)` returns a data frame with two records for `IDN000141110398`. When `search_awards(c("case study"), "usaspend", "2012-01-01", "2013-01-01", TRUE)` is called, it raises the following while deduplicating: `Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) (main.R#87): no rows to aggregate`. The two awards, here and here, were awarded for projects with the same title and other information (although with different award amounts, different "Unique Award Key" values, and different awarding sub-agencies).