ropensci / awardFindR

Scan multiple online grant databases for relevant awards
https://docs.ropensci.org/awardFindR

Update nih, usaspend, ophil, mellon #42

Closed isaacOnline closed 3 months ago

isaacOnline commented 4 months ago

Contributions

USASpend Problem Example

Calling get_usaspend(c("case study"), "2012-01-01", "2013-01-01", TRUE) returns a data frame with two records for ID N000141110398. When search_awards(c("case study"), "usaspend", "2012-01-01", "2013-01-01", TRUE) is called, it raises the following error while deduplicating: Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) (main.R#87): no rows to aggregate. The two awards, here and here, were made for projects with the same title and other information (although with different award amounts, different "Unique Award Key" values, and different awarding sub-agencies).
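To make the duplicate-ID situation concrete, here is a minimal sketch of collapsing records that share an ID into one row by summing their amounts. It is not the code in main.R: the data frame is a hypothetical stand-in for the get_usaspend() output, and the column names, the second ID, and all amounts are assumptions made up for illustration.

```r
# Hypothetical stand-in for the data frame returned by get_usaspend().
# Column names and amounts are illustrative, not the package's actual schema.
awards <- data.frame(
  id     = c("N000141110398", "N000141110398", "HYPOTHETICAL-2"),
  title  = c("Case study", "Case study", "Another project"),
  amount = c(100000, 25000, 5000),
  stringsAsFactors = FALSE
)

# Collapse rows that share an id and title into one row, summing the amounts.
deduped <- aggregate(amount ~ id + title, data = awards, FUN = sum)

# The repeated ID now appears once, with the combined amount.
deduped[deduped$id == "N000141110398", ]
```

Grouping on both id and title keeps one row per award while still carrying the title through. Whether summing is the right way to combine amounts for a repeated ID is a design decision for the package, not something this sketch settles.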

adam3smith commented 4 months ago

Thanks for all this. My concern with the USA Spending disambiguation (good catch on the error!) is that two different searches could produce a grant with the same ID and different funding amounts, which seems problematic. The problem with using generated_internal_id instead is that two different searches could return the same grant under different IDs, which is also bad (unless you mean to always use the internal ID, which also doesn't seem great -- good to capture the visible ID).

I guess I'm leaning towards going with your current solution, unless you have another idea.

(Michael: Isaac is working with Nic at UW)

isaacOnline commented 4 months ago

Oh hmm that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out? The issue with this (beyond being a bit over-engineered) would be that we'd be including parts of awards that don't meet the search criteria in the first place, e.g. if they're outside the date range of the original search. But I do think it's better to treat them as a single award. The instances I've seen so far have all been repeated FAINs, which are supposed to be unique within an agency. So my thinking is that if part of the award meets the search criteria, the rest of the award should be treated as meeting them too.

isaacOnline commented 4 months ago

Adding Mellon

The Mellon website has changed, and the prior get_mellon function no longer returns results. The most recent commit to this PR updates the function to use Mellon’s new GraphQL API. It has been adapted from the Python implementation written by @evamaxfield.

Notes on new implementation:

adam3smith commented 4 months ago

Oh hmm that's true. What do you think about making a second request to the API based on the IDs of the encountered awards, to make sure that portions of them aren't being left out?

That's one additional API call per grant, yes? At what scale will that get us into the doghouse (API-wise), do you think? Do they say anything?

evamaxfield commented 4 months ago

:wave: Hey! Quick chime in / question regarding my implementation of:

Speed: Most of the information can be retrieved in batches, but one field (amount) requires individual requests, which makes it fairly slow if there are a number of hits.

We can definitely thread this to make it much faster. It's easy enough in Python (I don't know about R), but the only reason I didn't was that I didn't want to hit API limits. From prior work do we know if they have API limits at all? Do we have a contact with their org to ask?

adam3smith commented 4 months ago

From prior work do we know if they have API limits at all? Do we have a contact with their org to ask?

Unfortunately not -- I'd be inclined to agree not to hit APIs with multi-threaded requests, i.e. to prioritize politeness over speed.
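As an illustration of the "politeness over speed" approach, here is a minimal sketch of issuing per-item requests one at a time with a pause between them. The helper name fetch_politely, the fetch_one callback, and the one-second default delay are all assumptions for illustration. Nothing here is taken from the package or from any documented API limit.

```r
# Call fetch_one(id) for each id in turn, pausing between calls.
# fetch_one is a hypothetical function that performs a single request.
fetch_politely <- function(ids, fetch_one, delay = 1) {
  results <- vector("list", length(ids))
  for (i in seq_along(ids)) {
    # Wrap in list() so that a NULL result does not drop the element.
    results[i] <- list(fetch_one(ids[[i]]))
    # Sleep between requests, but not after the last one.
    if (i < length(ids)) Sys.sleep(delay)
  }
  names(results) <- ids
  results
}

# Usage with a stand-in that makes no network calls.
fake_fetch <- function(id) list(id = id, amount = nchar(id))
out <- fetch_politely(c("a", "bb", "ccc"), fake_fetch, delay = 0)
```

The cost is linear in the number of hits, which matches the slowness described for the amount field. The delay is the only knob to turn if the provider ever publishes limits.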

isaacOnline commented 4 months ago

That's one additional API call per grant, yes? At what scale will that get us into the doghouse (API-wise), do you think? Do they say anything?

For USASpend we can submit a bulk request for multiple grants at once, so it would roughly double the number of API calls. I haven't been able to find any documentation from them on limits, but that doesn't seem unreasonable to me.
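A rough sketch of the batching idea: split the encountered IDs into fixed-size groups so that each group can go into one bulk request. The helper name chunk_ids, the batch size of 100, and the generated IDs are assumptions for illustration. The actual size a bulk request accepts is not stated in this thread.

```r
# Split a vector of ids into consecutive batches of at most batch_size.
chunk_ids <- function(ids, batch_size = 100) {
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# 250 made-up IDs become three batches.
batches <- chunk_ids(paste0("ID", 1:250), batch_size = 100)
lengths(batches)  # 100 100 50
```

With batches like these, the follow-up lookup costs one request per batch rather than one per award.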

Other updates:

adam3smith commented 3 months ago

@isaacOnline -- you'll let me know when you want me to review & merge here?

isaacOnline commented 3 months ago

Hi @adam3smith! Yes, now ready for review.

The most recent changes are below.

nniiicc commented 3 months ago

@adam3smith I've reviewed; can you merge?

adam3smith commented 3 months ago

Awesome, thanks!

adam3smith commented 3 months ago

@isaacOnline could you check why the CI tests are failing? Somehow the vignette isn't building correctly -- I didn't look beyond that.