Hi Russ -
First, thanks for all the work on this. Do you mind if we break things up a bit more in the PRs? I know you mentioned that it was difficult b/c there were some dependencies that sprawled and required you to pull in the various pieces, but I'm hoping the changes I made to the target branch (api-redesign) will make it a bit clearer how we can break things up. Here's what I'm thinking:
skip(), so that we can test this basic functionality in the first PR. Also please update the search-pv test file so that we have tests that cover whatever minimal functionality we're porting over to the new API in the first PR.

Chris,
I have additional tests for the new functionality in search-pv but I created separate files for them - I can append them to test-search-pv.R and push that here.
Hi Russ,
I was hoping we could put off the casting stuff (including any tests involving cast_pv_data) until a second PR. I introduced the various calls to skip() in the tests file in my latest commit, so that only search_pv functionality is tested (e.g., https://github.com/ropensci/patentsview/commit/dbe2fa53bfcd00cdd71cbd9e37b9d776dbe486db#diff-bbe8caac4540139b923f518083cfc89e178b6a0d2201351b6eb68e1d0a08312e). You'll see that previously there was a call to skip_on_ci() in those files too, as well as in the search-pv test file. I removed that call in the search-pv file, so that the calls to the API are actually being made during the builds. I believe your builds were running with that skip line included, hence they would not have been going against the actual API (which is what we were doing previously, b/c I didn't want to pummel the API during builds, but now I'm thinking it will be worth it).
Yeah, I think if you just include whatever script you had to create the new fieldsdf object in data-raw, that will be fine. The convention, as far as I know, is for data objects to be populated using scripts like this in a totally offline manner (i.e., not during builds), but we would like them to be reproducible by having that script in the data-raw folder.
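For reference, the pattern usually looks something like this (a minimal sketch; the file name and object contents are placeholders for whatever definition_extract.R actually produces):

# Run by hand from data-raw/, never during builds; the resulting .rda is committed.
fieldsdf <- read.csv("data-raw/fieldsdf.csv", stringsAsFactors = FALSE)
usethis::use_data(fieldsdf, overwrite = TRUE)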
I think we should put off additional functionality for search_pv (as well as its tests) until a separate PR. Does that work for you?
I pushed data-raw/definition_extract.R. There's a lot there that I was hoping to save for a later PR!
I could revert the throttling in search-pv; then the only changes would be related to the change in subdomains and the additional endpoints. The endpoint names are now singular, patent instead of patents, but this is hidden from package users. There are already additional changes coming in search-pv that I was planning on including in a separate PR. The builds are taking around 5 minutes, so I assumed the time was spent running the test cases/calling the new version of the API.
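To illustrate the URL change (the old URL is the legacy one; the new one follows the pattern you can see in the example output further down):

# Legacy API: plural endpoint on the api subdomain
old_url <- "https://api.patentsview.org/patents/query"
# New API: singular endpoint on the search subdomain
new_url <- "https://search.patentsview.org/api/v1/patent/"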
OK, can you remove all file changes related to the non-core stuff I mentioned - e.g., all docs (assuming they're not required for a passing build), the cast_pv_data stuff, etc. - anything not related to the basic API client + fieldsdf creation. Also, you'll want to remove those skip_on_ci lines being added in your commits, as they are resulting in tests not being run during your builds. If it's easier, I can just remove what I'm talking about and we can take it from there.
Feel free to take anything out. I deleted my secret and reran the builds successfully. You were right about the tests not being run.
We may need to add sleeps to the tests if I take out the throttling. The API will send back a 429 if more than 45 requests per minute are made. The code I'll remove sleeps and then resends the query, so the tests never know they were throttled.
I added sleeps to test-search-pv.R after a build failed due to throttling, proving that the new API is being called on a push! The sleeps can be removed once the throttling code is added back to search-pv but at least the build is working again.
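The sleeps are nothing fancy, roughly this (the query, expectation, and timing are illustrative, not the exact test code):

# in tests/testthat/test-search-pv.R
test_that("search_pv hits the new API", {
  res <- search_pv(qry_funs$eq(patent_id = "10000000"))
  expect_true(nrow(res$data$patents) > 0)  # result shape assumed here
  Sys.sleep(2)  # crude rate limiting until the retry logic comes back
})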
I think we're close to a base change set. There are a few stray changes made when I ran styler that could be reverted but the rest looks good to me.
Hey Russ, I reworked your commits so that we only have changes related to a minimal working API client (e.g., no cast_pv_data stuff). See 9c0a7120274e57f9b0e102409254b12a2a57c696. This is the starting point for the PR. If you want, I can push this to your api-redesign-contribute branch if you give me contributor access. If you prefer a different method that's fine as well; I just want to make sure we're only introducing a minimal set of changes in this PR.
The build currently fails b/c we're running all examples (including those for cast_pv_data). I just commented out the run examples step. We'll add it back in the next PR.
Chris, I granted access; I've always wanted a contributor! I'm for anything that gets us to a place where you can begin.
thanks for this, Russ
K cool, thanks. I see I spelled the branch name wrong when I first pushed. I fixed that. I'll look over things today and will start leaving feedback.
Hey Russ, I didn't have much time to work on this today but I was able to get through the fieldsdf object creation code that you wrote in definition_extract.R. I'll push my code in a second here with some suggested refactorings. Here are a few notes/questions as well:
Can you add in the retry logic, so that we can start testing without having those sleeps in there?
0a2042f updated search-pv and the test for it. I also removed the sleeps in both places.
- Do we know why non-integer numbers seem to be referred to as "number" in the swagger doc?
It's a Swagger 3 (aka OpenAPI) thing. From https://swagger.io/docs/specification/data-models/data-types/#numbers:
OpenAPI has two numeric types, number and integer, where number includes both integer and floating-point numbers.
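So when data-raw builds fieldsdf, the data types would translate roughly like this (the R-side names are just how I'd map them, not anything the spec dictates):

swagger_to_r <- c(
  string  = "character",
  integer = "integer",
  number  = "numeric",  # "number" covers both integers and floating point
  boolean = "logical"
)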
- Related to the above, I saw the note you left re: strings vs full text. At least for now, I think we should reflect what the swagger doc says a field's data type is, with the idea being that it will eventually be correct there.
Swagger/OpenAPI only has strings; it doesn't have a concept of a full-text field (see https://swagger.io/docs/specification/data-models/data-types/). I added the hardcoding since the set of operators used matters to the API, _text_any vs _contains for example. I didn't see anything that reliably indicated that a field is full text.
I have an undelivered test to show that we have the string/full-text mappings correct, i.e., that they return results from a sample query (which is why I harvest the example field from the Swagger definition). The new version of the API doesn't throw an error if you use the wrong family of operators; it returns a 200 with no results if you use _contains when you should have used _text_any.
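The gist of that test is something like this (field name and search term are illustrative; the point is the silent empty result):

# A full-text field wants the _text_* operators; the string operators "work"
# but quietly match nothing.
right <- search_pv(qry_funs$text_any(patent_title = "television"))
wrong <- search_pv(qry_funs$contains(patent_title = "television"))
# right comes back with rows; wrong is a 200 with zero results rather than an error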
- I remember reading somewhere that additional endpoints are going to be added incrementally as the PatentsView team gets around to them. I think we should just include the endpoints that are currently supported. For example, a call to get_endpoints() should return only the currently active endpoints.
That's a good idea. locations isn't on the test server yet, and there is supposed to be an applications endpoint coming by the end of the year. get_fields() should be changed, and the test line excluding the locations endpoint could be deleted.
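Something along these lines is what I'd expect get_endpoints() to become (the endpoint names here are just illustrative; the real list would come from the swagger definition rather than being hardcoded):

get_endpoints <- function() {
  # only endpoints currently live on the new API; locations is left out until
  # it shows up on the test server
  c("assignee", "inventor", "patent",
    "patent/us_application_citation", "patent/us_patent_citation")
}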
- I removed the cast_to field in fieldsdf, as I think we want to keep fieldsdf as a user-facing object and the casting logic is more internal/related to implementation.
That's fine; it's a band-aid until the API team fixes the integer fields that are currently returned as strings.
I couldn't find anywhere that they specify the string/full-text fields, which is why I wrote the test case. Worse yet, in the new documentation they include screenshots (images) of the Swagger UI page, and the nested objects are not expanded. (See the top set of endpoints; the original endpoints are listed under Legacy Endpoints.)
As an important aside, I just noticed this on the new documentation page:
Our legacy API is set to be decommissioned at the end of 2022.
My test is test-alpha-types.R, but it's not tied to anything in the R directory. I guess it could be added to test-search-pv.R, but it's 121 lines long. Not sure if there could be a test-fieldsdf.R or if it's OK to break the x.R / test-x.R pattern.
Chris,
I was just reviewing the code you delivered and man, it is clear that one of us is way better than the other at this. Thanks for the clean up and simplification.
There is an issue in that the Swagger/OpenAPI specification is written from the returned-data point of view; the API team just defines all of the endpoints' input parameters as strings. In a quirk of the new version of the API, there are a handful of fields that are requested with an _id in the name but come back without it. For example, assignee_id comes back as assignee. So in the API team's Swagger definition the field is assignee (the way it comes back), while in my copy it's assignee_id (the way the R package would need to send it in the query and/or fields parameters). In other words, you would have to add more of your magic in fieldsdf.R to produce the csv from the API team's definition.
query <- qry_funs$eq(assignees_at_grant.assignee_id = "35")
example <- search_pv(query = query, fields = c("assignees_at_grant.assignee_id"))
head(example$data$patents$assignees_at_grant, 1)
# [[1]]
# assignee
# 1 https://search.patentsview.org/api/v1/assignee/35/
I'll take a look at throttling stuff and search-pv tonight.
- Re: how we'll distinguish between strings/fulltext, what are your thoughts on using the field update table (https://s3.amazonaws.com/data.patentsview.org/documents/API_Field_Statistics.xlsx), at least as a short-term solution?
I don't think they've put any effort into updating the spreadsheet. The last four tabs look reasonable (nber, uspc, and the two citations), but the rest have problems: a lot of the discontinued = FALSE rows have actually been discontinued. Also, most of the entity_names (groups) seem to be wrong.
- Re: the id stuff, how many fields have this issue, and do you know if it's intended behavior from the PatentsView team's perspective? If it's not a bug and they want to keep it like that, then yeah, I'd say we should add in logic to add those _ids to the relevant fields in fieldsdf.
It's probably on the order of a dozen fields. I'd guess it's intentional, but there hasn't been any dialog with them. I reported a bunch of issues but they were never acknowledged.
I'll take a look at throttling stuff and search-pv tonight.
Great, thanks. In search-pv we just send requests and sleep if the API sends back a throttle response. It's a 429 with a header saying how many seconds to wait before requests will be allowed again. The request is resent after the sleep.
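Roughly, the retry looks like this (a simplified sketch using httr, not the exact code in search-pv.R):

throttled_request <- function(url) {
  repeat {
    resp <- httr::GET(url)
    if (httr::status_code(resp) != 429) return(resp)
    wait <- httr::headers(resp)[["retry-after"]]
    wait <- if (is.null(wait)) 60 else as.numeric(wait)  # fall back if the header is missing
    Sys.sleep(wait)  # wait out the throttle, then resend the same request
  }
}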
The test checks that a bulk request's reply matches a series of individual requests that should get throttled. In theory you could put the build jobs back to being in parallel and they should work. They'd throttle each other since they're all using the same API key but they should all succeed. It would be interesting to try at least once.
The patentsview API team has announced that they will be making changes to the API. This is the first of a series of PRs that will make the necessary changes to the R package. The API team is not done yet, but work on the R package can begin. The new and original versions of the API will coexist for a while to give users time to adapt; the endpoints' subdomains are distinct to allow this.
api-redesign.md has a link to the change announcement and lays out the effects on the R package. The "R Package Design Choices" section chronicles the decisions I made along the way and what code I changed.
Key changes with further details in api-redesign.md
Additional changes that will impact users, with further details in api-redesign.md
It would have been nicer to include less in this PR, but this is the MVP for the new API; leaving anything out wouldn't produce a functioning package. It's the minimal search-pv.R and its test, plus anything they need.