ropensci / patentsview

An R client to the PatentsView API
https://docs.ropensci.org/patentsview
Other
31 stars 9 forks source link

mvp for the new api version #27

Closed mustberuss closed 1 year ago

mustberuss commented 1 year ago

The patentsview API team has announced that they will be making changes. This is the first of a series of PRs that will begin to make the necessary changes to the R package. The API team is not done making the changes but changes to the R package can begin. The new and original versions of the API will coexist for a while to give users time to adapt. The endpoint's subdomains are distinct to allow this.

api-redesign.md has a link to the change announcement and lays out the affects on the R package. The "R Package Design Choices" section chronicles the decisions I made along the way and what code I changed.

Key changes with further details in api-redesign.md

  1. dplyr will be added as a dependency for use in some of the new tests not in this PR
  2. The Swagger definition is parsed rather than the html of the documentation to produce data-raw/fieldsdf.csv, which now has different columns than in the original version of the R package. data-raw/definition_extract.R will be in a later PR
  3. The API will throttle requests to 45 requests per minute. The R package will sleep when throttled and then resend the request, seemlessly to the user
  4. Casting is a bit different in the new version of the api- some but not all fields are returned in an appropriate native type
  5. An API key is required- the build relies on a repo secret PATENTSVIEW_API_KEY: ${{ secrets.PATENTSVIEW_API_KEY }} and the same as an environmental variable when running the updated R package

Additional changes that will impact users, with further details in api-redesign.md

  1. The overall paged result set size decreases to 10,000 from 100,000 and the per call request size decreases from 10,000 to 1,000
  2. Multiple calls may have to be made to retrieve the same data as was possible to retrieve in a single call from the original version of the API. This change is nicely reflected in the new version of the top assignees vignette not in this PR.
  3. Some field names have changed, some fields have changed from strings to full text and a decent amount of fields have gone away

It would have been nicer to include less in this PR but this is the MVP for the new api, leaving out anything wouldn't produce a functioning package. It's the minimal search-pv.R and test and anything that they need.

crew102 commented 1 year ago

Hi Russ -

First, thanks for all the work on this. Do you mind if we break things up a bit more in the PRs? I know you mentioned that it was difficult b/c there were some dependencies that kinda sprawled/required you to pull in the various pieces, but I'm hoping the changes I made to the target branch (api-redesign) will make it a bit more clear how we can break things up. Here's what I'm thinking:

mustberuss commented 1 year ago

Chris,

I have additional tests for the new functionality in search-pv but I created separate files for them - I can append them to test-search-pv.R and push that here.

crew102 commented 1 year ago

Hi Russ,

I was hoping we could put off casting stuff (including any tests involving cast_pv_data) in a second PR. I introduced the various calls to skip() in the tests file in my latest commit, so that only search_pv functionality is tested (e.g., https://github.com/ropensci/patentsview/commit/dbe2fa53bfcd00cdd71cbd9e37b9d776dbe486db#diff-bbe8caac4540139b923f518083cfc89e178b6a0d2201351b6eb68e1d0a08312e). You'll see that previously there was a call to skip_on_ci() in those files too, as well as in the search pv test file. I removed that call in the search pv file, so that the calls to the API are actually being made during the builds. I believe your builds were running with that skip line included, hence they would not have been going against the actual API (which is what we were doing previously, b/c I didn't want to pummel the API during builds, but now I'm thinking it will be worth it).

Yeah, I think if you just include whatever script you had to create the new fieldsdf object in data-raw, that will be fine. The convention as far as I know it is for data objects to be populated using scripts like this in a totally offline manner (i.e., not during builds), but we would like them to be reproducible by having that script in the data-raw folder.

I think we should put off additional functionality for search_pv (as well as their tests) in a separate PR. Does that work for you?

mustberuss commented 1 year ago

I pushed data-raw/definition_extract.R There's a lot there that I was hoping to save for a later PR!

I could revert throttling in search-pv, then the only changes would be related to the change in subdomains and additional endpoints, they're now singular, patent instead of patents, but this is hidden from package users. There are already additional changes coming in search-pv that I was planning on including in a separate PR. The builds are taking around 5 minutes so I assumed the time was taken running the test cases/calling the new version of the API.

crew102 commented 1 year ago

OK can you remove all file changes related to the non-core stuff I mentioned - e.g., all docs (assuming they're not required for a passing build), cast pv data stuff, etc...anything not related to basic api client + fields df creation. Also you'll want to remove those skip_on_ci lines being added in your commits, as they are resulting in tests not being run during your builds. If it's easier, I can just remove what I'm talking about and we can take it from there.

mustberuss commented 1 year ago

Feel free to take anything out. I deleted my secret and reran the builds successfully. You were right about the tests not being run.

We may need to add sleeps to the tests if I take out throttling. The API will send back a 429 if more than 45 requests per minute are made. The code I'll remove would sleep and then resend the query so the tests never knew they were throttled.

mustberuss commented 1 year ago

I added sleeps to test-search-pv.R after a build failed due to throttling, proving that the new API is being called on a push! The sleeps can be removed once the throttling code is added back to search-pv but at least the build is working again.

I think we're close to a base change set. There are a few stray changes made when I ran styler that could be reverted but the rest looks good to me.

crew102 commented 1 year ago

Hey Russ, I reworked your commits so that we only have changes related to a minimal working API client (e.g., no cast pv data stuff). See 9c0a7120274e57f9b0e102409254b12a2a57c696. This is the starting point for the PR. If you want, I can push this to your api-redesign-contribute branch if you give me contribute access. If you prefer a different method that's fine as well, I just want to make sure we're only introducing a minimal set of changes in this PR.

The build currently fails b/c we're running all examples (including those for cast_pv_data). I just commented out the run examples step. We'll add it back in the next PR.

mustberuss commented 1 year ago

Chris, I granted access, I've always wanted a contributor! I'm for anything that gets us to a place where you can begin.

thanks for this, Russ

crew102 commented 1 year ago

K cool, thanks. I see I spelled the branch name wrong when I first pushed. I fixed that. I'll look over things today and will start leaving feedback.

crew102 commented 1 year ago

Hey Russ, I didn't have much time to work on this today but I was able to get through the fieldsdf object creation code that you wrote in definition_extract.R. I'll push my code in a second here with some suggested refactorings. Here are a few notes/questions as well:

crew102 commented 1 year ago

Can you add in the retry code logic so that we start testing without having those sleeps in there?

mustberuss commented 1 year ago

Can you add in the retry code logic so that we start testing without having those sleeps in there?

0a2042f updated search-pv and the test for it. I also removed the sleeps in both places.

mustberuss commented 1 year ago
  • Do we know why non-integer numbers seem to be referred to as "number" in the swagger doc?

It's a Swagger 3 (aka OpenAPI) thing. From https://swagger.io/docs/specification/data-models/data-types/#numbers
OpenAPI has two numeric types, number and integer, where number includes both integer and floating-point numbers.

  • Related to the above, I saw the note you left re: strings vs full text. At least for now, I think we should reflect what the swagger doc is saying a field's data type is, with the idea being that it will eventually be correct there.

Swagger/OpenAPI only has strings, it doesn't have a concept of a full text field. See https://swagger.io/docs/specification/data-models/data-types/ I added the hardcoding since the set of operators used matters to the API, _text_any vs _contains for example. I didn't see anything that reliably indicated that the field is full text.

I have an undelivered test to show that we have the string/full text mappings correct, ie that they return results from a sample query (which is why I harvest the example field from the Swagger definition). The new version of the API doesn't throw an error if you use the wrong family of operators, it returns a 200 with no results if you used _contains when you should have used _text_any

  • I remember reading somewhere that additional endpoints are going to be added incrementally as the PatentsView team gets around to them. I think we should just include the endpoints that are currently supported. For example, a call to get_endpoints() should only return all currently active endpoints.

That's a good idea. locations isn't on the test server yet and there is supposed to be an applications endpoint coming by the end of the year. get-fields should be changed and the tests' line excluding the locations endpoint could be deleted.

  • I removed the cast_to field in fieldsdf, as I think we want to keep fields df as a user-facing object and the casting logic is more internal/related to implementation.

That's fine, it's a band aid until the API team fixes the integer fields that are currently returned as strings.

crew102 commented 1 year ago
mustberuss commented 1 year ago

I couldn't find anywhere that they specify the string/full text fields which is why I wrote the text case. Worse yet, in the new documentation they include screenshots (images) of the Swagger UI page and the nested objects are not expanded. (See the top set of endpoints, the original endpoints are listed under Legacy Endpoints)

As an important aside, I just noticed this on the new documentation page:

Our legacy API is set to be decommissioned at the end of 2022.

My test is test-alpha-types.R but it's not tied to anything in R directory. I guess it could be added to test-search-pv but it's 121 lines long. Not sure if there could be a test-fieldsdf.R or if it's ok to break the x.R / test-x.R pattern.

mustberuss commented 1 year ago

Chris,

I was just reviewing the code you delivered and man, it is clear that one of us is way better than the other at this. Thanks for the clean up and simplification.

There is an issue in that the Swagger/OpenAPI specification is written from the returned-data point of view, the API team just defines all of the endpoint's input parameters as strings. In a quirk of the new version of the API, there are a handful of fields that are requested with an _id in the name but they come back without it. Ex. assignee_id comes back as assignee. So in the API team's Swaggger definition, the field is assignee (the way it comes back) while in my copy it's assignee_id (the way the R package would need to sent it in the query and/or fields parameters). In other words, you would have to add more of your magic in fieldsdf.R to produce the csv from the API team's definition.

query <- qry_funs$eq(assignees_at_grant.assignee_id = "35")
example <-search_pv(query=query, fields=c("assignees_at_grant.assignee_id"))

head(example$data$patents$assignees_at_grant, 1)
# [[1]]
#                                            assignee
# 1 https://search.patentsview.org/api/v1/assignee/35/
crew102 commented 1 year ago

I'll take a look at throttling stuff and search-pv tonight.

mustberuss commented 1 year ago

I don't think they've put any effort into updating the spreadsheet. The last four tabs look reasonable (nber, uspc, and the two citations) but the rest have problems with a lot of the discontinued = FALSE rows actually have been discontinued. Also most of the entity_names (groups) seem to be wrong.

  • Re: id stuff, how many fields have this issue, and do you know if it's intended behavior from the PatentsView teams' perspective? If it's not a bug and they want to keep it like that, then yeah, I'd say we should add in logic to add those _ids to the relevant fields in fieldsdf.

It's probably on the order of a dozen fields. I'd guess it's intentional but there hasn't be any dialog with them. I reported a bunch of issues but they were never acknowledged.

I'll take a look at throttling stuff and search-pv tonight.

Great, thanks. In search-pv we just send requests and sleep if the API sends back a throttle response. It's a 429 with a header saying how many seconds to wait before requests will be allowed again. The request is resent after the sleep.

The test checks that a bulk request's reply matches a series of individual requests that should get throttled. In theory you could put the build jobs back to being in parallel and they should work. They'd throttle each other since they're all using the same API key but they should all succeed. It would be interesting to try at least once.