mvp for the new api version

mustberuss commented 1 year ago

The patentsview API team has announced that they will be making changes. This is the first of a series of PRs that will begin to make the necessary changes to the R package. The API team is not done making the changes but changes to the R package can begin. The new and original versions of the API will coexist for a while to give users time to adapt. The endpoint's subdomains are distinct to allow this.

api-redesign.md has a link to the change announcement and lays out the affects on the R package. The "R Package Design Choices" section chronicles the decisions I made along the way and what code I changed.

Key changes with further details in api-redesign.md

dplyr will be added as a dependency for use in some of the new tests not in this PR
The Swagger definition is parsed rather than the html of the documentation to produce data-raw/fieldsdf.csv, which now has different columns than in the original version of the R package. data-raw/definition_extract.R will be in a later PR
The API will throttle requests to 45 requests per minute. The R package will sleep when throttled and then resend the request, seemlessly to the user
Casting is a bit different in the new version of the api- some but not all fields are returned in an appropriate native type
An API key is required- the build relies on a repo secret PATENTSVIEW_API_KEY: ${{ secrets.PATENTSVIEW_API_KEY }} and the same as an environmental variable when running the updated R package

Additional changes that will impact users, with further details in api-redesign.md

The overall paged result set size decreases to 10,000 from 100,000 and the per call request size decreases from 10,000 to 1,000
Multiple calls may have to be made to retrieve the same data as was possible to retrieve in a single call from the original version of the API. This change is nicely reflected in the new version of the top assignees vignette not in this PR.
Some field names have changed, some fields have changed from strings to full text and a decent amount of fields have gone away

It would have been nicer to include less in this PR but this is the MVP for the new api, leaving out anything wouldn't produce a functioning package. It's the minimal search-pv.R and test and anything that they need.

crew102 commented 1 year ago

Hi Russ -

First, thanks for all the work on this. Do you mind if we break things up a bit more in the PRs? I know you mentioned that it was difficult b/c there were some dependencies that kinda sprawled/required you to pull in the various pieces, but I'm hoping the changes I made to the target branch (api-redesign) will make it a bit more clear how we can break things up. Here's what I'm thinking:

First, can we change the target branch from master to api-redesign?
In the first PR, can we make changes to search-pv, any utils files, and the data-raw R files (and subsequent fieldsdf R object) so that we introduce just a minimal client into things (e.g., no need to have the validate args stuff, check query stuff, cast pv data, query dal, etc. - all that can go into a separate PR). I commented out a few lines of code in search-pv and added a few calls to skip(), so that we can test this basic functionality in the first PR. Also please update the search-pv test file so that we have tests that cover whatever minimal functionality we're porting over to the new API in the first PR.
Can you include whatever changes you made to the data-raw R file(s), so that they can be reproduced in this PR?
We weren't originally running tests or example code during builds (e.g., skip_on_ci() was set), but I've changed that moving forward, given just how many changes are going to be introduced and how ever-changing the API is. I think having the automated checks run on Github will be necessary moving forward. I'm running a build that's actually going against the API right now. Presumably it's not using the secret I introduced in the yml build file, as we're still using the old version of the API. When you open up a PR to the api-redesign branch that goes against the new API, it will probably try to use the envvar for the API key and hence will fail b/c of the secret sharing issue we talked about, but as long as you have a build on your own repo that passes, we should be good (and also there'll be a build when we merge into api-redesign that will presumably use my secret with no problem).

mustberuss commented 1 year ago

Chris,

I changed the destination, in all the excitement I forgot to do that.
Casting is quite different now, the current casting test would likely fail with the new version of the csv without the code change. We could remove the test and revert to the previous casting code for now if you'd like.
I can include the new scraper but it isn't run as part of the build process. The new version of the csv and everything it needs is already include in this PR, the updated fieldsdf.rda etc. It is running against the new version of the API, changes in search-pv use the new version's submain and endpoints. Not sure if it's using your key or mine. When submitting this I left checked "Allow edits and access to secrets by maintainers".

I have additional tests for the new functionality in search-pv but I created separate files for them - I can append them to test-search-pv.R and push that here.

crew102 commented 1 year ago

Hi Russ,

I was hoping we could put off casting stuff (including any tests involving cast_pv_data) in a second PR. I introduced the various calls to skip() in the tests file in my latest commit, so that only search_pv functionality is tested (e.g., https://github.com/ropensci/patentsview/commit/dbe2fa53bfcd00cdd71cbd9e37b9d776dbe486db#diff-bbe8caac4540139b923f518083cfc89e178b6a0d2201351b6eb68e1d0a08312e). You'll see that previously there was a call to skip_on_ci() in those files too, as well as in the search pv test file. I removed that call in the search pv file, so that the calls to the API are actually being made during the builds. I believe your builds were running with that skip line included, hence they would not have been going against the actual API (which is what we were doing previously, b/c I didn't want to pummel the API during builds, but now I'm thinking it will be worth it).

Yeah, I think if you just include whatever script you had to create the new fieldsdf object in data-raw, that will be fine. The convention as far as I know it is for data objects to be populated using scripts like this in a totally offline manner (i.e., not during builds), but we would like them to be reproducible by having that script in the data-raw folder.

I think we should put off additional functionality for search_pv (as well as their tests) in a separate PR. Does that work for you?

mustberuss commented 1 year ago

I pushed data-raw/definition_extract.R There's a lot there that I was hoping to save for a later PR!

I could revert throttling in search-pv, then the only changes would be related to the change in subdomains and additional endpoints, they're now singular, patent instead of patents, but this is hidden from package users. There are already additional changes coming in search-pv that I was planning on including in a separate PR. The builds are taking around 5 minutes so I assumed the time was taken running the test cases/calling the new version of the API.

crew102 commented 1 year ago

OK can you remove all file changes related to the non-core stuff I mentioned - e.g., all docs (assuming they're not required for a passing build), cast pv data stuff, etc...anything not related to basic api client + fields df creation. Also you'll want to remove those skip_on_ci lines being added in your commits, as they are resulting in tests not being run during your builds. If it's easier, I can just remove what I'm talking about and we can take it from there.

mustberuss commented 1 year ago

Feel free to take anything out. I deleted my secret and reran the builds successfully. You were right about the tests not being run.

We may need to add sleeps to the tests if I take out throttling. The API will send back a 429 if more than 45 requests per minute are made. The code I'll remove would sleep and then resend the query so the tests never knew they were throttled.

mustberuss commented 1 year ago

I added sleeps to test-search-pv.R after a build failed due to throttling, proving that the new API is being called on a push! The sleeps can be removed once the throttling code is added back to search-pv but at least the build is working again.

I think we're close to a base change set. There are a few stray changes made when I ran styler that could be reverted but the rest looks good to me.

crew102 commented 1 year ago

Hey Russ, I reworked your commits so that we only have changes related to a minimal working API client (e.g., no cast pv data stuff). See 9c0a7120274e57f9b0e102409254b12a2a57c696. This is the starting point for the PR. If you want, I can push this to your api-redesign-contribute branch if you give me contribute access. If you prefer a different method that's fine as well, I just want to make sure we're only introducing a minimal set of changes in this PR.

The build currently fails b/c we're running all examples (including those for cast_pv_data). I just commented out the run examples step. We'll add it back in the next PR.

mustberuss commented 1 year ago

Chris, I granted access, I've always wanted a contributor! I'm for anything that gets us to a place where you can begin.

thanks for this, Russ

crew102 commented 1 year ago

K cool, thanks. I see I spelled the branch name wrong when I first pushed. I fixed that. I'll look over things today and will start leaving feedback.

crew102 commented 1 year ago

Hey Russ, I didn't have much time to work on this today but I was able to get through the fieldsdf object creation code that you wrote in definition_extract.R. I'll push my code in a second here with some suggested refactorings. Here are a few notes/questions as well:

Do we know why non-integer numbers seem to be referred to as "number" in the swagger doc? From the PatentsView API docs, the data type of "number" doesn't seem to exist (https://patentsview.org/apis/api-query-language), whereas float does. I left the data type as number for now in fieldsdf, but we can always substitute number -> float as you had it in your code. Presumably the swagger docs will be updated in the future with this fixed.
Related to the above, I saw the note you left re: strings vs full text. At least for now, I think we should reflect what the swagger doc is saying a field's data type is, with the idea being that it will eventually be correct there. We can reevaluate that decision after the API and its docs are a bit more stable.
I remember reading somewhere that additional endpoints are going to be added incrementally as the PatentsView team gets around to them. I think we should just include the endpoints that are currently supported. For example, a call to get_endpoints() should only return all currently active endpoints.
I removed the cast_to field in fieldsdf, as I think we want to keep fields df as a user-facing object and the casting logic is more internal/related to implementation.

crew102 commented 1 year ago

Can you add in the retry code logic so that we start testing without having those sleeps in there?

mustberuss commented 1 year ago

Can you add in the retry code logic so that we start testing without having those sleeps in there?

0a2042f updated search-pv and the test for it. I also removed the sleeps in both places.

mustberuss commented 1 year ago

Do we know why non-integer numbers seem to be referred to as "number" in the swagger doc?

It's a Swagger 3 (aka OpenAPI) thing. From https://swagger.io/docs/specification/data-models/data-types/#numbers
OpenAPI has two numeric types, number and integer, where number includes both integer and floating-point numbers.

Related to the above, I saw the note you left re: strings vs full text. At least for now, I think we should reflect what the swagger doc is saying a field's data type is, with the idea being that it will eventually be correct there.

Swagger/OpenAPI only has strings, it doesn't have a concept of a full text field. See https://swagger.io/docs/specification/data-models/data-types/ I added the hardcoding since the set of operators used matters to the API, _text_any vs _contains for example. I didn't see anything that reliably indicated that the field is full text.

I have an undelivered test to show that we have the string/full text mappings correct, ie that they return results from a sample query (which is why I harvest the example field from the Swagger definition). The new version of the API doesn't throw an error if you use the wrong family of operators, it returns a 200 with no results if you used _contains when you should have used _text_any

I remember reading somewhere that additional endpoints are going to be added incrementally as the PatentsView team gets around to them. I think we should just include the endpoints that are currently supported. For example, a call to get_endpoints() should only return all currently active endpoints.

That's a good idea. locations isn't on the test server yet and there is supposed to be an applications endpoint coming by the end of the year. get-fields should be changed and the tests' line excluding the locations endpoint could be deleted.

I removed the cast_to field in fieldsdf, as I think we want to keep fields df as a user-facing object and the casting logic is more internal/related to implementation.

That's fine, it's a band aid until the API team fixes the integer fields that are currently returned as strings.

crew102 commented 1 year ago

Re: numbers vs floats, k so we'll just replace number with float in fieldsdf.R
Re: string vs fulltext, any sense for how patentsivew will specify which fields can be queried like strings vs full text, given it's not supported by swagger? Surely they're going to document it somewhere...? If not, I'd say let's add back the extraction of "example" from the swagger spec (including adding it to fieldsdf) and also add in the test you mentioned re: having the string/fulltext distinction correct for fields.

mustberuss commented 1 year ago

I couldn't find anywhere that they specify the string/full text fields which is why I wrote the text case. Worse yet, in the new documentation they include screenshots (images) of the Swagger UI page and the nested objects are not expanded. (See the top set of endpoints, the original endpoints are listed under Legacy Endpoints)

As an important aside, I just noticed this on the new documentation page:

Our legacy API is set to be decommissioned at the end of 2022.

My test is test-alpha-types.R but it's not tied to anything in R directory. I guess it could be added to test-search-pv but it's 121 lines long. Not sure if there could be a test-fieldsdf.R or if it's ok to break the x.R / test-x.R pattern.

mustberuss commented 1 year ago

Chris,

I was just reviewing the code you delivered and man, it is clear that one of us is way better than the other at this. Thanks for the clean up and simplification.

There is an issue in that the Swagger/OpenAPI specification is written from the returned-data point of view, the API team just defines all of the endpoint's input parameters as strings. In a quirk of the new version of the API, there are a handful of fields that are requested with an _id in the name but they come back without it. Ex. assignee_id comes back as assignee. So in the API team's Swaggger definition, the field is assignee (the way it comes back) while in my copy it's assignee_id (the way the R package would need to sent it in the query and/or fields parameters). In other words, you would have to add more of your magic in fieldsdf.R to produce the csv from the API team's definition.

query <- qry_funs$eq(assignees_at_grant.assignee_id = "35")
example <-search_pv(query=query, fields=c("assignees_at_grant.assignee_id"))

head(example$data$patents$assignees_at_grant, 1)
# [[1]]
#                                            assignee
# 1 https://search.patentsview.org/api/v1/assignee/35/

crew102 commented 1 year ago

Re: how we'll distinguish between strings/fulltext, what are your thoughts on using the field update table (https://s3.amazonaws.com/data.patentsview.org/documents/API_Field_Statistics.xlsx), at least as a short-term solution? A more robust solution that actually queries the API to see if a field is actually stored as fulltext vs string (i.e., your approach) would be good to include down the road, assuming the docs aren't kept up to date.
Re: code quality, thanks, and glad to hear you're open to some refactoring.. I generally just copy patterns used by developers I respect, nothing fancier than that :)
Re: id stuff, how many fields have this issue, and do you know if it's intended behavior from the PatentsView teams' perspective? If it's not a bug and they want to keep it like that, then yeah, I'd say we should add in logic to add those _ids to the relevant fields in fieldsdf.

I'll take a look at throttling stuff and search-pv tonight.

mustberuss commented 1 year ago

Re: how we'll distinguish between strings/fulltext, what are your thoughts on using the field update table (https://s3.amazonaws.com/data.patentsview.org/documents/API_Field_Statistics.xlsx), at least as a short-term solution?

I don't think they've put any effort into updating the spreadsheet. The last four tabs look reasonable (nber, uspc, and the two citations) but the rest have problems with a lot of the discontinued = FALSE rows actually have been discontinued. Also most of the entity_names (groups) seem to be wrong.

Re: id stuff, how many fields have this issue, and do you know if it's intended behavior from the PatentsView teams' perspective? If it's not a bug and they want to keep it like that, then yeah, I'd say we should add in logic to add those _ids to the relevant fields in fieldsdf.

It's probably on the order of a dozen fields. I'd guess it's intentional but there hasn't be any dialog with them. I reported a bunch of issues but they were never acknowledged.

I'll take a look at throttling stuff and search-pv tonight.

Great, thanks. In search-pv we just send requests and sleep if the API sends back a throttle response. It's a 429 with a header saying how many seconds to wait before requests will be allowed again. The request is resent after the sleep.

The test checks that a bulk request's reply matches a series of individual requests that should get throttled. In theory you could put the build jobs back to being in parallel and they should work. They'd throttle each other since they're all using the same API key but they should all succeed. It would be interesting to try at least once.

ropensci / patentsview

mvp for the new api version #27