ip-tools / uspto-opendata-python

A client library for accessing the USPTO Open Data APIs, written in Python.
https://docs.ip-tools.org/uspto-opendata-python/
MIT License

Synchronously download documents for multiple patent numbers #7

Closed rahul-gj closed 5 years ago

rahul-gj commented 5 years ago

I would like to know if I can download documents for a list of patent numbers or application numbers in synchronous mode. I can do that on https://ped.uspto.gov/peds/ by giving comma-separated values like '6583088, 6875727, 8697602, 6331531, 6274350, 10112906, 9491944, 9504251, 9137998'

I am asking because, as far as I can tell from my tests, this is a constant-time operation: whether you request one number or 300, the request takes almost the same time to complete.

Something like:

from uspto.peds.client import UsptoPatentExaminationDataSystemClient
client = UsptoPatentExaminationDataSystemClient()

client.download_document(
    type='patent',
    numbers='6583088, 6875727, 8697602, 6331531, 6274350, 10112906, 9491944, 9504251, 9137998', # or list
)

amotl commented 5 years ago

Thanks Rahul, this sounds perfectly reasonable. However, I will have to immerse myself in the topic again to find out how to make this possible.

If you see a way to solve this, please let me know. Pull requests are welcome, too. Otherwise, please stay tuned or ping me again if you don't hear back from me.

With kind regards, Andreas.

amotl commented 5 years ago

Dear Rahul,

after a quick look at this, I noticed that the Patent Number field already accepts such a comma-separated list of numbers.

I have to admit that I wasn't aware of that, so thanks for letting me know.

On the other hand, there's a remark in the footer area of the results page like

The Patent Examination Data system (PEDs) shows the first 20 results in the dataset. To see more results, click the "Request Download" link.

Based on this statement, I conclude that submitting 300 numbers at once would not be possible there and that the process would have to be chunked appropriately?
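To illustrate what such chunking could look like, here is a minimal sketch. It is not part of this library: the helper name `chunked` is hypothetical, and the batch size is only an assumption (the footer remark mentions 20 results; a smaller size is used below so the split is visible with the nine numbers from the original request).

```python
from typing import Iterator, List, Sequence


def chunked(numbers: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` numbers (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(numbers), size):
        yield list(numbers[start:start + size])


# The nine patent numbers from the original request.
numbers = [
    "6583088", "6875727", "8697602", "6331531", "6274350",
    "10112906", "9491944", "9504251", "9137998",
]

# Assumption: a batch size of 4 for demonstration; the results page
# mentions 20 hits, so 20 would be the natural value in practice.
batches = list(chunked(numbers, size=4))
for batch in batches:
    print(batch)
```

Each batch could then be submitted as a separate request, so that no single request exceeds the number of hits returned at once.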

Thanks for your feedback already.

With kind regards, Andreas.

amotl commented 5 years ago

Dear Rahul,

the unassuming and quick fix f07ebefa makes uspto-search work again, which might just be what you wanted to achieve already. So, you might want to upgrade to uspto-opendata-python 0.8.3 and check one of the following examples.

Examples

Command line usage

uspto-peds search 'patentNumber:(6583088 6875727 8697602)'

API usage

from uspto.peds.client import UsptoPatentExaminationDataSystemClient
client = UsptoPatentExaminationDataSystemClient()
client.search('patentNumber:(6583088 6875727 8697602)')
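If the numbers are at hand as a comma-separated string or a list, the query expression used above can be assembled programmatically. This is only a sketch: `build_patent_number_query` is a hypothetical helper and not part of this library, and it assumes the `patentNumber:(...)` syntax with space-separated values shown in the examples.

```python
from typing import Iterable


def build_patent_number_query(numbers: Iterable[str]) -> str:
    """Build a 'patentNumber:(a b c)' expression (hypothetical helper)."""
    # Strip surrounding whitespace and drop empty entries.
    cleaned = [str(number).strip() for number in numbers]
    cleaned = [number for number in cleaned if number]
    if not cleaned:
        raise ValueError("at least one patent number is required")
    return "patentNumber:({})".format(" ".join(cleaned))


raw = "6583088, 6875727, 8697602"
expression = build_patent_number_query(raw.split(","))
print(expression)  # patentNumber:(6583088 6875727 8697602)
```

The resulting string can then be passed to `client.search(expression)` or to `uspto-peds search` as in the examples above.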

Thanks again for pointing out this issue, which was surprisingly easy to resolve. Please let us know if this fulfills your needs.

With kind regards, Andreas.

rahul-gj commented 5 years ago

Sure, I will do. Thanks for the quick fix.

rahul-gj commented 5 years ago

I have tested the new update. The data given by a manual search is a zip archive with year-wise JSON files, which is different from the data given by client.search, but it's very fast and extensive. So for now it's fine. I will update the details soon.

I tried it and think that this is a quick fix and not the solution to the actual issue. The quick fix only solves the first step, that is, giving the first 20 results. We should work towards getting all data, i.e. the package result, which is similar to download_document.

amotl commented 5 years ago

The data given by a manual search is a zip archive with year-wise JSON files, which is different from the data given by client.search, but it's very fast and extensive.

Yeah, the canonical download variant is "packaging", also known as "Zip Download", which is where most of the automation work of this library has gone. As this is done asynchronously from the perspective of the client, this library has to poll the archive resource until it becomes available. Also, baking the archive takes some time on the server side.
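For illustration, the general polling pattern looks roughly like the following sketch. This is not the actual implementation in this library; `poll_until_ready` and its parameters are hypothetical names, and the readiness check is passed in as a plain callable so that the example stays self-contained.

```python
import time
from typing import Callable, Optional, Tuple


def poll_until_ready(
    is_ready: Callable[[], Optional[str]],
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Call `is_ready` until it returns a truthy value or `timeout` elapses.

    Returns the truthy result together with the number of attempts made.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        result = is_ready()
        if result:
            return result, attempts
        if time.monotonic() >= deadline:
            raise TimeoutError("archive not ready after {} attempts".format(attempts))
        sleep(interval)


# Simulated server: the archive becomes available on the third check.
states = iter([None, None, "archive.zip"])
result, attempts = poll_until_ready(lambda: next(states), interval=0.0)
print(result, attempts)
```

In a real client, the callable would ask the server whether the requested archive is available yet, and the interval would be chosen so that the server is not queried too aggressively.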

The other method unlocked again through "search" is the direct JSON response offered by the API when searching with criteria. This is probably the data which is also displayed in the inline results lists (probably up to 20 hits only).

Both JSON output formats are completely different in their structure. Also, the "direct access" through the search response JSON is obviously also not available in XML format.

We should work towards getting all data, i.e. the package result

This has always been implemented as it was the main purpose of this library.

not the solution to the actual issue

I totally see your point. So, a) it is good that we fixed the issue with "search", but b) I don't see any obvious difference between what a human does when clicking "Download package" and the same thing implemented in Python via the "packaging" route.

Now, I'm feeling a bit lost here and also a bit sad that it's not obvious to me what your expectations are. Maybe you can help me clarify what this library does and how it could do better?

Thanks already, Andreas.

rahul-gj commented 5 years ago

I think this is great and sufficient. Thanks for the help. I really appreciate it.

amotl commented 5 years ago

Dear Rahul,

thanks for your feedback.

If you think this will be fine, then let's close this. Otherwise, I would really be interested in improving the performance of the downloading process. However, I currently don't see where exactly this could happen, i.e. how manual interaction would be faster at any point compared to the automated packaging and download process this library already implements.

With kind regards, Andreas.

rahul-gj commented 5 years ago

This can be closed as I can now do the searching and packaging after the update.

For my limited use, I have trimmed this library down in a fork. See https://github.com/rahul-gj/uspto-peds-python. Please check whether I have breached any license or anything else.

Thanks

amotl commented 5 years ago

Dear Rahul,

I see what you have been aiming at. I think it is possible to have both variants implemented through code from the same repository and Python package and I might dedicate some time to merge your changes back to mainline in one way or another.

Until then, it is perfectly reasonable to have forks around like you did with your derivative uspto-peds-python. So, let's close this and track the note about the reintegration using a different ticket.

Thanks again for your valuable input. It is good to see that a bare-bones implementation of the PEDS Search API client wrapper, based purely on the requests and BeautifulSoup packages, is exactly what you have been aiming at, and that you have been able to build it from parts of this library.

All the best and with kind regards, Andreas.

amotl commented 5 years ago

So, let's close this and track the note about the reintegration using a different ticket.

I just created #9 to be able to follow up on this later. Thanks again!