Open pudo opened 3 months ago
Taking a look at this. The external/deklaracijos/{ID} endpoint is easy enough. But the more general /external/deklaracijos route, which will get you the index of the most recently updated declarations, needs a captcha verification every few paginations.
I'm guessing you want to avoid checking every possible deklaracijos id every time you refresh this data, and I'm also guessing you do not want to use a shady captcha solving service...
Do you have any recommendations?
We often use caching to mitigate these kinds of blocks: context.fetch_json(...., cache_days=20) will basically bring the request count down to a level where the API doesn't start blocking after a while.
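To make the caching idea concrete, here is a minimal sketch of what a cache_days-style fetcher does. It is not the actual Context implementation: CachedFetcher, its in-memory store, the injectable clock and the requests counter are all hypothetical names made up for illustration. The only point is that a repeated URL within the cache window never reaches the API.

```python
import time
from typing import Any, Callable, Dict, Tuple

DAY_SECONDS = 86400


class CachedFetcher:
    """Hypothetical stand-in for a cache_days-style fetcher (not the real Context)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.fetch = fetch  # function that performs the real request
        self.clock = clock  # injectable clock, mostly useful for testing
        self.store: Dict[str, Tuple[float, Any]] = {}  # url -> (fetched_at, data)
        self.requests = 0  # number of requests that actually went out

    def fetch_json(self, url: str, cache_days: int = 0) -> Any:
        now = self.clock()
        hit = self.store.get(url)
        if hit is not None and now - hit[0] < cache_days * DAY_SECONDS:
            return hit[1]  # entry is still fresh, so no request is made
        self.requests += 1
        data = self.fetch(url)
        self.store[url] = (now, data)
        return data
```

With cache_days=20, a crawler that runs daily would only re-request a given URL about once every 20 days, which lowers the request rate but does nothing about a captcha that appears unconditionally.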
You're saying that caching the response will help us avoid captchas? Or that we can brute-force every possible deklaracijos ID and cache those results?
Sorry, no, I thought you were talking about rate limiting (which is what might trigger the captcha). If the captcha always shows up, then iterating over everything may be the only option. Can we define a reasonable range based on the IDs we do see?
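As a rough sketch of the "reasonable range" idea: take the IDs that have been observed, pad the range on both sides, and walk it, skipping IDs that return nothing. The names id_window and crawl_window, the margin default, and the convention that fetch_one returns None for a missing ID are all assumptions for illustration; nothing here reflects how the real endpoint reports a missing declaration.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple


def id_window(seen_ids: Iterable[int], margin: int = 1000) -> range:
    """Range covering all observed IDs, padded by `margin` on both sides."""
    ids = list(seen_ids)
    if not ids:
        raise ValueError("need at least one observed ID")
    low = max(1, min(ids) - margin)  # never go below ID 1
    high = max(ids) + margin
    return range(low, high + 1)


def crawl_window(
    fetch_one: Callable[[int], Optional[Any]],
    seen_ids: Iterable[int],
    margin: int = 1000,
) -> Iterator[Tuple[int, Any]]:
    """Yield (id, data) for every ID in the window that returns data."""
    for decl_id in id_window(seen_ids, margin):
        data = fetch_one(decl_id)  # assumed to return None when the ID does not exist
        if data is not None:
            yield decl_id, data
```

In practice fetch_one would wrap the cached per-ID request, so re-running the crawl over the same window mostly hits the cache and only new IDs at the upper end cost real requests.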
I see. Yes it always appears, if not during the initial request then in the first pagination. And sometimes more than once per session. I will do the following:
Context interface well with proxies?)
It would be a shame to give up on getting the most recently updated declarations; it makes the operation much more elegant.
@dhdaines happy to have you take over, feel free to take a look at my initial progress
Oh! No, please continue, I saw that it was unassigned in the Projects page so I took it...
Ah, I see, it was moved back to "todo". Yes, I will look at your initial crawler then, thanks!
Actually, now that I look at this ... sorry ... no, I don't have time for this, I will de-assign it :(
Data URL
e.g. https://pinreg.vtek.lt/app/pid-perziura/635133
Publisher
https://vtek.lt/en/home/ / LT Ethics commission
Publisher country/territory code
Lithuania
Type of data
PEPs (Politically Exposed Persons)
Coverage region
region:Europe
Can you tell us more?
JSON API backend, should be smooth to crawl.
This is a suggestion or request