opensanctions / crawler-planning

Task tracking for the crawlers we're working on
https://github.com/orgs/opensanctions/projects/2

Lithuanian portal for asset declarations #115

Open pudo opened 3 months ago

pudo commented 3 months ago

Data URL

e.g. https://pinreg.vtek.lt/app/pid-perziura/635133

Publisher

https://vtek.lt/en/home/ / LT Ethics commission

Publisher country/territory code

Lithuania

Type of data

PEPs (Politically Exposed Persons)

Coverage region

region:Europe

Can you tell us more?

JSON API backend, should be smooth to crawl.

This is a suggestion or request

kdeden commented 3 months ago

Taking a look at this. The /external/deklaracijos/{ID} endpoint is easy enough. But the more general /external/deklaracijos route, which returns the index of the most recently updated declarations, requires a captcha verification every few pages. I'm guessing you want to avoid checking every possible deklaracijos ID every time you refresh this data, and I'm also guessing you do not want to use a shady captcha-solving service... Do you have any recommendations?

pudo commented 3 months ago

We often use caching to mitigate these kinds of blocks: context.fetch_json(..., cache_days=20) will basically keep the request count low enough that the API doesn't start blocking after a while.
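To illustrate why a cache lifetime reduces upstream traffic, here is a minimal sketch of a time-based cache around a fetch function. It is not the zavod implementation: the class `CachedFetcher`, its `clock` parameter and the `upstream_calls` counter are hypothetical names made up for this example, and only the `cache_days` keyword mirrors the call mentioned above.

```python
import time
from typing import Any, Callable, Dict, Tuple

SECONDS_PER_DAY = 86400


class CachedFetcher:
    """Hypothetical stand-in for a caching fetch helper (assumption, not zavod code)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch  # function that performs the real request
        self._clock = clock  # injectable clock, so expiry can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.upstream_calls = 0  # how many requests actually reached the API

    def fetch_json(self, url: str, cache_days: int = 0) -> Any:
        now = self._clock()
        cached = self._store.get(url)
        # Serve from the cache while the stored entry is younger than cache_days.
        if cached is not None and now - cached[0] < cache_days * SECONDS_PER_DAY:
            return cached[1]
        # Otherwise hit the upstream API and remember the result.
        self.upstream_calls += 1
        data = self._fetch(url)
        self._store[url] = (now, data)
        return data
```

With a lifetime of 20 days, a daily crawl would re-request each URL at most about once every 20 runs, which is what lowers the request rate. It does nothing for a captcha that appears regardless of rate.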

kdeden commented 3 months ago

You're saying that caching the response will help us avoid captchas? Or that we can brute-force every possible deklaracijos ID and cache those results?

pudo commented 3 months ago

Sorry - no, I thought you were talking about rate limiting (which is what might trigger the captcha). If the captcha always shows up, then iterating over everything may be the only option. Can we define a reasonable range based on the IDs we do see?
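One way to picture the "range based on the IDs we do see" idea is the sketch below. Everything in it is an assumption for illustration: the helper names `id_range` and `crawl_range`, the `margin` padding, and the convention that the `fetch` callable returns `None` for an ID with no declaration. How a missing ID actually shows up on the /external/deklaracijos/{ID} endpoint has not been checked here.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def id_range(seen_ids: Iterable[int], margin: int = 1000) -> range:
    """Build a scan range around the observed IDs, padded by `margin` on both sides."""
    ids = list(seen_ids)
    if not ids:
        raise ValueError("need at least one observed ID to define a range")
    low = max(min(ids) - margin, 1)  # IDs are assumed to be positive integers
    high = max(ids) + margin
    return range(low, high + 1)


def crawl_range(
    ids: Iterable[int],
    fetch: Callable[[int], Optional[dict]],
) -> Iterator[Tuple[int, dict]]:
    """Yield (id, declaration) pairs, skipping IDs for which fetch returns None."""
    for decl_id in ids:
        data = fetch(decl_id)
        if data is None:
            # Assumed convention: None means "no declaration with this ID".
            continue
        yield decl_id, data
```

In a real crawler the `fetch` callable would wrap a cached request to the per-ID endpoint, so that repeated runs over the same range mostly hit the cache rather than the API.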

kdeden commented 3 months ago

I see. Yes it always appears, if not during the initial request then in the first pagination. And sometimes more than once per session. I will do the following:

It would be a shame to give up on getting the most recently updated declarations, it makes the operation much more elegant.

kdeden commented 3 months ago

@dhdaines happy to have you take over, feel free to take a look at my initial progress

dhdaines commented 3 months ago

> @dhdaines happy to have you take over, feel free to take a look at my initial progress

Oh! No, please continue, I saw that it was unassigned in the Projects page so I took it...

dhdaines commented 3 months ago

Ah, I see, it was moved back to "todo". Yes, I will look at your initial crawler then, thanks!

dhdaines commented 3 months ago

Actually, now that I look at this ... sorry ... no, I don't have time for this, I will de-assign it :(