opensanctions / crawler-planning

Task tracking for the crawlers we're working on
https://github.com/orgs/opensanctions/projects/2

SPECIAL TASK: Hetq Armenia #8

Closed: pudo closed this issue 5 months ago

pudo commented 8 months ago

This website, https://data.hetq.am/, contains a lot of useful PEP info. We have reason to believe it may be taken down at some point in the future. Please figure out how to make a backup of the site using wget, and then parse the crawled copy instead of crawling the live site.

The rationale here is that we want to be able to change our scraper (e.g. because our ontology changes) without losing access to the data.

Alternatively, you can also crawl the web site data and turn it into a single large JSON file with all the profiles and connections and save that somewhere. The data is static.
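A minimal sketch of the single-file option, assuming the crawl has already been saved as one JSON file per profile or graph under a local directory. The function name `bundle_profiles` and the directory layout are illustrative assumptions, not something the site or the crawler defines:

```python
import json
from pathlib import Path


def bundle_profiles(src_dir: str, out_path: str) -> int:
    """Merge every *.json file under src_dir into one JSON document.

    Hypothetical helper: the directory layout is an assumption about how
    the crawled copy was saved. Write out_path outside src_dir so a later
    run does not pick up the bundle itself.
    """
    bundle = {}
    for path in sorted(Path(src_dir).rglob("*.json")):
        # Key each record by its path relative to the crawl root so the
        # original layout can be reconstructed later.
        key = path.relative_to(src_dir).as_posix()
        with open(path, encoding="utf-8") as fh:
            bundle[key] = json.load(fh)
    with open(out_path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Armenian text readable in the output.
        json.dump(bundle, fh, ensure_ascii=False, indent=2)
    return len(bundle)
```

Because the data is static, the bundle only needs to be built once; the scraper can then be changed and re-run against it without touching the live site.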

jbothma commented 7 months ago

it can all go in a zip file, and we can host the zip file in a bucket or something. We have an example here which also downloads a zip file, unpacks the hundreds of files, then crawls them.
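As a rough sketch of the zip approach, the crawler could read the JSON members straight out of a downloaded archive. This is not the existing example referred to above; `iter_archive_json` is a hypothetical helper, and it assumes the archive has already been fetched from the bucket to a local path:

```python
import json
import zipfile
from typing import Any, Iterator, Tuple


def iter_archive_json(zip_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (member name, parsed JSON) for each .json file in the archive.

    Hypothetical helper: assumes the backup was zipped as plain JSON files.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Sort the member names so repeated runs see the files in the same order.
        for name in sorted(archive.namelist()):
            if not name.endswith(".json"):
                continue
            with archive.open(name) as fh:
                # json.load accepts the binary handle and detects UTF-8.
                yield name, json.load(fh)
```

Reading members in place avoids unpacking hundreds of files to disk, though unpacking first works just as well if the crawler expects file paths.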

dhdaines commented 5 months ago

Data is from an API, though it does not seem to be documented. We can try to get as much JSON as possible following the HOWTO: https://data.hetq.am/en/howto

jbothma commented 5 months ago

sounds like a good approach! Would be great to keep the backup script in the crawler directory too

dhdaines commented 5 months ago

Note that Armenian members of parliament are available here (in Armenian, Russian, English and French): http://parliament.am/deputies.php which is one of the primary sources of data.hetq.am

dhdaines commented 5 months ago

(that said, Hetq has put them into JSON format for us and added associated persons, etc)

(well, not actually JSON for the profile pages... unless they are generated server-side)

pudo commented 5 months ago

Yeah the associations are what we're after here. The parliamentarians themselves are probably outdated and we should definitely not derive role.pep tags from this.

dhdaines commented 5 months ago

Ah, good to know - likewise, the assets, etc, are all pulled from http://cpcarmenia.am/hy/declarations-registry/ and also out of date.

I've downloaded all the person profiles and association graphs - they're here for the moment, I'll let you copy them to the OpenSanctions space: https://drive.google.com/file/d/1TZaLf0x8GeBxUrC50qGoGi5C9MNh0GSK/view?usp=sharing

jbothma commented 5 months ago

> Yeah the associations are what we're after here. The parliamentarians themselves are probably outdated and we should definitely not derive role.pep tags from this.

@pudo do you mean we should

  1. emit the parliamentarians as poi and relations without role.rca and let the graph analyzer add role.rca?
    1. should those who still meet the auto PEP criteria get role.pep?
  2. not add this directly to default, but rather to graph and enrich against it?
  3. something else?

pudo commented 5 months ago

I'd vote (1), but possibly mark every individual with a profile in this source as a poi in their own right.
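One way to picture that tagging rule is the sketch below, which uses plain dictionaries rather than the real crawler API. The function `tag_profile` and the field names `id`, `name`, `associates`, and `has_profile` are all hypothetical, chosen only to illustrate the decision:

```python
from typing import Any, Dict, List


def tag_profile(profile: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one source profile into tagged person records.

    Hypothetical sketch: the input shape and field names are assumptions.
    """
    # Everyone with their own profile in the source is a poi in their own
    # right. No role.pep is derived, since the source may be outdated.
    people = [{"id": profile["id"], "name": profile["name"], "topics": ["poi"]}]
    for assoc in profile.get("associates", []):
        # Associates get no role.rca here; that is left to the graph analyzer.
        # An associate who also has a profile of their own is still a poi.
        topics = ["poi"] if assoc.get("has_profile") else []
        people.append({"id": assoc["id"], "name": assoc["name"], "topics": topics})
    return people
```

The relation records themselves are omitted here for brevity; the point is only that neither `role.pep` nor `role.rca` is assigned at crawl time.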

jbothma commented 5 months ago

https://github.com/opensanctions/opensanctions/pull/786