pudo closed this issue 5 months ago
It can all go in a zip file, and we can host the zip file in a bucket or something. We have an example here which also downloads a zip file, unpacks the hundreds of files inside, and then crawls them.
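A minimal sketch of the download-and-unpack step. The bucket URL is hypothetical (the real location would be decided once the backup zip is uploaded), and the file layout inside the zip is an assumption:

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical bucket URL -- the real location would be decided
# once the backup zip is uploaded somewhere.
BACKUP_URL = "https://example-bucket.example.org/hetq-backup.zip"


def download(url: str) -> bytes:
    """Fetch the zipped site backup from the bucket."""
    with urlopen(url) as resp:
        return resp.read()


def unpack(zip_bytes: bytes, dest: Path) -> list[Path]:
    """Unpack the backup zip into dest and return the extracted paths."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return sorted(dest / name for name in zf.namelist())
```

The crawler would then iterate over the returned paths instead of hitting the live site.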
The data comes from an API, though the API does not seem to be documented. We can try to get as much JSON as possible by following the HOWTO: https://data.hetq.am/en/howto
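A rough sketch of what the JSON fetching could look like. Since the API is undocumented, the endpoint pattern and the `connections` key below are guesses and would need verifying against the HOWTO and the live site:

```python
import json
from urllib.request import urlopen

# Guessed endpoint pattern -- the API is undocumented, so this would
# need checking against https://data.hetq.am/en/howto and the live site.
API_BASE = "https://data.hetq.am/api"


def fetch_profile(person_id: int) -> dict:
    """Fetch one person profile as JSON (hypothetical endpoint)."""
    with urlopen(f"{API_BASE}/persons/{person_id}") as resp:
        return json.load(resp)


def connection_ids(profile: dict) -> list:
    """Pull the IDs of associated persons out of a profile.

    The "connections" key is an assumption about the response shape.
    """
    return [c["id"] for c in profile.get("connections", [])]
```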
Sounds like a good approach! It would be great to keep the backup script in the crawler directory, too.
Note that Armenian members of parliament are available here (in Armenian, Russian, English and French): http://parliament.am/deputies.php, which is one of the primary sources for data.hetq.am
(that said, Hetq has put them into JSON format for us and added associated persons, etc)
(well, not actually JSON for the profile pages... unless they are generated server-side)
Yeah, the associations are what we're after here. The parliamentarians themselves are probably outdated, and we should definitely not derive `role.pep` tags from this.
Ah, good to know - likewise, the assets etc. are all pulled from http://cpcarmenia.am/hy/declarations-registry/ and are also out of date.
I've downloaded all the person profiles and association graphs - they're here for the moment, I'll let you copy them to the OpenSanctions space: https://drive.google.com/file/d/1TZaLf0x8GeBxUrC50qGoGi5C9MNh0GSK/view?usp=sharing
@pudo do you mean we should (1) use `role.rca` and let the graph analyzer add `role.rca`, or (2) `role.pep`? I'd vote (1), but possibly mark every individual with a profile in this source as a `poi` in their own right.
This web site: https://data.hetq.am/ contains a lot of useful PEP info. We have reason to believe it may be taken down at some point in the future. Please figure out how to make a backup of the site using `wget`, and then parse the crawled copy instead of crawling the live site. The rationale here is that we want to be able to change our scraper (e.g. because our ontology changes) without losing access to the data.
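The backup step could be driven from the crawler directory roughly like this, assuming `wget` is on the PATH; the mirror directory name is hypothetical:

```python
import subprocess
from pathlib import Path

# Hypothetical location for the offline copy, kept next to the crawler.
MIRROR_DIR = Path("data/hetq-mirror")


def wget_args(url: str, dest: Path) -> list:
    """Build a wget invocation that mirrors the site for offline parsing."""
    return [
        "wget",
        "--mirror",            # recurse and keep server timestamps
        "--page-requisites",   # also grab assets the pages load
        "--convert-links",     # rewrite links so the copy works offline
        "--wait=1",            # be polite to the server
        "--directory-prefix", str(dest),
        url,
    ]


def make_backup() -> None:
    """Run the mirror; the scraper then parses MIRROR_DIR, not the live site."""
    subprocess.run(wget_args("https://data.hetq.am/", MIRROR_DIR), check=True)
```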
Alternatively, you can also crawl the web site data and turn it into a single large JSON file with all the profiles and connections, and save that somewhere. The data is static.
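The single-file alternative could look roughly like this, assuming one JSON document per crawled profile (the crawl directory layout is an assumption):

```python
import json
from pathlib import Path


def merge_profiles(profile_dir: Path, out_path: Path) -> int:
    """Combine every per-person JSON file into one large dump.

    Assumes one JSON document per file in profile_dir; returns the
    number of profiles written.
    """
    profiles = []
    for path in sorted(profile_dir.glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            profiles.append(json.load(fh))
    out_path.write_text(
        json.dumps({"profiles": profiles}, ensure_ascii=False),
        encoding="utf-8",
    )
    return len(profiles)
```

Since the data is static, this merge only ever needs to run once; the dump can then live in the same bucket as any other backup.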