Closed pudo closed 9 months ago
I took a look at the code, and there appear to be IDs in the relations_company
property that refer to companies that do not exist in the dataset. Here is the code I used:

    import requests

    # Browser-like User-Agent header sent with the request
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    r = requests.get("https://sanctions.nazk.gov.ua/api/v5/company/", headers=HEADERS)
    companies = r.json()

    # Set of every company ID present in the dataset (a set makes lookups fast)
    all_companies_ids = {c['company_id'] for c in companies['data']}

    for company in companies['data']:
        for related_company in company['relations_company']:
            # Report relations that point at a company missing from the dataset
            if related_company not in all_companies_ids:
                print("Company: {} is related to company {} that is not in the dataset".format(company["company_id"], related_company))

I think this will be a problem in this loop, because I don't think it checks whether the related company exists: https://github.com/opensanctions/opensanctions/blob/ca95c5960ab00802ffdcd18be6ba3b8b1fbe6716/datasets/ua/nabc_sanctions/crawler.py#L126
@pudo the IDs without an entity come from the related companies. It seems some of them are not in the sanctions company dataset, and if we query the API using their ID we get no results. How do you suggest we proceed in this case?
I have a theory that we could chase down: what if the missing IDs in the data refer to companies that are in the extra lists mentioned in #72? If so, we should add these endpoints but make sure to assign a different topic and not do a h.make_sanction on these extra lists...
Otherwise, perhaps we need to split the parsing of companies and of people into two phases each: first, parse the base data and collect all the existing IDs; then, in a second loop, create the relationships and emit only those where the IDs of both parties are known from the first phase. Would you be up for doing a PR, @atehe?
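To make the two-phase idea concrete, here is a minimal sketch. It is not the crawler's actual code: the function names are hypothetical, the sample records are made up, and the only things taken from this thread are the `company_id` and `relations_company` field names.

```python
def collect_known_ids(records):
    """Phase 1: gather the IDs of every company present in the base data."""
    return {record["company_id"] for record in records}


def build_relations(records, known_ids):
    """Phase 2: keep only relations whose two ends were both seen in phase 1."""
    kept, skipped = [], []
    for record in records:
        source = record["company_id"]
        # Treat a missing or empty relations list as "no relations"
        for target in record.get("relations_company") or []:
            pair = (source, target)
            if target in known_ids:
                kept.append(pair)
            else:
                skipped.append(pair)
    return kept, skipped


# Made-up sample data for illustration, not real API output.
sample = [
    {"company_id": 1, "relations_company": [2, 1818]},
    {"company_id": 2, "relations_company": []},
]

known = collect_known_ids(sample)
kept, skipped = build_relations(sample, known)
print("emit:", kept)
print("skip:", skipped)
```

Returning the skipped pairs as well means they can be logged, which would help when checking whether the missing IDs show up in the extra lists.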
sure @pudo, will do that
Here is an example of a relationship:
And we see "organization":["ua-nazk-company-1818"], but in this dataset we cannot find an entity with the ID ua-nazk-company-1818. The same situation occurs, for example, here:
This is quite a common situation in this dataset. My guess is that at some point the way entity IDs are generated changed, but the IDs specified inside the relationships ("director":["..."]) remained unchanged.
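One way to measure how common this is would be to scan an export for references that point at no entity. The sketch below rests on assumptions: entities are plain dicts with an "id" and a "properties" mapping of lists (as the "organization":["..."] snippet suggests), referenced IDs share the "ua-nazk-" prefix, and the function name and sample entities are made up for illustration.

```python
def find_dangling_references(entities, prefix="ua-nazk-"):
    """Return (entity_id, property, referenced_id) for references to unknown IDs."""
    known_ids = {entity["id"] for entity in entities}
    dangling = []
    for entity in entities:
        for prop, values in entity.get("properties", {}).items():
            for value in values:
                # Only values that look like entity IDs are treated as references
                looks_like_id = isinstance(value, str) and value.startswith(prefix)
                if looks_like_id and value not in known_ids:
                    dangling.append((entity["id"], prop, value))
    return dangling


# Hypothetical entities; only "ua-nazk-company-1818" comes from the thread.
sample_entities = [
    {
        "id": "ua-nazk-rel-1",
        "properties": {
            "organization": ["ua-nazk-company-1818"],
            "director": ["ua-nazk-person-7"],
        },
    },
    {"id": "ua-nazk-person-7", "properties": {"name": ["Example Person"]}},
]

dangling = find_dangling_references(sample_entities)
for entity_id, prop, missing in dangling:
    print(f"{entity_id}: {prop} -> {missing} (no such entity)")
```

Comparing the dangling IDs against the IDs the crawler currently generates would show whether they follow an older scheme, which is only a guess at this point.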