opensanctions / crawler-planning

Task tracking for the crawlers we're working on
https://github.com/orgs/opensanctions/projects/2

Fix: Ukraine NAZK crawler #71

Closed: pudo closed this issue 9 months ago

pudo commented 9 months ago

Here is an example of a relationship:

{"id":"ua-nazk-02e55a811a7dc28dbbe7e72929044ce6457c9659","caption":"підконтрольна компанія","schema":"Directorship","properties":{"role":["підконтрольна компанія"],"director":["NK-kT9GX3TyiCN3dkJJ67cTUr"],"organization":["ua-nazk-company-1818"]},"referents":[],"datasets":["ua_nabc_sanctions"],"first_seen":"2023-04-20T10:53:16","last_seen":"2024-02-07T06:30:01","last_change":"2023-04-20T10:53:16","target":false}

Here we see "organization":["ua-nazk-company-1818"], but the dataset contains no entity with the ID ua-nazk-company-1818. The same is true here, for example:

{"id":"ua-nazk-05ec09a7291a641b1a03b40ad5df4fb147e804c8","caption":"підконтрольна компанія","schema":"Directorship","properties":{"organization":["ua-nazk-company-1770"],"director":["NK-kT9GX3TyiCN3dkJJ67cTUr"],"role":["підконтрольна компанія"]},"referents":[],"datasets":["ua_nabc_sanctions"],"first_seen":"2023-04-20T10:53:16","last_seen":"2024-02-07T06:30:01","last_change":"2023-04-20T10:53:16","target":false}

This is quite common in this dataset. My guess is that at some point the way entity IDs are generated changed, but the IDs specified inside the relationships ("director":["..."]) remained unchanged.
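To gauge how common this is, something like the following could be run over an export of the dataset. This is a minimal sketch, not part of the crawler: it assumes the export has been loaded as a list of entity dicts shaped like the examples above, it assumes `director` and `organization` are the only properties that hold entity IDs, and the helper name `find_dangling` is hypothetical. The second sample entity is a bare placeholder added only so that the director ID resolves.

```python
from typing import Iterable

# Assumption: in this sketch only these properties hold references to entities.
REFERENCE_PROPS = ("director", "organization")


def find_dangling(entities: Iterable[dict]) -> list[tuple[str, str, str]]:
    """Return (entity_id, property, missing_id) for references to unknown IDs."""
    entities = list(entities)
    # Every ID that actually exists in the export.
    known_ids = {entity["id"] for entity in entities}
    dangling = []
    for entity in entities:
        for prop in REFERENCE_PROPS:
            for ref in entity.get("properties", {}).get(prop, []):
                if ref not in known_ids:
                    dangling.append((entity["id"], prop, ref))
    return dangling


sample = [
    {
        "id": "ua-nazk-02e55a811a7dc28dbbe7e72929044ce6457c9659",
        "schema": "Directorship",
        "properties": {
            "role": ["підконтрольна компанія"],
            "director": ["NK-kT9GX3TyiCN3dkJJ67cTUr"],
            "organization": ["ua-nazk-company-1818"],
        },
    },
    # Placeholder entity (illustrative only) so the director reference resolves.
    {"id": "NK-kT9GX3TyiCN3dkJJ67cTUr", "properties": {}},
]

for entity_id, prop, missing in find_dangling(sample):
    print(f"{entity_id}: {prop} -> {missing} has no matching entity")
```

With the sample above, only the `organization` reference is reported, because no entity with the ID ua-nazk-company-1818 is present.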

bgmello commented 9 months ago

I took a look at the code, and there appear to be IDs in the relations_company property that point to companies that do not exist. Here is the code I used:

import requests

# Send a browser-like User-Agent with the request.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
r = requests.get("https://sanctions.nazk.gov.ua/api/v5/company/", headers=HEADERS, timeout=60)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

companies = r.json()
# A set makes the membership check in the loop below fast.
all_company_ids = {c["company_id"] for c in companies["data"]}

for company in companies["data"]:
    for related_company in company["relations_company"]:
        # Report every relation that points to a company missing from the list.
        if related_company not in all_company_ids:
            print("Company: {} is related to company {} that is not in the dataset".format(company["company_id"], related_company))

I think this will be a problem in this loop, because I don't think it checks whether the related company exists: https://github.com/opensanctions/opensanctions/blob/ca95c5960ab00802ffdcd18be6ba3b8b1fbe6716/datasets/ua/nabc_sanctions/crawler.py#L126

atehe commented 9 months ago

@pudo, the IDs without an entity come from related companies. It seems some of them are not in the sanctioned-company dataset, and if we query the API by their ID we get no results. How do you suggest we proceed in this case?

pudo commented 9 months ago

I have a theory that we could chase down: what if the missing IDs in the data refer to companies that are in the extra lists mentioned in #72? If so, we should add these endpoints, but make sure to assign a different topic and not call h.make_sanction on these extra lists...

Otherwise, perhaps we need to split the parsing of companies and of people into two phases each: first, parse the base data and collect all the existing IDs; second, loop again to create the relationships, emitting only those where the IDs of both parties are known from the first phase. Would you be up for doing a PR, @atehe?
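Roughly, the two-phase idea could look like the sketch below. It is not the actual crawler code: it works on plain dicts shaped like the API response used earlier in this thread (`company_id`, `relations_company`), the helper name `build_relationships` is hypothetical, and the sample IDs are made up for illustration. In the crawler, the "emit" step would create the relationship entity instead of appending to a list.

```python
def build_relationships(companies: list[dict]) -> tuple[list[tuple], list[tuple]]:
    """Return (emitted, skipped) lists of (company_id, related_id) pairs."""
    # Phase 1: parse the base data and collect every ID that really exists.
    known_ids = {company["company_id"] for company in companies}

    emitted, skipped = [], []
    # Phase 2: loop again and keep only relationships where both sides are known.
    for company in companies:
        for related_id in company.get("relations_company", []):
            pair = (company["company_id"], related_id)
            if related_id in known_ids:
                emitted.append(pair)
            else:
                # Dangling reference: record it for logging rather than emitting it.
                skipped.append(pair)
    return emitted, skipped


# Made-up sample data: company 1 points at company 2 (known) and 99 (unknown).
sample = [
    {"company_id": 1, "relations_company": [2, 99]},
    {"company_id": 2, "relations_company": []},
]

emitted, skipped = build_relationships(sample)
print("emitted:", emitted)
print("skipped:", skipped)
```

Keeping the skipped pairs around, rather than silently dropping them, makes it easier to check later whether the missing IDs show up in the extra lists.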

atehe commented 9 months ago

sure @pudo, will do that