opensanctions / opensanctions

An open database of international sanctions data, persons of interest and politically exposed persons
https://www.opensanctions.org
MIT License
483 stars 116 forks source link

Use identifier from wikidata for more entries #198

Open hejjoe opened 2 years ago

hejjoe commented 2 years ago

Esp. "Politically Exposed Persons" should be in the dataset of wikidata.org.

So it should be possible transfer urls like

https://www.opensanctions.org/entities/eu-cor-2014776/ into https://www.opensanctions.org/entities/Q807599

or https://www.opensanctions.org/entities/NK-4vLiLeY59bCZraMeeB94kv/ into https://www.opensanctions.org/entities/Q208242

hejjoe commented 2 years ago

(This might be related to opensanctions/nomenklatura#69.)

pudo commented 2 years ago

I'm working through this slowly, it's just a lot o people. If you want to play, I'm happy to help you set it up, but it does give you RSI...

hejjoe commented 2 years ago

Yes, I want to play. :)

VladimirAlexiev commented 2 years ago

I want to make an OpenSanctions property in WD to link your data from WD (from the other side), see https://twitter.com/valexiev1/status/1514544484671016964.

I did some count on the biggest collection https://opensanctions.org/datasets/default/:

$ wc -l targets.simple.csv
196347 
$ grep -cP '\bQ\d+\b' targets.simple.csv
112513
$ grep -cP '^"Q\d+"' openSanctions-default.csv
112503

Which 2 datasets are missing from this collection? (see opensanctions/opensanctions#199). Out of 204,469 total targets, this collection has 190,258 or 93%.

VladimirAlexiev commented 2 years ago

Which 2 datasets are missing from this collection?

OpenSanctions answered:

pudo commented 2 years ago

Regarding the OpenSanctions property for Wikidata, there's a bit of a discussion here already: https://www.wikidata.org/wiki/Wikidata:Property_proposal/OpenSanctions_ID - tl;dr: I think it makes more sense to add a property for OpenSanctions datasets, instead of the entity IDs.

VladimirAlexiev commented 2 years ago

The property was created as P10632:

VladimirAlexiev commented 2 years ago

I'll also make a WD Mix-n-Match catalog with the remaining entries

@hejjoe @pudo If you're not familiar with https://mix-n-match.toolforge.org/#/, it is an excellent crowd-sourcing system for coreferencing entities to WD. It has maybe 4k catalogs (dumps or crawls of external databases) and has produced maybe 5M matches. Eg see ULAN, one of the crucial person databases in the arts: https://mix-n-match.toolforge.org/#/catalog/27

Breakdown of the remaining OS entries per type:

csvtk freq -f schema openSanctions-no-WD.csv
schema,frequency
Airplane,269
Vessel,415
Company,2455
CryptoWallet,7457
Organization,4335
LegalEntity,4581
Person,57754

What is LegalEntity, is it "person or organization" but you don't know which? Some examples of LegalEntity

VladimirAlexiev commented 2 years ago

Eg if we look for Airplanes:

csvtk grep -f schema -p Airplane openSanctions-no-WD.csv|csvtk cut -f id,name,aliases,identifiers |head -20
id,name,aliases,identifiers
NK-27foGhde2c676yHk4Szsir,EX-301,,524;EX-301
NK-2Dm77ySwRh5BkuNFcKWd47,EP-MOF,,149;3149;EP-MOF
NK-2eLCszh7MxME37JiDXyhEu,MSN 550,,2-WGLP;550;MSN 550
NK-2oY2W562WbtAajN2bZsGrB,YV2964,,19000646;YV2964
NK-2q4YRztADsVsmYPEWLg4KU,EP-IBK,,671;EP-IBK
NK-2RcsM4umoACDaCAdAnWsg8,YV3033,,208B5140;YV3033
NK-2SkGGxw32P4qw6MnyXveKd,EP-ICE,,139;EP-ICE
NK-2XP4SpC6mZrEeRvNhaefCw,EP-CFM,,11394;EP-CFM
NK-3RboG5aQQdFse7aGYR6yvQ,YV3034,,208B5142;YV3034
NK-3s9cpVTo8iWdZ5v5nJEZZd,EP-ITE,,1424;EP-ITE
NK-3sqVBEnk3WWSZykfuJkhod,N488RC,,228;N488RC
NK-3sYzggNQbXwW7neddLEhaX,EP-MNV,,567;EP-MNV
NK-3TXRoCSy64gFjoXA9ztzgX,EP-MHF,,55;EP-MHF
NK-4ChbzQJpWyoyuSe7qkSmSk,YV2726,,136;YV2726
NK-4E2sKLvJYCFiDndeux4X3p,EP-MHA,,160;EP-MHA
NK-4eH4pto6F4XbDdVaUv9GtH,EP-IED,,345;EP-IED
NK-4hdbWLAqSnPHDWYzVvbFww,EP-ICF,,173;EP-ICF
NK-4jxr45VrPqoqRzyZbBStJt,UR-CKX,,131;3131;UR-CKX
NK-4mThxv7oHM2bJopxXhEWpb,YK-AGD,,1670;22360;YK-AGD

I checked the first 10 or so and they are not in WD, so IMHO are worth creating. However, I should work not with CSV but one of the 2 JSONs. Eg this plane https://www.opensanctions.org/entities/NK-27foGhde2c676yHk4Szsir/ is owned by MAHAN AIR, which is WD https://www.wikidata.org/wiki/Q1149762 (OS does not yet know that WD id, and MnM can find that coreference very easily).

We want the link airline-aircraft in WD. Which means we should import the remaining entities in tiers: first persons and orgs, then assets.

VladimirAlexiev commented 2 years ago

Country distribution:

csvtk freq -f countries openSanctions-no-WD.csv|sort -rn -t, -k2
,14988
ru,9062
tr,2540
cn,1917
ir,1853
pt,1721
by,1597
sv,1293
sy,1086
mx,1076
ua,1046
...

There are also many combinations, eg

af;bd;in;pk,1
af;az;ba;bd;ca;cn;gb;ge;hr;il;nl;pk;ps;ru;sa;sd;tj;us;ye,1
af;al;az;ba;bd;be;cn;er;et;fr;ge;in;iq;jo;lb;pk;ps;rs;ru;sl;so;sy;tr,1
ae;za,1

Russian entities:

csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk nrow
9448

csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk freq -f schema
Airplane,1
Vessel,40
Company,466
LegalEntity,438
Organization,580
Person,7923

Russian marine vessels (tankers, yachts etc):

csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk grep -f schema -p Vessel
eg
ofac-36401,Vessel,Lady Sevda,LADY SEVDA,,ru,,273342180;IMO 9683738;UBWL7,Program - SDN List - Block - Executive Order 14024;RUSSIA-EO14024,,,US OFAC Specially Designated Nationals (SDN) List;US Trade Consolidated Screening List (CSL),2022-04-06 18:24:16,2022-04-14 06:17:01

That's not on WD but is found by AIS: currently crossing from the Black Sea into Azov Sea:

pudo commented 2 years ago

FWIW this coverage should now be improved significantly, as I've been doing regular lookups. I just wish Wikidata had an API for search that let you search many aliases for an entity at the same time... that way the precision would be better yet.

VladimirAlexiev commented 2 years ago

@wetneb, can you give @pudo some advice?

VladimirAlexiev commented 2 years ago

https://github.com/opensanctions/opensanctions/issues/217#issuecomment-1225635310: Wikidata people are matching the Ukrainian list

wetneb commented 2 years ago

I am sadly also not aware of such an API on Wikidata's side. That's not something that the reconciliation service I run offers either.

VladimirAlexiev commented 1 year ago

Asked in the telegram Wikidata group: https://t.me/wikidatachat/1406