Open hejjoe opened 2 years ago
(This might be related to opensanctions/nomenklatura#69.)
I'm working through this slowly, it's just a lot o people. If you want to play, I'm happy to help you set it up, but it does give you RSI...
Yes, I want to play. :)
I want to make an OpenSanctions property in WD to link your data from WD (from the other side), see https://twitter.com/valexiev1/status/1514544484671016964.
I did some count on the biggest collection https://opensanctions.org/datasets/default/:
$ wc -l targets.simple.csv
196347
$ grep -cP '\bQ\d+\b' targets.simple.csv
112513
$ grep -cP '^"Q\d+"' openSanctions-default.csv
112503
Which 2 datasets are missing from this collection? (see opensanctions/opensanctions#199). Out of 204,469 total targets, this collection has 190,258 or 93%.
Which 2 datasets are missing from this collection?
OpenSanctions answered:
Regarding the OpenSanctions property for Wikidata, there's a bit of a discussion here already: https://www.wikidata.org/wiki/Wikidata:Property_proposal/OpenSanctions_ID - tl;dr: I think it makes more sense to add a property for OpenSanctions datasets, instead of the entity IDs.
The property was created as P10632:
Q111291907
Q27867848
Q5460278
Q221029421 "NK-e4oEwQ5TacyJJNCmUNsonN" # the Q number seems too long
Q081061 "NK-fXGAk6CgeGkxJ26PoEPHHH" # Q numbers don't start with 0
I'll also make a WD Mix-n-Match catalog with the remaining entries
@hejjoe @pudo If you're not familiar with https://mix-n-match.toolforge.org/#/, it is an excellent crowd-sourcing system for coreferencing entities to WD. It has maybe 4k catalogs (dumps or crawls of external databases) and has produced maybe 5M matches. Eg see ULAN, one of the crucial person databases in the arts: https://mix-n-match.toolforge.org/#/catalog/27
Breakdown of the remaining OS entries per type:
csvtk freq -f schema openSanctions-no-WD.csv
schema,frequency
Airplane,269
Vessel,415
Company,2455
CryptoWallet,7457
Organization,4335
LegalEntity,4581
Person,57754
What is LegalEntity
, is it "person or organization" but you don't know which? Some examples of LegalEntity
Eg if we look for Airplanes:
csvtk grep -f schema -p Airplane openSanctions-no-WD.csv|csvtk cut -f id,name,aliases,identifiers |head -20
id,name,aliases,identifiers
NK-27foGhde2c676yHk4Szsir,EX-301,,524;EX-301
NK-2Dm77ySwRh5BkuNFcKWd47,EP-MOF,,149;3149;EP-MOF
NK-2eLCszh7MxME37JiDXyhEu,MSN 550,,2-WGLP;550;MSN 550
NK-2oY2W562WbtAajN2bZsGrB,YV2964,,19000646;YV2964
NK-2q4YRztADsVsmYPEWLg4KU,EP-IBK,,671;EP-IBK
NK-2RcsM4umoACDaCAdAnWsg8,YV3033,,208B5140;YV3033
NK-2SkGGxw32P4qw6MnyXveKd,EP-ICE,,139;EP-ICE
NK-2XP4SpC6mZrEeRvNhaefCw,EP-CFM,,11394;EP-CFM
NK-3RboG5aQQdFse7aGYR6yvQ,YV3034,,208B5142;YV3034
NK-3s9cpVTo8iWdZ5v5nJEZZd,EP-ITE,,1424;EP-ITE
NK-3sqVBEnk3WWSZykfuJkhod,N488RC,,228;N488RC
NK-3sYzggNQbXwW7neddLEhaX,EP-MNV,,567;EP-MNV
NK-3TXRoCSy64gFjoXA9ztzgX,EP-MHF,,55;EP-MHF
NK-4ChbzQJpWyoyuSe7qkSmSk,YV2726,,136;YV2726
NK-4E2sKLvJYCFiDndeux4X3p,EP-MHA,,160;EP-MHA
NK-4eH4pto6F4XbDdVaUv9GtH,EP-IED,,345;EP-IED
NK-4hdbWLAqSnPHDWYzVvbFww,EP-ICF,,173;EP-ICF
NK-4jxr45VrPqoqRzyZbBStJt,UR-CKX,,131;3131;UR-CKX
NK-4mThxv7oHM2bJopxXhEWpb,YK-AGD,,1670;22360;YK-AGD
I checked the first 10 or so and they are not in WD, so IMHO are worth creating. However, I should work not with CSV but one of the 2 JSONs. Eg this plane https://www.opensanctions.org/entities/NK-27foGhde2c676yHk4Szsir/ is owned by MAHAN AIR, which is WD https://www.wikidata.org/wiki/Q1149762 (OS does not yet know that WD id, and MnM can find that coreference very easily).
We want the link airline-aircraft in WD. Which means we should import the remaining entities in tiers: first persons and orgs, then assets.
Country distribution:
csvtk freq -f countries openSanctions-no-WD.csv|sort -rn -t, -k2
,14988
ru,9062
tr,2540
cn,1917
ir,1853
pt,1721
by,1597
sv,1293
sy,1086
mx,1076
ua,1046
...
There are also many combinations, eg
af;bd;in;pk,1
af;az;ba;bd;ca;cn;gb;ge;hr;il;nl;pk;ps;ru;sa;sd;tj;us;ye,1
af;al;az;ba;bd;be;cn;er;et;fr;ge;in;iq;jo;lb;pk;ps;rs;ru;sl;so;sy;tr,1
ae;za,1
Russian entities:
csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk nrow
9448
csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk freq -f schema
Airplane,1
Vessel,40
Company,466
LegalEntity,438
Organization,580
Person,7923
Russian marine vessels (tankers, yachts etc):
csvtk grep -f countries -rp '.*ru.*' openSanctions-no-WD.csv|csvtk grep -f schema -p Vessel
eg
ofac-36401,Vessel,Lady Sevda,LADY SEVDA,,ru,,273342180;IMO 9683738;UBWL7,Program - SDN List - Block - Executive Order 14024;RUSSIA-EO14024,,,US OFAC Specially Designated Nationals (SDN) List;US Trade Consolidated Screening List (CSL),2022-04-06 18:24:16,2022-04-14 06:17:01
That's not on WD but is found by AIS: currently crossing from the Black Sea into Azov Sea:
FWIW this coverage should now be improved significantly, as I've been doing regular lookups. I just wish Wikidata had an API for search that let you search many aliases for an entity at the same time... that way the precision would be better yet.
@wetneb, can you give @pudo some advice?
https://github.com/opensanctions/opensanctions/issues/217#issuecomment-1225635310: Wikidata people are matching the Ukrainian list
I am sadly also not aware of such an API on Wikidata's side. That's not something that the reconciliation service I run offers either.
Asked in the telegram Wikidata group: https://t.me/wikidatachat/1406
Esp. "Politically Exposed Persons" should be in the dataset of wikidata.org.
So it should be possible transfer urls like
https://www.opensanctions.org/entities/eu-cor-2014776/ into https://www.opensanctions.org/entities/Q807599
or https://www.opensanctions.org/entities/NK-4vLiLeY59bCZraMeeB94kv/ into https://www.opensanctions.org/entities/Q208242