This repository is a collection of scripts to pre-process a Wikidata dump. All the code was tested with the latest-truthy dump of Wikidata (June 9th, 2024).
The first step is to apply some filters to the dump in order to remove redundant or unnecessary triples.
This filter removes the descriptions of subjects/objects in languages other than English, leaving only the English description.
python remove_labels_and_descriptions.py < [input .nt file] > [output .nt file]
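As an illustration of the idea behind this filter, here is a minimal sketch of a streaming language filter. It is not the code of remove_labels_and_descriptions.py: the names `LANG_TAG`, `keep_line` and `filter_stream` are hypothetical, and the sketch assumes that a language-tagged literal always ends its line with `"@<lang> .`, and that only the exact tag `en` is kept.

```python
import re

# Assumption: in N-Triples, a language-tagged literal closes the line as "...\"@<lang> ."
LANG_TAG = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$')


def keep_line(line, keep_lang="en"):
    """Return True if the triple on this line should stay in the output."""
    match = LANG_TAG.search(line)
    if match is None:
        # Not a language-tagged literal (URI object, typed literal, ...): keep it.
        return True
    return match.group(1).lower() == keep_lang


def filter_stream(lines, keep_lang="en"):
    """Lazily yield only the lines accepted by keep_line."""
    for line in lines:
        if keep_line(line, keep_lang):
            yield line


sample = [
    '<s> <p> "house"@en .\n',
    '<s> <p> "casa"@es .\n',
    '<s> <p> <o> .\n',
]
filtered = list(filter_stream(sample))
```

Because the function works on any iterable of lines, it could be connected to `sys.stdin` and `sys.stdout` to mimic the redirection used in the command above without loading the dump into memory.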
This filter removes all the properties not starting with "<http://www.wikidata.org/prop/". As a byproduct, this filter generates a file named removed_properties.txt with all properties removed.
python remove_properties.py < [input .nt file] > [output .nt file]
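The prefix test described above can be sketched as follows. This is only an assumed reading of the filter, not the contents of remove_properties.py: `PREFIX` is taken from the text, while `filter_properties` and the `removed` set are hypothetical names. The set plays the role of removed_properties.txt.

```python
PREFIX = "<http://www.wikidata.org/prop/"


def filter_properties(lines, removed, prefix=PREFIX):
    """Yield the triples whose predicate starts with `prefix`.

    Predicates of dropped triples are collected in the `removed` set.
    """
    for line in lines:
        # Split into subject, predicate and the rest (object plus final dot).
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # assumption: malformed lines are skipped silently
        if parts[1].startswith(prefix):
            yield line
        else:
            removed.add(parts[1])


sample = [
    "<http://www.wikidata.org/entity/Q1> "
    "<http://www.wikidata.org/prop/direct/P31> "
    "<http://www.wikidata.org/entity/Q2> .\n",
    '<http://www.wikidata.org/entity/Q1> <http://schema.org/version> "1" .\n',
]
removed = set()
kept = list(filter_properties(sample, removed))
```

Writing `sorted(removed)` to a text file after the stream has been consumed would produce a list comparable to the byproduct file.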
This filter removes all the triples that create a cycle in the graph. The script is intended to work only with containment-related predicates, where a cycle can be considered an error or a bug. However, the code is general enough to be applied to other predicates. As a byproduct, this filter generates a file named removed_triples_cycle.txt with all the removed triples.
The list of predicates to be considered is given as an input file in the following format:
<predicate 1> <direction>
<predicate 2> <direction>
...
where <direction> indicates how to interpret the triple SPO (0: S--P-->O, 1: O--P-->S). The direction matters in particular for predicates related to the containment relation. For an example, see the file cycle_predicates.txt.
python delete_cycles.py --input <input .nt file> --output <output .nt file> --subset-preds <.txt file with the subset of predicates>
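To make the role of the direction flag and the cycle removal more concrete, here is a minimal sketch. It assumes one standard technique, dropping the back edges found by a depth-first search, which may not select the same triples as delete_cycles.py. All function names (`load_predicates`, `orient`, `find_cycle_edges`) are hypothetical.

```python
def load_predicates(lines):
    """Parse '<predicate> <direction>' lines into {predicate: direction}."""
    preds = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            preds[fields[0]] = int(fields[1])
    return preds


def orient(s, o, direction):
    """Turn a triple into a directed edge: 0 means S->O, 1 means O->S."""
    return (s, o) if direction == 0 else (o, s)


def find_cycle_edges(edges):
    """Return the edges that close a cycle (DFS back edges), iteratively."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = dict.fromkeys(graph, WHITE)
    back_edges = []
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if color[nxt] == GREY:
                    back_edges.append((node, nxt))  # edge back into the path
                elif color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break  # descend; this iterator is resumed later
            else:
                color[node] = BLACK  # all neighbours explored
                stack.pop()
    return back_edges


preds = load_predicates(["<P131> 0", "<P150> 1"])
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
cycle_edges = find_cycle_edges(edges)
```

Removing every back edge leaves an acyclic graph, and the explicit stack avoids Python's recursion limit on long containment chains. Which edge of a cycle gets removed depends on the traversal order, so the result is not unique.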
To apply all the filters to the Wikidata dump latest-truthy.nt, use
cat latest-truthy.nt | python3 remove_labels_and_descriptions.py | python3 remove_properties.py > latest-truthy_filtered.nt
python delete_cycles.py --input latest-truthy_filtered.nt --output latest-truthy_filtered_nocycles.nt --subset-preds cycle_predicates.txt
Dataset | Number of triples |
---|---|
latest-truthy (original) | 8,254,120,518 |
After filter 1 | 2,276,362,123 |
After filter 2 | 1,617,500,079 (26 properties deleted) |
After filter 3 | 1,615,616,023 (1,884,056 triples deleted) |
The second step is to convert the filtered dataset into a new version that uses continuous identifiers for the subjects/objects and predicates. The output dataset has the extension .nt.dat. Additionally, two dictionaries are generated to convert identifiers back to entries of the filtered dataset, one for subjects/objects (extension .nt.dat.SO) and one for predicates (extension .nt.dat.P).
python continuous_ids.py --input <input .nt file>
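The mapping to continuous identifiers can be pictured with the sketch below. It is an assumption-laden illustration rather than the logic of continuous_ids.py: identifiers start at 0 and follow first appearance, the on-disk layout of the .nt.dat, .nt.dat.SO and .nt.dat.P files is not modelled, and `assign_ids` and `invert` are hypothetical names.

```python
def assign_ids(triples):
    """Replace terms with consecutive integers.

    Subjects and objects share one id space; predicates get their own.
    """
    so_ids, p_ids = {}, {}
    encoded = []
    for s, p, o in triples:
        # setdefault assigns the next free id only on first appearance.
        sid = so_ids.setdefault(s, len(so_ids))
        pid = p_ids.setdefault(p, len(p_ids))
        oid = so_ids.setdefault(o, len(so_ids))
        encoded.append((sid, pid, oid))
    return encoded, so_ids, p_ids


def invert(ids):
    """Build the id -> term lookup list from a term -> id dictionary."""
    terms = [None] * len(ids)
    for term, i in ids.items():
        terms[i] = term
    return terms


triples = [("Q1", "P31", "Q5"), ("Q2", "P31", "Q5"), ("Q1", "P131", "Q2")]
encoded, so_ids, p_ids = assign_ids(triples)
so_terms = invert(so_ids)
p_terms = invert(p_ids)
```

The two inverted lists correspond to the two dictionaries mentioned above: since the identifiers have no gaps, a plain list indexed by identifier is enough to recover the original term.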
This script computes some stats from the input .nt file, treating it as a graph. In particular, the computed stats are:
--max-deg
#Subset1
<predicate 1>
<predicate 2>
...
#Subset2
<predicate 1>
<predicate 2>
...
For an example, check the file sets_predicates.txt
python get_stats.py --input <input .nt file> --subset-preds <.txt file with the subset of predicates> --max-deg <limit degree>
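The subset file format shown above can be read with a few lines of Python. The sketch below parses the `#SubsetN` blocks and counts how many triples fall into each subset. It is a guess at one of the stats, not the code of get_stats.py, and it does not cover `--max-deg`. `parse_subsets` and `count_triples_per_subset` are hypothetical names.

```python
from collections import Counter


def parse_subsets(lines):
    """Parse '#Name' headers followed by one predicate per line."""
    subsets, current = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = line[1:].strip()
            subsets[current] = []
        elif current is not None:
            subsets[current].append(line)
    return subsets


def count_triples_per_subset(triples, subsets):
    """Count the triples whose predicate belongs to each subset."""
    pred_to_subsets = {}
    for name, preds in subsets.items():
        for p in preds:
            pred_to_subsets.setdefault(p, []).append(name)
    counts = Counter({name: 0 for name in subsets})
    for _, p, _ in triples:
        for name in pred_to_subsets.get(p, ()):
            counts[name] += 1
    return counts


subsets = parse_subsets(["#Subset1", "P150", "P131", "#Subset2", "P47"])
triples = [
    ("a", "P150", "b"),
    ("b", "P131", "a"),
    ("a", "P47", "c"),
    ("a", "P31", "d"),
]
counts = count_triples_per_subset(triples, subsets)
```

A predicate may appear in more than one subset; in that case the sketch counts the triple once per subset it belongs to.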
- P150 and P131 (representing containment relation) (filter 1 + filter 2): 14,520,899
- P150 and P131 (representing containment relation) (filter 1 + filter 2 + filter 3): 12,636,843
- P47 (representing adjacency relation): 919,701
- P171, P279, P1647 and P397 (representing containment relation) (filter 1 + filter 2 + filter 3): 8,820,421

- Q13442814: 41,928,868 (scholarly article -- article in an academic publication, usually peer reviewed)
- Q1860: 14,149,995 (English -- West Germanic language)
- Q5: 11,861,436 (human -- any member of Homo sapiens)
- Q1264450: 8,081,235 (J2000.0 -- epoch in astronomy)
- Q6581097: 6,851,060 (male -- to be used in "sex or gender" (P21) or "semantic gender" (P10339))
- 1: 5,603,026
- Q4167836: 5,385,408 (Wikimedia category -- use with 'instance of' (P31) for Wikimedia category)
- 2: 4,818,801
- 3: 4,320,894
- 4: 3,918,407

- Q39790431: 8,348 (BayGenomics: a resource of insertional mutations in mouse embryonic stem cells -- scientific article published on January 2003)
- Q6382438: 6,704 (Shigella sonnei -- species of bacterium)
- Q213019: 6,479 (The War of the Worlds -- 1898 novel by H. G. Wells)
- Q1644417: 6,278 (Shigella flexneri -- species of bacterium)
- Q21600865: 5,662 (Salmonella enterica subsp. enterica -- subspecies of bacterium)
- Q112113034: 5,560 (Death following pulmonary complications of surgery before and during the SARS-CoV-2 pandemic -- scientific article published on 13 November 2021)
- Q56836084: 5,480 (40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany, 5-9 September 2004 -- article)
- Q64022985: 5,225 (Combinations of single-top-quark production cross-section measurements and |fLVVtb| determinations at sqrt(s) = 7 and 8 TeV with the ATLAS and CMS experiments -- article)
- Q21558717: 5,200 (Combined Measurement of the Higgs Boson Mass in pp Collisions at sqrt(s)=7 and 8 TeV with the ATLAS and CMS Experiments -- scientific article)
- Q56754739: 5,128 (Measurements of the Higgs boson production and decay rates and constraints on its couplings from a combined ATLAS and CMS analysis of the LHC pp collision data at sqrt(s)=7 and 8 TeV -- article)