maxlath / wikibase-dump-filter

Filter and format a newline-delimited JSON stream of Wikibase entities

Filter purely by Q-Number #37

Open ajinnah opened 2 years ago

ajinnah commented 2 years ago

Hello,

I have a large list of Wikidata IDs (Q-numbers), and I'd like to keep only these entities. Does this already exist, or would it be possible to implement?

Thank you!

maxlath commented 2 years ago

It's not implemented, but it could be done fairly easily with grep (which will be much faster; see the documentation on prefiltering):

# Create a file with one pattern per id, each anchored to the start of a dump line
# ([{] matches a literal brace portably in extended regular expressions)
echo "Q1
Q2
Q3" | awk '{print "^[{]\"type\":\"item\",\"id\":\"" $1 "\","}' > qid_filter

# Filter the dump with that shortlist of ids and strip the trailing comma from each line
gzip -dc latest-all.json.gz | grep -E -f qid_filter | sed 's/,$//' > selected_entities.ndjson
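
For a very long list of ids, grep has to handle one regular expression per id, which may get slow. A possible alternative is to load the ids into an awk lookup table and test each dump line's id against it. This is only a sketch, not something wikibase-dump-filter provides: the function name `filter_by_qid` and the file name `qids.txt` are made up for illustration, and it assumes every dump line starts with `{"type":"item","id":"Q…",` in that key order, as the grep pattern above does.

```shell
# filter_by_qid IDS_FILE
# Hypothetical helper: reads dump lines on stdin and prints the entities
# whose id is listed in IDS_FILE (one id per line, e.g. Q42).
filter_by_qid() {
  awk -F'"' '
    # First input (the ids file): store each id in a lookup table
    NR == FNR { if ($1 != "") wanted[$1] = 1; next }
    # Second input (the dump on stdin): lines look like
    #   {"type":"item","id":"Q42",...
    # so, splitting on double quotes, field 6 is the key "id"
    # and field 8 is the entity id itself
    $6 == "id" && ($8 in wanted) {
      sub(/,$/, "")   # drop the trailing comma, as sed does above
      print
    }
  ' "$1" -
}

# Usage (file names are assumptions):
#   gzip -dc latest-all.json.gz | filter_by_qid qids.txt > selected_entities.ndjson
```

Because the id is compared as a whole string, Q1 does not match Q10 or Q100. The `[` and `]` lines that open and close the dump array are skipped, since they have no `id` field.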