The version from 08-Jan-2019 takes 663 GB after unpacking (packaged version: about 32 GB).
Further, this huge JSON is repackaged into MessagePack with some information removed; that takes roughly 300 GB. After the whole preprocessing you are left with a wikidata folder that takes about 10 GB (this is language independent).
Also there's some preprocessing of the Wikipedia dump, but the resulting size depends on the size of the Wikipedia dump in the given language.
To sum up, the preprocessing step requires a bit more than 1 TB of space, but for further steps 20-100 GB should be fine.
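For a rough idea, the JSON-to-MessagePack repackaging can be sketched like this (a minimal sketch, not the project's actual preprocessing code; the filenames and the set of kept fields are placeholders):

```python
# Minimal sketch: stream the decompressed Wikidata JSON dump (one entity per
# line, wrapped in a JSON array) and re-serialize a slimmed-down version of
# each entity as MessagePack. Filenames and kept fields are placeholders.
import json
import msgpack  # pip install msgpack

KEEP = ("id", "labels", "claims")  # example subset of fields to keep

with open("wikidata-20190108-all.json", encoding="utf-8") as src, \
        open("wikidata.msgpack", "wb") as dst:
    for line in src:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if line in ("[", "]", ""):       # skip the enclosing array brackets
            continue
        entity = json.loads(line)
        slim = {k: entity[k] for k in KEEP if k in entity}
        dst.write(msgpack.packb(slim, use_bin_type=True))
```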
(Decompression is optional but speeds up subsequent steps. In the past I tried to parse the files without decompressing, but this makes the rest of the pipeline slow. If disk space is an issue and you are patient, you can read the Wikipedia dump directly.)
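For completeness, reading the compressed dump directly looks roughly like this (a sketch; as noted above it is much slower than working on the unpacked file, and the filename is a placeholder):

```python
# Sketch: stream the .json.bz2 dump without decompressing it to disk.
# Needs almost no extra space, but iteration is much slower.
import bz2
import json

with bz2.open("wikidata-latest-all.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):   # skip the enclosing JSON array brackets
            continue
        entity = json.loads(line)
        # ... process the entity here ...
```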
It's up to around 1.1 TB now after unpacking; the packaged version is around 51 GB.
As of April 2021, the uncompressed file is 1142 GB.
Please keep this issue open, as it's the #1 Google search hit for this question.
The dump from the 5th of May, 2021, takes 1234 GB.
As of July 1, 2021, it takes over 1317 GB.
If you want to decompress it, I would highly recommend renaming all properties, e.g. rename "wikibase-entityid" to "wid". You can reduce the total size by 30-50% by preprocessing the file with a simple Python script (import bz2, multithreading, etc.). The bz2-compressed file goes from 70 GB to 1400 GB when decompressed because there is a lot of redundancy in the raw data; you can easily get back down to 800 GB by removing useless data and renaming properties.
On my SSD, a Python script goes from ~1,500 lines/second (compressed) to ~90,000 lines/second (uncompressed). So it takes about ~15 min to process Wikidata entirely on disk without Spark, an index, or a cache.
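A minimal sketch of that kind of prune-and-rename pass (the renaming map, the dropped keys, and the filenames are only examples; a real script would likely add multiprocessing and could read from bz2 directly as above):

```python
# Sketch: shrink the decompressed dump by dropping unneeded keys and
# shortening repeated names. The maps below are examples, not a fixed scheme.
import json

RENAME = {"wikibase-entityid": "wid", "datavalue": "dv", "mainsnak": "ms"}
DROP = {"descriptions", "aliases", "sitelinks"}  # example fields to discard

def slim(obj):
    """Recursively drop unwanted keys and shorten the remaining names."""
    if isinstance(obj, dict):
        return {RENAME.get(k, k): slim(v) for k, v in obj.items() if k not in DROP}
    if isinstance(obj, list):
        return [slim(v) for v in obj]
    if isinstance(obj, str):
        return RENAME.get(obj, obj)  # also shorten repeated string values
    return obj

with open("wikidata-all.json", encoding="utf-8") as src, \
        open("wikidata-slim.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        dst.write(json.dumps(slim(json.loads(line)), separators=(",", ":")) + "\n")
```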
Seconding @Whiax's comments - much of the bulk of the decompressed format is repeated field names and the like. Any effort towards removing fields you won't need dramatically reduces the output file. It's possible to operate on the compressed dump directly (but the speed penalty is quite big, per @Whiax's numbers).
It looks like 500 GB is not enough.