openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

How much disk space is needed to decompress latest-all.json.bz2? #47

Closed SeekPoint closed 3 years ago

SeekPoint commented 5 years ago

It looks like 500 GB is not enough.

mflis commented 5 years ago

The version from 08-Jan-2019 takes 663 GB after unpacking (compressed version: about 32 GB).

mflis commented 5 years ago

Further along in the pipeline, this huge JSON is repackaged into MessagePack with some information removed; that takes roughly 300 GB. After the whole preprocessing, you are left with a wikidata folder of about 10 GB (this part is language-independent).

There is also some preprocessing of the Wikipedia dump, but the resulting size depends on the size of the Wikipedia dump in the given language.

To sum up: the preprocessing step requires a bit more than 1 TB of space, but for the further steps 20-100 GB should be fine.
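
For a sense of what that repackaging step can look like, here is a rough, hypothetical sketch (not the repo's actual preprocessing code; the KEEP field list is purely illustrative). It streams the decompressed dump one entity per line, keeps a subset of fields, and writes them out as MessagePack:

```python
import json
import msgpack  # pip install msgpack

# Hypothetical repackaging sketch -- NOT the repo's actual preprocessing script.
# Keep a subset of fields per entity and write them as MessagePack, which is
# far more compact than the raw JSON dump.
KEEP = ("id", "labels", "claims", "sitelinks")  # illustrative field subset

packer = msgpack.Packer()
with open("latest-all.json", encoding="utf-8") as src, \
        open("wikidata_slim.msgpack", "wb") as dst:
    for line in src:
        line = line.rstrip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the enclosing JSON array brackets
        entity = json.loads(line)
        dst.write(packer.pack({k: entity[k] for k in KEEP if k in entity}))
```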

JonathanRaiman commented 5 years ago

(Decompression is optional, but it speeds up the subsequent steps. In the past I tried parsing the files without decompressing, but that makes the rest of the pipeline slow. If disk space is an issue and you are patient, you can read the Wikipedia dump directly.)
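
As a rough illustration of that approach, here is a minimal sketch using Python's built-in bz2 module, assuming the Wikidata latest-all.json.bz2 layout of one JSON entity per line inside a big JSON array (this is not the repo's pipeline code):

```python
import bz2
import json

def iter_entities(path="latest-all.json.bz2"):
    """Stream Wikidata entities straight from the bz2 dump, without decompressing to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip the enclosing JSON array brackets
            yield json.loads(line)

# Slow (bz2 decoding dominates), but it needs no extra disk space:
# for entity in iter_entities():
#     ...
```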

norn93 commented 4 years ago

It's up to around 1.1 TB now after unpacking; the compressed version is around 51 GB.

ghost commented 3 years ago

As of April 2021, the uncompressed file is 1142 GB.

Please keep this issue open, as it's the #1 Google search hit for this question.

dmitsf commented 3 years ago

The dump from the 5th of May, 2021, takes 1234 GB.

dadelani commented 3 years ago

As of July 1, 2021, it takes over 1317 GB.

Whiax commented 2 years ago

If you want to decompress it, I would highly recommend renaming all the properties, for example renaming "wikibase-entityid" to "wid". You can reduce the total size by 30-50% by preprocessing the file with a simple Python script (import bz2, multithread, etc.). The bz2-compressed file grows from 70 GB to 1400 GB when decompressed because the raw data is highly redundant, and you can easily get back down to 800 GB by removing useless data and renaming properties.

On my SSD, a Python script goes from ~1,500 lines/second (compressed) to ~90,000 lines/second (uncompressed), so it takes about 15 minutes to process all of Wikidata on disk without Spark, an index, or a cache.
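
A rough sketch of the kind of script described above (not @Whiax's actual code; the RENAME and DROP tables are illustrative, and the multithreading part is left out for brevity):

```python
import bz2
import json

# Illustrative rename/drop tables -- adjust to whatever your use case needs.
RENAME = {"wikibase-entityid": "wid", "mainsnak": "ms", "datavalue": "dv"}
DROP = {"descriptions", "aliases"}

def shrink(obj):
    """Recursively rename verbose keys/values and drop unneeded fields."""
    if isinstance(obj, dict):
        return {RENAME.get(k, k): shrink(v) for k, v in obj.items() if k not in DROP}
    if isinstance(obj, list):
        return [shrink(v) for v in obj]
    if isinstance(obj, str):
        return RENAME.get(obj, obj)  # shorten repeated string values too
    return obj

with bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") as src, \
        open("latest-all.slim.json", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the enclosing JSON array brackets
        dst.write(json.dumps(shrink(json.loads(line)), ensure_ascii=False) + "\n")
```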

JonathanRaiman commented 2 years ago

Seconding @Whiax's comments - much of the bulk of the decompressed format comes from repeated field names and the like. Any effort towards removing fields you won't need dramatically reduces the output file. It's possible to operate on the compressed dump directly, but the speed penalty is quite big, per @Whiax's numbers.