mattgodbolt / zindex

Create an index on a compressed text file
BSD 2-Clause "Simplified" License
620 stars 37 forks

Index is much larger (10x) than (compressed) input file #38

Closed · cpatulea closed this issue 5 years ago

cpatulea commented 5 years ago

I am trying to index the CommonCrawl 'Host-Level Web Graphs' files: http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/

specifically the "Host-level graph" files listed in "cc-main-2018-aug-sep-oct-host-vertices.paths.gz". In practice these are a sequence of ~100-200 MB gzip files. The uncompressed content is a line-oriented mapping of numeric id to reversed hostname:

$ ls -lhs *7d*.gz
184M -rw-r----- 1 catalinp adm 184M Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
119M -rw-r----- 1 catalinp adm 119M Nov 22 19:33 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
134M -rw-r----- 1 catalinp adm 134M Nov 22 19:33 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
139M -rw-r----- 1 catalinp adm 139M Nov 22 19:33 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
143M -rw-r----- 1 catalinp adm 143M Nov 22 19:33 part-00004-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
$ zcat part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz |head
44821498        cc.chinalotus.d5rrn5hzb3
44821499        cc.chinalotus.d5rt7
44821500        cc.chinalotus.d5rtp
44821501        cc.chinalotus.d5rv5
44821502        cc.chinalotus.d5rvb
44821503        cc.chinalotus.d5rvn
44821504        cc.chinalotus.d5rvp
44821505        cc.chinalotus.d5rx5
44821506        cc.chinalotus.d5rx7
44821507        cc.chinalotus.d5rxb
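
For reference, each line can be decoded like this (`parse_vertex` is just an illustrative helper, not part of any tool mentioned here):

```python
# Hypothetical helper: parse one line of the vertices file and restore
# the hostname from its reversed ("cc.chinalotus...") form.
def parse_vertex(line: str) -> tuple[int, str]:
    node_id, rev_host = line.rstrip("\n").split("\t")
    host = ".".join(reversed(rev_host.split(".")))
    return int(node_id), host

print(parse_vertex("44821499\tcc.chinalotus.d5rt7"))
# → (44821499, 'd5rt7.chinalotus.cc')
```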

I am building the index with the following command (tab-delimited, field 2, unique):

zindex --tab-delimiter -f 2 -u -v part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz

The index is significantly larger (10x) than the input gzip file:

184M -rw-r----- 1 catalinp adm 184M Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
1.8G -rw-r----- 1 catalinp adm 1.8G Nov 25 14:18 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

119M -rw-r----- 1 catalinp adm 119M Nov 22 19:33 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
1.5G -rw-r----- 1 catalinp adm 1.5G Nov 25 14:16 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

134M -rw-r----- 1 catalinp adm 134M Nov 22 19:33 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
1.7G -rw-r----- 1 catalinp adm 1.7G Nov 25 14:17 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

139M -rw-r----- 1 catalinp adm 139M Nov 22 19:33 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
1.8G -rw-r----- 1 catalinp adm 1.8G Nov 25 14:18 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

The index, however, is functional:

$ zq part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz biz.tamesu.f5ycpp
26437738        biz.tamesu.f5ycpp
cpatulea commented 5 years ago

The seekgzip index is much smaller (~40x smaller than the gzip file):

184M -rw-r----- 1 catalinp adm 184M Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
5.6M -rw-r----- 1 catalinp adm 5.6M Nov 25 14:36 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx

119M -rw-r----- 1 catalinp adm 119M Nov 22 19:33 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
3.6M -rw-r----- 1 catalinp adm 3.6M Nov 25 14:36 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx

134M -rw-r----- 1 catalinp adm 134M Nov 22 19:33 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
4.0M -rw-r----- 1 catalinp adm 4.0M Nov 25 14:36 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx

139M -rw-r----- 1 catalinp adm 139M Nov 22 19:33 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
4.2M -rw-r----- 1 catalinp adm 4.2M Nov 25 14:36 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx

and since my input is already sorted, I will attempt to use seekgzip + binary search for random access.
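
The binary-search part is sketched below over a plain byte stream; with seekgzip you would substitute its seek-to-uncompressed-offset call (not shown, and its exact API would need checking):

```python
# Sketch only: binary search over a sorted, tab-delimited file via
# byte-offset random access. The trick: after seeking to an arbitrary
# offset, discard the partial line and compare the next full line's key.
import io

def lookup(f, size: int, key: bytes):
    """Return the line whose second tab field equals `key`, or None."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid:
            f.readline()          # skip the partial line we landed in
        line = f.readline()
        if not line or line.rstrip(b"\n").split(b"\t")[1] >= key:
            hi = mid              # answer is at or before mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo:
        f.readline()
    line = f.readline()
    if line and line.rstrip(b"\n").split(b"\t")[1] == key:
        return line.rstrip(b"\n").decode()
    return None

data = b"26437737\tbiz.tamesu.f5ycpn\n26437738\tbiz.tamesu.f5ycpp\n"
print(lookup(io.BytesIO(data), len(data), b"biz.tamesu.f5ycpp"))
```

Each probe costs one seek plus at most two line reads, so a lookup is O(log n) seeks into the decompressed stream.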

mattgodbolt commented 5 years ago

Yikes! That's unfortunate. With your input already sorted, perhaps seekgzip et al are a better way to go. Behind the scenes zindex uses sqlite (as you've probably noted) and we rely on its index's efficiency. I've had good results when the lines themselves are long compared to the key, but in this case I suspect the compressibility of the file means the sqlite key storage dwarfs the original.
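
A back-of-envelope estimate shows why (every number below is a guess for this dataset, not a measurement): gzip squeezes the highly repetitive text, while the field index must store each key verbatim plus per-row b-tree overhead.

```python
# All values are rough assumptions to illustrate the order of magnitude.
compressed = 184 * 2**20   # one ~184 MiB part file
gzip_ratio = 7             # guessed compression ratio for this data
avg_line = 30              # guessed bytes per line (id + tab + hostname)
per_row = 56               # guessed key bytes + sqlite row/b-tree cost

lines = compressed * gzip_ratio // avg_line
index_size = lines * per_row
print(f"~{lines / 1e6:.0f}M lines -> ~{index_size / 2**30:.1f} GiB index")
# → ~45M lines -> ~2.3 GiB index
```

That lands in the same ballpark as the observed 1.5-1.8 GiB index files.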

I'm not sure there's much to do here other than note that your file is probably not a great match for zindex.

Thanks for taking the time to investigate, and to send a PR!

mattgodbolt commented 5 years ago

In theory, running zindex with no field index at all (and tuning the `--checkpoint-every` value) should yield very similar results to seekgzip, based on my reading of what it's doing. You're still stuck with doing the binary search yourself: I know https://github.com/hellige/au does something like this (albeit with its own file format).

cpatulea commented 5 years ago

Indeed! `zindex -v` (with the default `--checkpoint-every` of 32 MiB) was lightning fast and produced a tiny index file:

$ ~/src/zindex/build/Release/zindex -v part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
Opening database part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 183.25 MiB (0.00%)
Index building complete; creating line index
Flushing
Done
Closing database
-rw-r----- 1 catalinp adm 192,156,707 Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
-rw-r----- 1 catalinp adm   5,779,226 Nov 25 14:36 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
-rw-r----- 1 catalinp adm     189,440 Nov 26 23:30 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

`--checkpoint-every 32768` (which I believe is equivalent to seekgzip) is a bit sluggish and still produces quite a large index:

$ ~/src/zindex/build/Release/zindex --checkpoint-every 32768 -v part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
Warning: Rebuilding existing index part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex
Opening database part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 KiB
Indexing...
Progress: 10 bytes of 183.25 MiB (0.00%)
Progress: 89.42 MiB of 183.25 MiB (48.80%)
Index building complete; creating line index
Flushing
Done
Closing database
-rw-r----- 1 catalinp adm  57,297,920 Nov 26 23:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex

I'm just recording findings for posterity at this point; I understand zindex is probably just not the right tool for this job.

Thanks for the pointer to au. I was also looking at bsearch: https://gitlab.com/ole.tange/tangetools/blob/master/bsearch/bsearch

Thanks for merging my PR!

mattgodbolt commented 5 years ago

Thanks for following up! I would imagine 32 MiB is more than enough :) I'm surprised seekgzip checkpoints every 32 KiB (that's basically every gzip block boundary!). It's always a tradeoff, but ungzipping is pretty quick :)
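
A rough model of that tradeoff, assuming (as in zlib's zran-style random access, which I believe zindex follows) each checkpoint stores a 32 KiB inflate window, and guessing ~1.2 GiB uncompressed for a 184 MiB gzip; zindex presumably compresses the stored windows, so the real index files come out smaller than the raw window bytes:

```python
# Checkpoint count and raw window storage for two checkpoint intervals.
# All sizes are assumptions for illustration, not measured values.
WINDOW = 32 * 1024             # assumed inflate window kept per checkpoint
uncompressed = 1.2 * 2**30     # guessed uncompressed size of one part file
for name, interval in (("32 KiB", 32 * 1024), ("32 MiB", 32 * 2**20)):
    n = uncompressed / interval
    print(f"every {name}: ~{n:,.0f} checkpoints, "
          f"~{n * WINDOW / 2**20:,.0f} MiB of raw windows")
```

Coarser checkpoints mean a far smaller index at the cost of decompressing up to one interval's worth of data per random access.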

I'm going to close this off; thank you for your in-depth comments!