The seekgzip index appears much smaller (roughly 33x smaller than the gzip files, per the listing below):
184M -rw-r----- 1 catalinp adm 184M Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
5.6M -rw-r----- 1 catalinp adm 5.6M Nov 25 14:36 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
119M -rw-r----- 1 catalinp adm 119M Nov 22 19:33 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
3.6M -rw-r----- 1 catalinp adm 3.6M Nov 25 14:36 part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
134M -rw-r----- 1 catalinp adm 134M Nov 22 19:33 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
4.0M -rw-r----- 1 catalinp adm 4.0M Nov 25 14:36 part-00002-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
139M -rw-r----- 1 catalinp adm 139M Nov 22 19:33 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
4.2M -rw-r----- 1 catalinp adm 4.2M Nov 25 14:36 part-00003-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
Since my input is already sorted, I will attempt to use seekgzip + binary search for random access.
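Roughly what I have in mind, as a Python sketch (not seekgzip's actual API): it assumes a handle `f` that supports seek()/readline() by uncompressed offset, which is what a seekgzip-style index makes cheap, and the function names and the helper are purely illustrative:

```python
# Sketch: binary search over a sorted, line-oriented file through any handle
# `f` that can seek()/readline() by *uncompressed* offset (the capability a
# seekgzip-style index provides).  Assumes lines are sorted in plain byte
# order on the chosen field and that keys are unique (like the -u field here).

def _line_at_or_after(f, pos):
    """Return (line, start) for the first complete line beginning at offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        if f.read(1) != b"\n":   # landed mid-line: skip to the next line start
            f.readline()
    start = f.tell()
    return f.readline(), start

def lookup(f, key, size, field=2, delim=b"\t"):
    """Find the line whose 1-based `field` equals `key` (bytes), or None."""
    lo, hi = 0, size             # the matching line, if any, starts in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        line, start = _line_at_or_after(f, mid)
        if not line:             # nothing starts at or after mid
            hi = mid
            continue
        current = line.rstrip(b"\n").split(delim)[field - 1]
        if current == key:
            return line
        if current < key:
            lo = start + len(line)   # match, if any, starts after this line
        else:
            hi = mid                 # match, if any, starts before mid
    return None
```

To sanity-check the logic it can be run against the uncompressed file directly (open(path, "rb") plus os.path.getsize(path)); the point of the seekgzip index is just to make the same seek pattern cheap on the .gz.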
Yikes! That's unfortunate. With your input already sorted, perhaps seekgzip et al. are a better way to go. Behind the scenes zindex uses sqlite (as you've probably noted) and we rely on its index's efficiency. I've had good results when the lines themselves are long compared to the key, but in this case I suspect the compressibility of the file means the sqlite key size dwarfs the original.
I'm not sure there's much to do here other than note that your file is probably not a great match for zindex.
Thanks for taking the time to investigate, and to send a PR!
In theory, running zindex with no index at all (and tuning the `--checkpoint-every` value) should yield very similar results to seekgzip, based on my reading of what it's doing. You're still stuck with trying to do a binary search: I know https://github.com/hellige/au does something like this (albeit with its own file format).
Indeed! `zindex -v` (default checkpoint-every of 32 MiB) was lightning fast and produced a tiny index file:
$ ~/src/zindex/build/Release/zindex -v part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
Opening database part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 183.25 MiB (0.00%)
Index building complete; creating line index
Flushing
Done
Closing database
-rw-r----- 1 catalinp adm 192,156,707 Nov 22 19:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
-rw-r----- 1 catalinp adm 5,779,226 Nov 25 14:36 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.idx
-rw-r----- 1 catalinp adm 189,440 Nov 26 23:30 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex
Checkpoint-every 32K (which I believe is equivalent to what seekgzip does) is a bit sluggish and still produces quite a large index:
$ ~/src/zindex/build/Release/zindex --checkpoint-every 32768 -v part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
Warning: Rebuilding existing index part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex
Opening database part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 KiB
Indexing...
Progress: 10 bytes of 183.25 MiB (0.00%)
Progress: 89.42 MiB of 183.25 MiB (48.80%)
Index building complete; creating line index
Flushing
Done
Closing database
-rw-r----- 1 catalinp adm 57,297,920 Nov 26 23:33 part-00000-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz.zindex
I'm just recording findings for posterity at this point; I understand zindex is probably just not the right tool for this job.
Thanks for the pointer to au. I was also looking at bsearch: https://gitlab.com/ole.tange/tangetools/blob/master/bsearch/bsearch
Thanks for merging my PR!
Thanks for following up! I would imagine 32 MB is more than enough :) I'm surprised seekgzip checkpoints every 32 KB (that's basically at every gzip block boundary!). It's always a tradeoff, but ungzipping is pretty quick :)
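To make that tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes checkpoints are spaced by uncompressed offset and that each one stores roughly a 32 KiB dictionary window (the zran.c-style technique the checkpoints resemble); the real index also stores line offsets and may compress the windows, and the 1 GiB stream size below is purely hypothetical, so treat the numbers as order-of-magnitude only:

```python
# Rough cost model for --checkpoint-every (assumptions noted above).
KIB, MIB = 1024, 1024 * 1024

def estimate(uncompressed_size, checkpoint_every, window=32 * KIB):
    checkpoints = max(1, uncompressed_size // checkpoint_every)
    raw_window_bytes = checkpoints * window   # checkpoint payload before any compression
    worst_case_inflate = checkpoint_every     # max bytes re-inflated per random read
    return checkpoints, raw_window_bytes, worst_case_inflate

# Hypothetical 1 GiB uncompressed stream, just to show the shape of the tradeoff:
for every in (32 * KIB, 1 * MIB, 32 * MIB):
    n, raw, seek = estimate(1024 * MIB, every)
    print(f"{every // KIB:>6} KiB spacing: {n:>6} checkpoints, "
          f"~{raw // MIB} MiB of raw windows, up to {seek // KIB} KiB inflated per lookup")
```

With 32 KiB spacing the stored windows add up to roughly the whole uncompressed stream (before any compression of the windows), which is consistent with the much larger index seen in the 32K run above; with 32 MiB spacing the window cost is about 1000x smaller, at the price of inflating up to 32 MiB per random read.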
I'm going to close this off; thank you for your in-depth comments!
I am trying to index the CommonCrawl 'Host-Level Web Graphs' files: http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
Specifically, the "Host-level graph" files pointed to by "cc-main-2018-aug-sep-oct-host-vertices.paths.gz". In practice these are a sequence of ~100-200 MB gzip files. The uncompressed content is a line-oriented mapping of id to reverse-hostname:
I am building the index with the following zindex command (tab-delimited, field 2, unique):
zindex --tab-delimiter -f 2 -u -v part-00001-7dfdd744-6a7a-4aa9-ba49-1eb07ccd20c7-c000.txt.gz
The index is significantly larger (10x) than the input gzip file:
The index, however, is functional: