lskatz / mlst-hash-template

This is a template for any new hash-based MLST database
GNU General Public License v3.0

actual allele database compression #23

Closed: lskatz closed this issue 1 year ago

lskatz commented 1 year ago

See how well a database can compress if the allele is not hashed and the "plaintext" algorithm is used instead. The output is otherwise in the same format, with one allele per line.
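As a rough sketch of the difference between the two modes (the locus name and exact column layout here are assumptions for illustration, not taken from the repo):

```shell
# Hypothetical rows for one allele; both modes use the same tab-separated
# layout, only the second column differs.
seq="ACGTACGTACGT"
hash=$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)
printf 'locusA\t%s\n' "$hash"   # hashed entry: fixed-width 32-char hex digest
printf 'locusA\t%s\n' "$seq"    # plaintext entry: the raw sequence itself
```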

lskatz commented 1 year ago

Trying it out here: https://github.com/lskatz/mlst-hash-template/compare/master...plaintext

lskatz commented 1 year ago
$ perl scripts/digestFasta.pl --hash plaintext ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.plaintext
$ cd salmonella_enterica.plaintext
$ ls -lh
total 3.4G
-rw-------. 1 gzu2 users 3.4G Jan 31 17:27 alleles.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
$ sort -k2,2 alleles.tsv > sorted.tsv
$ ls -lhS
total 6.7G
-rw-------. 1 gzu2 users 3.4G Jan 31 17:27 alleles.tsv
-rw-------. 1 gzu2 users 3.4G Jan 31 20:01 sorted.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
$ gzip *.tsv
$ ls -lh 
total 191M
-rw-------. 1 gzu2 users 110M Jan 31 17:27 alleles.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
-rw-------. 1 gzu2 users  74M Jan 31 20:01 sorted.tsv.gz

So the vanilla S. enterica database compresses down to about 74M with default gzip parameters when sorted, versus 110M unsorted. This excludes ref.fasta, which will not be tested here.
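The sort-then-compress step can also be done in one pass, without writing the uncompressed sorted.tsv intermediate. A minimal sketch on a made-up three-row sample (standing in for the real 3.4G alleles.tsv from the session above):

```shell
# Tiny stand-in for alleles.tsv; the real rows come from digestFasta.pl.
printf 'geneB\t2\ngeneA\t1\ngeneA\t3\n' > alleles.tsv
# Sort by the second column and stream straight into gzip.
sort -k2,2 alleles.tsv | gzip -c > sorted.tsv.gz
gzip -dc sorted.tsv.gz | head -n 1   # first row is now geneA<TAB>1
```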

lskatz commented 1 year ago
$ perl scripts/digestFasta.pl --hash md5 ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.md5
$ cd salmonella_enterica.md5
$ ls -lhS
total 152M
-rw-------. 1 gzu2 users 145M Jan 31 20:09 alleles.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
$ sort -k2,2 alleles.tsv > sorted.tsv
$ ls -lhS
total 297M
-rw-------. 1 gzu2 users 145M Jan 31 20:09 alleles.tsv
-rw-------. 1 gzu2 users 145M Jan 31 20:10 sorted.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
$ gzip alleles.tsv & gzip sorted.tsv
[1] 20744
[1]+  Done                    gzip alleles.tsv
$ ls -lhS
total 117M
-rw-------. 1 gzu2 users  58M Jan 31 20:10 sorted.tsv.gz
-rw-------. 1 gzu2 users  52M Jan 31 20:09 alleles.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta

OK, and in this case with md5, compression takes alleles.tsv from 145M down to 52M. Surprisingly, sorting on the hash (the sort -k2,2 above) makes the compressed size go up, to 58M.

$ zcat alleles.tsv.gz | sort -k1,1 | gzip -c > alleles.k1.tsv.gz
$ zcat alleles.tsv.gz | sort -k1,2 | gzip -c > alleles.k12.tsv.gz
$ ls -lhS
total 220M
-rw-------. 1 gzu2 users  58M Jan 31 20:10 sorted.tsv.gz
-rw-------. 1 gzu2 users  52M Jan 31 20:09 alleles.tsv.gz
-rw-------. 1 gzu2 users  52M Jan 31 20:15 alleles.k12.tsv.gz
-rw-------. 1 gzu2 users  52M Jan 31 20:13 alleles.k1.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta

And sorting on the first field, or on the first and then second field, doesn't give any gain either: both stay at 52M, the same as the unsorted file.
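A plausible explanation: md5 output is effectively uniform random hex, so ordering rows by the hash column brings no similar rows together for gzip's 32 KB window to exploit, while destroying whatever locality the unsorted file had. A small synthetic check (file name is made up; exact byte counts will vary by gzip version):

```shell
# Generate 500 random-looking hashes (md5 digests of the integers 1..500).
seq 1 500 | while read -r i; do
  printf '%s' "$i" | md5sum | cut -d' ' -f1
done > hashes.txt
# Sorted vs unsorted random hex compresses to near-identical sizes:
gzip -c hashes.txt | wc -c
sort hashes.txt | gzip -c | wc -c
```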

lskatz commented 1 year ago

What about sha256?

$ perl scripts/digestFasta.pl --hash sha256 ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.sha256
$ cat alleles.tsv | sort -k1,2 | gzip -c > alleles.k12.tsv.gz &
[1] 26373
$ cat alleles.tsv | sort -k1,1 | gzip -c > alleles.k1.tsv.gz &
[2] 26432
$ cat alleles.tsv | sort -k2,2 | gzip -c > alleles.k2.tsv.gz &
[3] 26521
$ ls -lhS
total 513M
-rw-------. 1 gzu2 users 212M Jan 31 20:18 alleles.tsv
-rw-------. 1 gzu2 users 103M Jan 31 20:20 alleles.k2.tsv.gz
-rw-------. 1 gzu2 users  96M Jan 31 20:20 alleles.k12.tsv.gz
-rw-------. 1 gzu2 users  96M Jan 31 20:20 alleles.k1.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:18 ref.fasta

In this case, sorting on the hash (field 2) again did not help at all; it gave the largest compressed file, 103M. Compression itself still took us from 212M down to 96M.
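The gap between md5 (52M) and sha256 (96M) tracks the digest length: hex-encoded md5 is 32 characters per allele and sha256 is 64, and since random hex carries roughly 4 bits of entropy per character, gzip cannot shrink the hash column much below about 16 and 32 bytes per row, respectively. Checking the lengths (sha256sum is GNU coreutils):

```shell
# md5 is 128 bits -> 32 hex chars; sha256 is 256 bits -> 64 hex chars,
# so every sha256 row carries twice the incompressible payload.
printf 'ACGT' | md5sum    | cut -d' ' -f1 | tr -d '\n' | wc -c   # 32
printf 'ACGT' | sha256sum | cut -d' ' -f1 | tr -d '\n' | wc -c   # 64
```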

lskatz commented 1 year ago

So in the end, sha256 hashes compressed worst, getting us down to only 96M; plaintext reached 74M; and md5 did best, at 52M.