sXperfect / gvc

Genomic Variant Codec (GVC)

Discrepancy in Results When Replicating Compression of 1000 Genomes Project Dataset Using GVC Compression Tool #1

Open luo-xiaolong opened 11 months ago

luo-xiaolong commented 11 months ago

Hello,

I have been attempting to replicate your code but found significant discrepancies between my results and those reported in your paper. The environment I used is as follows:

The computer used in the tests had the following configuration:

- CPU: 2 Intel Xeon E5-2620 v3 CPUs, 6 double-threaded cores per CPU (24 threads in total), each clocked at 2.4 GHz
- RAM: 128 GiB
- Storage: 1 SSD of 2 TiB, with a buffered read speed of 456.04 MB/sec as reported by hdparm -t
- Python version: 3.8

My replication steps were:

1. Clone the repository: git clone https://github.com/sXperfect/gvc
2. Run setup: bash setup.sh
3. Use the encoder: JBIG (jbgtopbm85)
4. Install the package: python setup.py install
5. Execute the command: python -m gvc encode 1000GP3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz chr1.gvc

I expected chr1.gvc to be around 70 MB, but the file I obtained is 321.12 MB, and that figure does not include the chr1.gvc.metadata folder. Could you please suggest possible reasons for this discrepancy and any potential solutions?
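For anyone comparing sizes the same way, here is a minimal sketch for measuring the output with and without the metadata folder. The helper names `total_size_bytes` and `to_mb` are illustrative and not part of GVC, and the sketch assumes decimal megabytes (10**6 bytes); whether the figure in the paper uses MB or MiB is not stated in this thread.

```python
import os


def total_size_bytes(path):
    """Size of a file, or the summed size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # os.walk yields nothing for a missing path, so the result is 0 in that case.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def to_mb(n_bytes):
    """Convert bytes to decimal megabytes (1 MB = 10**6 bytes)."""
    return n_bytes / 1_000_000


if __name__ == "__main__":
    # Paths taken from the report above; adjust to your own output location.
    for p in ["chr1.gvc", "chr1.gvc.metadata"]:
        if os.path.exists(p):
            print(f"{p}: {to_mb(total_size_bytes(p)):.2f} MB")
```

Reporting both numbers separately makes it clear which one is being compared against the published result.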

sXperfect commented 11 months ago

Thank you for using our tool!

The discrepancy comes from the transformations: by default, all of them are turned off. To see the full list of options, run python3 -m gvc encode --help.

To achieve the best performance, use --sort-rows, --sort-cols, and --binarisation row_bin_split, and set the block size accordingly (see Table 1 in our paper for the parameter values).
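As a rough sketch of how the recommended flags could be combined with the original command, the snippet below builds the argument list in Python and prints it as a shell line. Only the three flags quoted above are used. The block-size option is deliberately left out because its flag name is not given in this thread; `extra_args` is a hypothetical pass-through for it, to be filled in from the `--help` output and Table 1 of the paper. Placing the options before the two positional paths is also an assumption about how the CLI parses its arguments.

```python
import shlex

# Flags quoted in the reply above; nothing else is assumed about the CLI.
RECOMMENDED_FLAGS = ["--sort-rows", "--sort-cols", "--binarisation", "row_bin_split"]


def build_encode_cmd(vcf_path, out_path, extra_args=()):
    """Return the argument list for `python3 -m gvc encode` with the recommended flags.

    extra_args is a hypothetical pass-through, e.g. for the block-size
    option, whose exact flag name should be copied from `--help`.
    """
    return [
        "python3", "-m", "gvc", "encode",
        *RECOMMENDED_FLAGS,
        *extra_args,
        vcf_path,
        out_path,
    ]


cmd = build_encode_cmd(
    "1000GP3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz",
    "chr1.gvc",
)
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list keeps the flag set in one place when the same settings are applied to several chromosomes.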