Open shengqh opened 8 months ago
Update: when I tested chr1 (32,355,811 variants) on a local computer using Singularity instead of Docker, with 200 GB of Spark memory, it also failed.
Although the test is still running, I am fairly sure the following solution solves the problem:
#https://discuss.hail.is/t/i-get-a-negativearraysizeexception-when-loading-a-plink-file/899
export PYSPARK_SUBMIT_ARGS="--driver-java-options '-XX:hashCode=0' --conf 'spark.executor.extraJavaOptions=-XX:hashCode=0' pyspark-shell"
Hey @shengqh !
Yeah, this is a bug in Kryo, a serialization library used by Spark, which cannot handle the size of data you're producing.
This is partly a deficiency in Hail: we assume that PLINK files are relatively small, in particular that the number of variants is small.
This issue was supposedly resolved in Spark 2.4.0+ and 3.0.0+ by https://github.com/apache/spark/commit/3e033035a3c0b7d46c2ae18d0d322d4af3808711. You appear to be running Apache Spark version 3.3.2, so I'm surprised you encountered this. Can you confirm which version of the Kryo JAR you have in your environment?
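For intuition: a `NegativeArraySizeException` in this kind of code typically comes from a buffer-capacity computation overflowing Java's signed 32-bit `int`. The sketch below (my own illustration of the arithmetic, not Kryo's actual code) simulates what happens when a grow-by-doubling step wraps negative:

```python
def java_int(x: int) -> int:
    """Wrap a Python int into Java's signed 32-bit int range."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def doubled_capacity(capacity: int) -> int:
    # A naive grow-by-doubling step, as many buffer implementations use.
    # With 32-bit int arithmetic, doubling a large capacity wraps negative.
    return java_int(capacity * 2)

# Doubling a ~1.5 GB buffer wraps to a negative size; in Java, passing that
# to `new byte[size]` raises NegativeArraySizeException.
print(doubled_capacity(1_500_000_000))  # -1294967296
```

This is why the failure depends on total serialized size: tens of millions of variants in one serialized object is enough to push a buffer past 2 GiB.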
Can you also share a bit of information about this PLINK file? `import_plink` could obviously be modified to support 30M+ variant PLINK files, but I'd like to understand better why such large PLINK files exist. Do you expect these files to continue to grow in size? Do other consumers of these PLINK files want one PLINK file per chromosome? Would it be possible to generate many PLINK files per chromosome such that all the PLINK files have roughly the same size in bytes?
Thanks for your feedback and help improving Hail!
Related issue: https://github.com/hail-is/hail/issues/5564.
@danking:
> Hey @shengqh !
> Yeah, this is a bug in Kryo, a serialization library used by Spark, which cannot handle the size of data you're producing.
> This is partly a deficiency in Hail: we assume that PLINK files are relatively small, in particular that the number of variants is small.
> This issue was supposedly resolved in Spark 2.4.0+ and 3.0.0+ by apache/spark@3e03303. You appear to be running Apache Spark version 3.3.2, so I'm surprised you encountered this. Can you confirm which version of the Kryo JAR you have in your environment?
I don't know the Kryo JAR version. I tested with both Docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.
> Can you also share a bit of information about this PLINK file? `import_plink` could obviously be modified to support 30M+ variant PLINK files, but I'd like to understand better why such large PLINK files exist. Do you expect these files to continue to grow in size? Do other consumers of these PLINK files want one PLINK file per chromosome? Would it be possible to generate many PLINK files per chromosome such that all the PLINK files have roughly the same size in bytes?
We have a 35K-sample cohort; the VCF for chr1 alone is 2.4 TB, so we prefer to deliver PLINK bed format and Hail MatrixTable. The cohort will continue to grow in the future, and I would prefer to keep one file per chromosome.
For a large cohort, which format do you recommend: Hail MatrixTable or Hail VDS?
> Thanks for your feedback and help improving Hail!
> Related issue: #5564.
> We have a 35K cohort. The VCF format of chr1 is 2.4T.
Heh. So, yes, "project" VCFs grow super-linearly in the number of samples. I (and others) are currently pushing very hard for the VCF spec to support two sparse representations: "local alleles" (samtools/hts-specs#434) and "reference blocks" (samtools/hts-specs#435). When using these two sparse representations, you should be able to store 35,000 whole genomes in ~10TiB of GZIP-compressed VCF.
What is your calling pipeline? Do you generate GVCFs? If yes, I strongly recommend you use the VDS Combiner to produce a VDS. You can read more details in this recent preprint we wrote, but a VDS of 35,000 whole genomes should be a few terabytes. I'd guess 4 TiB, but it depends on your reference block granularity. I strongly recommend using size 10 GQ buckets.
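For readers landing here: the VDS Combiner run recommended above looks roughly like the following. This is a sketch with placeholder bucket paths and sample names; it requires Hail on a Spark cluster, so check the argument names against your installed Hail version.

```python
import hail as hl

hl.init()

# Placeholder paths; replace with your own GVCFs and output locations.
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/cohort.vds',
    temp_path='gs://my-bucket/tmp/',
    gvcf_paths=[
        'gs://my-bucket/gvcfs/sample1.g.vcf.gz',
        'gs://my-bucket/gvcfs/sample2.g.vcf.gz',
    ],
    reference_genome='GRCh38',
)
combiner.run()

# The result can then be read back as a VDS.
vds = hl.vds.read_vds('gs://my-bucket/cohort.vds')
```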
> I don't know the Kryo JAR. I tested on both docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.
Those should use Kryo 4.0.2. OK. My conclusion is that Kryo still has a bug preventing the serialization of very large objects. This becomes a limitation in Hail: we cannot support PLINK files with tens of millions of variants. Our community is largely transitioning to GVCFs and VDS, so I doubt we'll improve our PLINK1 importer to support such large PLINK1 files. That said, PRs are always welcome if loading such large PLINK1 files is a hard requirement for you all.
> > We have a 35K cohort. The VCF format of chr1 is 2.4T.
>
> Heh. So, yes, "project" VCFs grow super-linearly in the number of samples. I (and others) are currently pushing very hard for the VCF spec to support two sparse representations: "local alleles" (samtools/hts-specs#434) and "reference blocks" (samtools/hts-specs#435). When using these two sparse representations, you should be able to store 35,000 whole genomes in ~10TiB of GZIP-compressed VCF.
> What is your calling pipeline? Do you generate GVCFs? If yes, I strongly recommend you use the VDS Combiner to produce a VDS. You can read more details in this recent preprint we wrote, but a VDS of 35,000 whole genomes should be a few terabytes. I'd guess 4 TiB, but it depends on your reference block granularity. I strongly recommend using size 10 GQ buckets.
It looks like VDS is a better solution than a Hail MatrixTable. However, we already have the joint-call result as a VCF. Can the VDS Combiner read a joint-call VCF and then save it in VDS format? I cannot find any example of converting VCF to VDS. Thanks.
> > I don't know the Kryo JAR. I tested on both docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.
>
> Those should use Kryo 4.0.2. OK. My conclusion is that Kryo still has a bug preventing the serialization of very large objects. This becomes a limitation in Hail: we cannot support PLINK files with tens of millions of variants. Our community is largely transitioning to GVCFs and VDS, so I doubt we'll improve our PLINK1 importer to support such large PLINK1 files. That said, PRs are always welcome if loading such large PLINK1 files is a hard requirement for you all.
What happened?
I wrote a bed2hailmatrix workflow and ran it on the Terra platform to convert from PLINK bed format to Hail MatrixTable format.
https://github.com/shengqh/warp/blob/develop/pipelines/vumc_biostatistics/genotype/VUMCBed2HailMatrix.wdl
The code is pretty simple:
When I tested it on chr12, with 34,523 samples and 18,377,527 variants from one of my datasets in Terra (100 GB was allocated for this task), it failed with this error message:
The interesting thing is that when I converted exactly the same data on a local computer using Singularity instead of Docker, it worked. Also, the other chromosomes with the same samples but fewer variants, such as chr13, worked well in Terra.
Since we will be converting multiple PLINK files to Hail MatrixTable format on the Terra platform in the future, I need to figure out the problem. Any advice would be appreciated.
Version
0.2.127
Relevant log output