hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

NegativeArraySizeException when converting plink bed to hailmatrix on large chromosome #14168

Open shengqh opened 8 months ago

shengqh commented 8 months ago

What happened?

I wrote a bed2hailmatrix workflow and ran it on the Terra platform to convert from PLINK bed format to Hail MatrixTable format.

https://github.com/shengqh/warp/blob/develop/pipelines/vumc_biostatistics/genotype/VUMCBed2HailMatrix.wdl

The code is pretty simple:

import hail as hl

hl.init(spark_conf={"spark.driver.memory": "~{memory_gb}g"})

#contig_recoding is hard coded for human only
dsplink = hl.import_plink(bed="~{source_bed}",
                          bim="~{source_bim}",
                          fam="~{source_fam}",
                          reference_genome="~{reference_genome}",
                          contig_recoding={
                            '1': 'chr1',
                            '2': 'chr2',
                            '3': 'chr3',
                            '4': 'chr4',
                            '5': 'chr5',
                            '6': 'chr6',
                            '7': 'chr7',
                            '8': 'chr8',
                            '9': 'chr9',
                            '10': 'chr10',
                            '11': 'chr11',
                            '12': 'chr12',
                            '13': 'chr13',
                            '14': 'chr14',
                            '15': 'chr15',
                            '16': 'chr16',
                            '17': 'chr17',
                            '18': 'chr18',
                            '19': 'chr19',
                            '20': 'chr20',
                            '21': 'chr21',
                            '22': 'chr22',
                            'X': 'chrX',
                            'Y': 'chrY',
                            'MT': 'chrM'})

dsplink.write("~{target_prefix}", overwrite=True)
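(For reference, the long contig_recoding literal above can also be built programmatically; a small equivalent sketch, assuming the human contigs 1-22 plus X, Y, and MT used here:)

# equivalent construction of the recoding map above
contig_recoding = {str(i): f'chr{i}' for i in range(1, 23)}
contig_recoding.update({'X': 'chrX', 'Y': 'chrY', 'MT': 'chrM'})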

When I tested it on chr12, with 34523 samples and 18377527 variants from one of my datasets in Terra (100 GB of memory was allocated for this task), it failed with this error message:

java.lang.NegativeArraySizeException: null
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:542)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:306)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:307)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41)
at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:681)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:616)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:272)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:258)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:28)
at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:23)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:270)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:321)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:323)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1539)
at is.hail.backend.spark.SparkBackend.broadcast(SparkBackend.scala:411)
at is.hail.io.plink.MatrixPLINKReader.executeGeneric(LoadPlink.scala:390)
at is.hail.io.plink.MatrixPLINKReader.lower(LoadPlink.scala:561)
at is.hail.expr.ir.TableReader.lower(TableIR.scala:663)
at is.hail.expr.ir.lowering.LowerTableIR$.applyTable(LowerTableIR.scala:1062)
at is.hail.expr.ir.lowering.LowerTableIR$.lower$1(LowerTableIR.scala:728)
at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:1021)
at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:27)
at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:11)
at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:91)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:59)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:64)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:83)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:78)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:45)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:601)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:637)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:632)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:631)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:64)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:631)
at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:89)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:822)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:794)
at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:199)
at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:544)
at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:509)
at java.lang.Thread.run(Thread.java:750)

The interesting thing is that when I tried to convert exactly the same data on a local computer using Singularity instead of Docker, it worked. Also, the other chromosomes with fewer variants but the same samples, such as chr13, worked well in Terra.

Since we will convert multiple PLINK files to Hail MatrixTables on the Terra platform in the future, I need to figure this problem out. Any advice would be appreciated.

Version

0.2.127

Relevant log output

2024/01/17 20:20:25 Starting container setup.
2024/01/17 20:20:26 Done container setup.
2024/01/17 20:20:27 Starting localization.
2024/01/17 20:20:34 Localization script execution started...
2024/01/17 20:20:34 Localizing input gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/script -> /cromwell_root/script
2024/01/17 20:20:36 Localizing input gs://hui-sandbox/ICA-AGD/plink1/chr12.bed -> /cromwell_root/hui-sandbox/ICA-AGD/plink1/chr12.bed
2024/01/17 20:59:18 Localizing input gs://hui-sandbox/ICA-AGD/plink1/chr12.fam -> /cromwell_root/hui-sandbox/ICA-AGD/plink1/chr12.fam
2024/01/17 20:59:18 Localizing input gs://hui-sandbox/ICA-AGD/plink1/chr12.bim -> /cromwell_root/hui-sandbox/ICA-AGD/plink1/chr12.bim
Copying gs://hui-sandbox/ICA-AGD/plink1/chr12.fam...
Copying gs://hui-sandbox/ICA-AGD/plink1/chr12.bim...
Operation completed over 2 objects/369.7 MiB.
2024/01/17 20:59:27 Localization script execution complete.
2024/01/17 20:59:38 Done localization.
2024/01/17 20:59:39 Running user action: docker run -v /mnt/local-disk:/cromwell_root --entrypoint=/bin/bash hailgenetics/hail@sha256:3f22576793ce3161893aed2bd40949b1fc822d2b7e6517dc0ac993b62badaff8 /cromwell_root/script
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.81879b1c
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.81879b1c
24/01/17 20:59:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.2
SparkUI available at http://523bc6a27b69:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.127-bb535cd096c5
LOGGING: writing to /cromwell_root/hail-20240117-2059-0.2.127-bb535cd096c5.log
2024-01-17 21:01:32.019 Hail: INFO: Found 34523 samples in fam file.
2024-01-17 21:01:32.020 Hail: INFO: Found 18377527 variants in bim file.
2024-01-17 21:02:45.920 Hail: INFO: Found 34523 samples in fam file.
2024-01-17 21:02:45.920 Hail: INFO: Found 18377527 variants in bim file.
Traceback (most recent call last):
File "<stdin>", line 38, in <module>
File "<decorator-gen-1366>", line 2, in write
File "/usr/local/lib/python3.10/dist-packages/hail/typecheck/check.py", line 584, in wrapper
return __original_func(*args_, **kwargs_)
File "/usr/local/lib/python3.10/dist-packages/hail/matrixtable.py", line 2807, in write
Env.backend().execute(ir.MatrixWrite(self._mir, writer))
File "/usr/local/lib/python3.10/dist-packages/hail/backend/backend.py", line 190, in execute
raise e.maybe_user_error(ir) from None
File "/usr/local/lib/python3.10/dist-packages/hail/backend/backend.py", line 188, in execute
result, timings = self._rpc(ActionTag.EXECUTE, payload)
File "/usr/local/lib/python3.10/dist-packages/hail/backend/py4j_backend.py", line 220, in _rpc
raise fatal_error_from_java_error_triplet(
hail.utils.java.FatalError: NegativeArraySizeException: null

Java stack trace:
com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.GenericRow)
locusAlleles (is.hail.io.plink.PlinkVariant)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:270)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:321)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:323)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1539)
at is.hail.backend.spark.SparkBackend.broadcast(SparkBackend.scala:411)
at is.hail.io.plink.MatrixPLINKReader.executeGeneric(LoadPlink.scala:390)
at is.hail.io.plink.MatrixPLINKReader.lower(LoadPlink.scala:561)
at is.hail.expr.ir.TableReader.lower(TableIR.scala:663)
at is.hail.expr.ir.lowering.LowerTableIR$.applyTable(LowerTableIR.scala:1062)
at is.hail.expr.ir.lowering.LowerTableIR$.lower$1(LowerTableIR.scala:728)
at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:1021)
at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:27)
at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:11)
at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:91)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:59)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:64)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:83)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:78)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:45)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:601)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:637)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:632)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:631)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:64)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:631)
at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:89)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:822)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:794)
at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:199)
at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:544)
at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:509)
at java.lang.Thread.run(Thread.java:750)

java.lang.NegativeArraySizeException: null
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:542)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:306)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:307)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41)
at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:681)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:616)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:272)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:258)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:28)
at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:23)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:270)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:321)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:323)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1539)
at is.hail.backend.spark.SparkBackend.broadcast(SparkBackend.scala:411)
at is.hail.io.plink.MatrixPLINKReader.executeGeneric(LoadPlink.scala:390)
at is.hail.io.plink.MatrixPLINKReader.lower(LoadPlink.scala:561)
at is.hail.expr.ir.TableReader.lower(TableIR.scala:663)
at is.hail.expr.ir.lowering.LowerTableIR$.applyTable(LowerTableIR.scala:1062)
at is.hail.expr.ir.lowering.LowerTableIR$.lower$1(LowerTableIR.scala:728)
at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:1021)
at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:27)
at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:11)
at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:91)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:59)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:64)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:83)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:78)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:45)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:601)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:637)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:632)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:631)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:77)
at is.hail.utils.package$.using(package.scala:665)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:64)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:631)
at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:89)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:822)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:794)
at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:199)
at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:544)
at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:509)
at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.127-bb535cd096c5
Error summary: NegativeArraySizeException: null
tar: chr12: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
2024/01/17 21:10:34 Starting delocalization.
2024/01/17 21:10:34 Delocalization script execution started...
2024/01/17 21:10:34 Delocalizing output /cromwell_root/memory_retry_rc -> gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/memory_retry_rc
2024/01/17 21:10:37 Delocalizing output /cromwell_root/rc -> gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/rc
2024/01/17 21:10:39 Delocalizing output /cromwell_root/stdout -> gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/stdout
2024/01/17 21:10:41 Delocalizing output /cromwell_root/stderr -> gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/stderr
2024/01/17 21:10:42 Delocalizing output /cromwell_root/chr12.tar.gz -> gs://fc-5a8938eb-1299-4afc-957f-afb53ef602b9/submissions/e8747e74-47d1-4f52-acfc-1ac7f81d79ba/VUMCBed2HailMatrix/683447d9-9342-4058-bcfc-ba21422d3121/call-Bed2HailMatrix/chr12.tar.gz
2024/01/17 21:10:44 Delocalization script execution complete.
2024/01/17 21:10:44 Done delocalization.
shengqh commented 8 months ago

Update: when I tested chr1, with 32355811 variants, on a local computer using Singularity instead of Docker with 200 GB of Spark memory, it also failed.

shengqh commented 8 months ago

Although the test is still running, I am pretty sure the following workaround solved the problem.

#https://discuss.hail.is/t/i-get-a-negativearraysizeexception-when-loading-a-plink-file/899

export PYSPARK_SUBMIT_ARGS="--driver-java-options '-XX:hashCode=0' --conf 'spark.executor.extraJavaOptions=-XX:hashCode=0' pyspark-shell"
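(For a pure-Python script, a hedged sketch of applying the same workaround: PYSPARK_SUBMIT_ARGS is read by pyspark when the driver JVM is launched, so setting it before hl.init() should have the same effect; the memory value below is illustrative, not from this thread.)

import os

# must be set before hl.init(), which starts the Spark driver JVM
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-java-options '-XX:hashCode=0' "
    "--conf 'spark.executor.extraJavaOptions=-XX:hashCode=0' "
    "pyspark-shell"
)

import hail as hl
hl.init(spark_conf={"spark.driver.memory": "100g"})  # memory value is illustrative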
danking commented 8 months ago

Hey @shengqh !

Yeah, this is a bug in Kryo, a serialization library used by Spark, which cannot handle the size of data you're producing.

This is partly a deficiency in Hail: we assume that PLINK files are relatively small, in particular that the number of variants is small.

This issue was supposedly resolved in Spark 2.4.0+ and 3.0.0+ by https://github.com/apache/spark/commit/3e033035a3c0b7d46c2ae18d0d322d4af3808711 . You appear to be running Apache Spark version 3.3.2, so I'm surprised you encountered this. Can you confirm which version of the Kryo JAR you have in your environment?
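(If it helps to check: pip- and Docker-installed Spark keeps its dependency JARs under pyspark/jars, so a sketch like the following should show the bundled Kryo version; the filename pattern is an assumption.)

import glob
import os
import pyspark

# list the Kryo JAR(s) bundled with the installed pyspark
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print([os.path.basename(p) for p in glob.glob(os.path.join(jars_dir, "kryo*.jar"))])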

Can you also share a bit of information about this PLINK file? import_plink could obviously be modified to support 30M+ variant PLINK files, but I'd like to understand better why such large PLINK files exist. Do you expect these files to continue to grow in size? Do other consumers of these PLINK files want one PLINK file per chromosome? Would it be possible to generate many PLINK files per chromosome such that all the PLINK files have roughly the same size in bytes?

Thanks for your feedback and help improving Hail!

Related issue: https://github.com/hail-is/hail/issues/5564 .

shengqh commented 8 months ago

@danking:

> Hey @shengqh !
>
> Yeah, this is a bug in Kryo, a serialization library used by Spark, which cannot handle the size of data you're producing.
>
> This is partly a deficiency in Hail: we assume that PLINK files are relatively small, in particular that the number of variants is small.
>
> This issue was supposedly resolved in Spark 2.4.0+ and 3.0.0+ by apache/spark@3e03303 . You appear to be running Apache Spark version 3.3.2, so I'm surprised you encountered this. Can you confirm which version of the Kryo JAR you have in your environment?

I don't know the Kryo JAR. I tested on both docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.

> Can you also share a bit of information about this PLINK file? import_plink could obviously be modified to support 30M+ variant PLINK files, but I'd like to understand better why such large PLINK files exist. Do you expect these files to continue to grow in size? Do other consumers of these PLINK files want one PLINK file per chromosome? Would it be possible to generate many PLINK files per chromosome such that all the PLINK files have roughly the same size in bytes?

We have a 35K cohort. The VCF format of chr1 is 2.4T. So we prefer to deliver PLINK bed format and Hail MatrixTables. The cohort will continue to grow in the future, and I would prefer to keep one file per chromosome.

For a large cohort, which format do you prefer: Hail MatrixTable or Hail VDS?

> Thanks for your feedback and help improving Hail!
>
> Related issue: #5564 .

danking commented 8 months ago

> We have a 35K cohort. The VCF format of chr1 is 2.4T.

Heh. So, yes, "project" VCFs grow super-linearly in the number of samples. I (and others) are currently pushing very hard for the VCF spec to support two sparse representations: "local alleles" (samtools/hts-specs#434) and "reference blocks" (samtools/hts-specs#435). When using these two sparse representations, you should be able to store 35,000 whole genomes in ~10TiB of GZIP-compressed VCF.

What is your calling pipeline? Do you generate GVCFs? If yes, I strongly recommend you use the VDS Combiner to produce a VDS. You can read more details in this recent preprint we wrote, but a VDS of 35,000 whole genomes should be a few terabytes. I'd guess 4 TiB, but it depends on your reference block granularity. I strongly recommend using size 10 GQ buckets.
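(A minimal sketch of the GVCF-to-VDS path recommended above, using hl.vds.new_combiner; the bucket paths and the interval option below are illustrative placeholders, not values from this thread.)

import hail as hl

# combine per-sample GVCFs into a single sparse VDS
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/cohort.vds',
    temp_path='gs://my-bucket/tmp/',
    gvcf_paths=['gs://my-bucket/gvcfs/sample1.g.vcf.gz',
                'gs://my-bucket/gvcfs/sample2.g.vcf.gz'],
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
)
combiner.run()

vds = hl.vds.read_vds('gs://my-bucket/cohort.vds')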


> I don't know the Kryo JAR. I tested on both docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.

Those should use Kryo 4.0.2. OK. My conclusion is that Kryo still has a bug preventing the serialization of very large objects. This becomes a limitation in Hail: we cannot support PLINK files with tens of millions of variants. Our community is largely transitioning to GVCFs and VDS, so I doubt we'll improve our PLINK1 importer to support such large PLINK1 files. That said, PRs are always welcome if loading such large PLINK1 files is a hard requirement for you all.

shengqh commented 7 months ago

> We have a 35K cohort. The VCF format of chr1 is 2.4T.
>
> Heh. So, yes, "project" VCFs grow super-linearly in the number of samples. I (and others) are currently pushing very hard for the VCF spec to support two sparse representations: "local alleles" (samtools/hts-specs#434) and "reference blocks" (samtools/hts-specs#435). When using these two sparse representations, you should be able to store 35,000 whole genomes in ~10TiB of GZIP-compressed VCF.
>
> What is your calling pipeline? Do you generate GVCFs? If yes, I strongly recommend you use the VDS Combiner to produce a VDS. You can read more details in this recent preprint we wrote, but a VDS of 35,000 whole genomes should be a few terabytes. I'd guess 4 TiB, but it depends on your reference block granularity. I strongly recommend using size 10 GQ buckets.

It looks like VDS is a better solution than a Hail MatrixTable. However, we already have the joint-call result as a VCF. Can the VDS Combiner read a joint-call VCF and then save it in VDS format? I cannot find any example of converting a VCF to a VDS. Thanks.

> I don't know the Kryo JAR. I tested on both docker images hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.
>
> Those should use Kryo 4.0.2. OK. My conclusion is that Kryo still has a bug preventing the serialization of very large objects. This becomes a limitation in Hail: we cannot support PLINK files with tens of millions of variants. Our community is largely transitioning to GVCFs and VDS, so I doubt we'll improve our PLINK1 importer to support such large PLINK1 files. That said, PRs are always welcome if loading such large PLINK1 files is a hard requirement for you all.