projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

java.lang.ArrayIndexOutOfBoundsException when writing to vcf #421

Closed. Opened by sandra-selfdecode, closed 5 months ago.

sandra-selfdecode commented 2 years ago

I imported VCFs from several projects and combined them into one Delta table. When I try to write from the Delta table to a VCF, I keep getting a java.lang.ArrayIndexOutOfBoundsException.

Can you give me suggestions for what might cause this problem? It seems to be related to the genotypes column. It works if I only select genotypes.calls.

Py4JJavaError Traceback (most recent call last)

in
----> 1 extract_vcfs(kg_silver, 'collections_chr', regions='20')

in extract_vcfs(delta_path, vcf_prefix, **kwargs)
    299     if out_df.count() > 0:
    300         #out_df.show()
--> 301         out_df.write.format('bigvcf').mode('overwrite').save(f'{vcf_prefix}{contig}.vcf.bgz')
    302
    303

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
   1134             self._jwrite.save()
   1135         else:
-> 1136             self._jwrite.save(path)
   1137
   1138     @since(1.4)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    108     def deco(*a, **kw):
    109         try:
--> 110             return f(*a, **kw)
    111         except py4j.protocol.Py4JJavaError as e:
    112             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o541.save.

williambrandler commented 2 years ago

Hey Sandra, not sure what the issue is! Could you print the schema for the DataFrame, provide some more info about the dataset (number of variants and samples), and share the code and the full stack trace?
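
For anyone gathering the information requested above, here is a minimal sketch of one way to summarize the genotypes column once a sample of rows has been collected as plain Python dicts. The function name `summarize_genotypes`, the example rows, and the `sampleId` field are assumptions for illustration; only `genotypes` and `genotypes.calls` appear in this thread. Whether differing sample counts or genotype fields between the source projects are related to the exception has not been established here.

```python
from collections import Counter


def summarize_genotypes(rows):
    """Summarize the genotypes column of rows given as plain dicts.

    Assumes (hypothetically) that each row looks like
    {"genotypes": [{"sampleId": ..., "calls": [...]}, ...]}.
    """
    samples_per_row = Counter()  # genotype-array length -> number of rows
    field_sets = Counter()       # genotype field names -> number of genotypes
    for row in rows:
        genotypes = row.get("genotypes") or []
        samples_per_row[len(genotypes)] += 1
        for gt in genotypes:
            # Iterating a dict yields its keys, so this records the field names.
            field_sets[tuple(sorted(gt))] += 1
    return {
        "num_variants": sum(samples_per_row.values()),
        "samples_per_row": dict(samples_per_row),
        "genotype_field_sets": dict(field_sets),
    }


# Made-up rows, standing in for a small sample collected from the table.
example_rows = [
    {"genotypes": [{"sampleId": "A", "calls": [0, 1]},
                   {"sampleId": "B", "calls": [1, 1]}]},
    {"genotypes": [{"sampleId": "A", "calls": [0, 0]}]},
]
summary = summarize_genotypes(example_rows)
print(summary)
```

On a real Spark DataFrame such as `out_df` from the traceback, the schema can be printed with `out_df.printSchema()`, and a sample of rows could be turned into dicts with something like `[r.asDict(recursive=True) for r in out_df.select('genotypes').limit(1000).collect()]`. More than one entry in `samples_per_row` or `genotype_field_sets` would indicate that the combined table is not uniform across rows, which may be useful detail to include in a report.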

henrydavidge commented 5 months ago

Closing since we don't have a reproduction.