opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

OpenCGA-1.4.2 Mongo cannot index proper VCFs #1564

Open jlfrueda opened 4 years ago

jlfrueda commented 4 years ago

In https://github.com/opencb/opencga/issues/1523 we reported that, after indexing, OpenCGA leaves variants with duplicated entries for files and samples (duplicated entries in variants.studies.files.fid and variants.studies.gt..), which then cause mongo duplicate key errors when OpenCGA queries sets containing them.

The same error is causing OpenCGA to fail when loading variants, since the variant merger cannot read them either.

In summary, the same error prevents OpenCGA-1.4.2/mongo from indexing VCF files, in addition to making queries fail.

A trace showing the problem:

[...]
2020-03-31 07:03:55 [pool-2-thread-1] WARN  MongoDBVariantMerger:888 - Overlapping variants in file 38 : [17:79502344:CCC:-, 17:79502346:C:A]
2020-03-31 07:03:55 [pool-2-thread-1] ERROR MongoDBVariantMerger:552 - Error processing variant 17:79502344:CCC:- in overlapped variants [17:79502344:CCC:-, 17:79502346:C:A]
2020-03-31 07:03:55 [pool-2-thread-1] ERROR MongoDBVariantMerger:407 - Error processing variant 17:79502346:C:A
java.lang.IllegalStateException: Duplicate key Document{{fid=4, attrs=Document{{AC=1, MQRankSum=0.0, set=FilteredInAll, FILTER=LowDepth, MQ=60.0, AF=0.5, MLEAC=1, BaseQRankSum=-1.718, ExcessHet=3.0103, QUAL=171.76999999999998, MLEAF=0.5, DP=7, ReadPosRankSum=0.328, AN=2, FS=0.0, QD=24.54, SOR=0.941, ClippingRankSum=0.0}}, sampleData=Document{{ad=org.bson.types.Binary@39619d9, dp=org.bson.types.Binary@b818, gq=org.bson.types.Binary@b874, pl=org.bson.types.Binary@d35f6f7}}}}
        at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
        at java.util.HashMap.merge(HashMap.java:1254)
        at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
        at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToSamplesConverter.convertToDataModelType(DocumentToSamplesConverter.java:191)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToStudyVariantEntryConverter.convertToDataModelType(DocumentToStudyVariantEntryConverter.java:204)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToVariantConverter.convertToDataModelType(DocumentToVariantConverter.java:304)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToVariantConverter.convertToDataModelType(DocumentToVariantConverter.java:39)
        at org.opencb.commons.datastore.mongodb.MongoDBCollection.privateFind(MongoDBCollection.java:256)
        at org.opencb.commons.datastore.mongodb.MongoDBCollection.find(MongoDBCollection.java:205)
        at org.opencb.opencga.storage.mongodb.variant.adaptors.VariantMongoDBAdaptor.get(VariantMongoDBAdaptor.java:501)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.fetchVariant(MongoDBVariantMerger.java:1031)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.mergeOverlappedVariants(MongoDBVariantMerger.java:933)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processOverlappedVariants(MongoDBVariantMerger.java:649)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processOverlappedVariants(MongoDBVariantMerger.java:546)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processVariants(MongoDBVariantMerger.java:398)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.merge(MongoDBVariantMerger.java:363)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.apply(MongoDBVariantMerger.java:291)
        at org.opencb.commons.run.ParallelTaskRunner$TaskRunnable.call(ParallelTaskRunner.java:633)
        at org.opencb.commons.run.ParallelTaskRunner$TaskRunnable.call(ParallelTaskRunner.java:600)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-03-31 07:03:55 [pool-2-thread-1] ERROR MongoDBVariantMerger:296 - Fail loading batch from 17:  79496010:C:T to 18:   3168816:G:A
2020-03-31 07:03:55 [pool-2-thread-1] ERROR ParallelTaskRunner:635 - Error processing batch 286
java.lang.IllegalStateException: Duplicate key Document{{fid=4, attrs=Document{{AC=1, MQRankSum=0.0, set=FilteredInAll, FILTER=LowDepth, MQ=60.0, AF=0.5, MLEAC=1, BaseQRankSum=-1.718, ExcessHet=3.0103, QUAL=171.76999999999998, MLEAF=0.5, DP=7, ReadPosRankSum=0.328, AN=2, FS=0.0, QD=24.54, SOR=0.941, ClippingRankSum=0.0}}, sampleData=Document{{ad=org.bson.types.Binary@39619d9, dp=org.bson.types.Binary@b818, gq=org.bson.types.Binary@b874, pl=org.bson.types.Binary@d35f6f7}}}}
        at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
        at java.util.HashMap.merge(HashMap.java:1254)
        at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
        at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToSamplesConverter.convertToDataModelType(DocumentToSamplesConverter.java:191)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToStudyVariantEntryConverter.convertToDataModelType(DocumentToStudyVariantEntryConverter.java:204)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToVariantConverter.convertToDataModelType(DocumentToVariantConverter.java:304)
        at org.opencb.opencga.storage.mongodb.variant.converters.DocumentToVariantConverter.convertToDataModelType(DocumentToVariantConverter.java:39)
        at org.opencb.commons.datastore.mongodb.MongoDBCollection.privateFind(MongoDBCollection.java:256)
        at org.opencb.commons.datastore.mongodb.MongoDBCollection.find(MongoDBCollection.java:205)
        at org.opencb.opencga.storage.mongodb.variant.adaptors.VariantMongoDBAdaptor.get(VariantMongoDBAdaptor.java:501)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.fetchVariant(MongoDBVariantMerger.java:1031)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.mergeOverlappedVariants(MongoDBVariantMerger.java:933)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processOverlappedVariants(MongoDBVariantMerger.java:649)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processOverlappedVariants(MongoDBVariantMerger.java:546)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.processVariants(MongoDBVariantMerger.java:398)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.merge(MongoDBVariantMerger.java:363)
        at org.opencb.opencga.storage.mongodb.variant.load.variants.MongoDBVariantMerger.apply(MongoDBVariantMerger.java:291)
        at org.opencb.commons.run.ParallelTaskRunner$TaskRunnable.call(ParallelTaskRunner.java:633)
        at org.opencb.commons.run.ParallelTaskRunner$TaskRunnable.call(ParallelTaskRunner.java:600)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-03-31 07:03:55 [pool-2-thread-1] WARN  ParallelTaskRunner:647 - Abort task thread on fail
[...]
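
The top frames of the trace above (Collectors.throwingMerger, HashMap.merge, Collectors.toMap) are what the JDK produces when the two-argument Collectors.toMap meets the same key twice. The following is a minimal sketch of that behaviour, not OpenCGA's converter code: FileEntry, strictIndex and lenientIndex are hypothetical names, and the assumption is that the converter builds a map keyed by fid in a similar way. On Java 8 the exception message prints the colliding value rather than the key, which would explain why the trace shows a whole Document after "Duplicate key".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateFidDemo {

    // Hypothetical stand-in for one entry of variants.studies.files.
    static final class FileEntry {
        final int fid;
        final String attrs;

        FileEntry(int fid, String attrs) {
            this.fid = fid;
            this.attrs = attrs;
        }

        @Override
        public String toString() {
            return "FileEntry{fid=" + fid + ", attrs=" + attrs + "}";
        }
    }

    // Two-argument toMap: throws IllegalStateException as soon as a fid repeats.
    static Map<Integer, FileEntry> strictIndex(List<FileEntry> files) {
        return files.stream()
                .collect(Collectors.toMap(f -> f.fid, f -> f));
    }

    // Three-argument toMap: an explicit merge function keeps the first entry
    // and silently drops later ones with the same fid.
    static Map<Integer, FileEntry> lenientIndex(List<FileEntry> files) {
        return files.stream()
                .collect(Collectors.toMap(f -> f.fid, f -> f, (first, second) -> first));
    }

    public static void main(String[] args) {
        // The same fid stored twice, as in the corrupted variants.
        List<FileEntry> files = Arrays.asList(
                new FileEntry(4, "first copy"),
                new FileEntry(4, "second copy"));

        try {
            strictIndex(files);
        } catch (IllegalStateException e) {
            System.out.println("strict index failed: " + e.getMessage());
        }

        System.out.println("lenient index size: " + lenientIndex(files).size());
    }
}
```

A lenient merge function only hides the symptom on the read side; the duplicated entries would still be stored, so the loader that writes them remains the thing to fix.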
jlfrueda commented 4 years ago

The buggy behavior seems to be related to the batching code in variant transform and load. I set up a couple of very small VCFs (just a hundred variants) for which a fresh OpenCGA produces 2 invalid variants with duplicated keys. Then I reduced the number of threads for transform and load from the default values to just 1. To my surprise (I suspected a synchronization-related bug), the number of incorrect variants doubled to 4. I then increased the batch size to 50000 and the number of errors went back to 2.

I believe this can help localize this bug, which is important to fix in order to make OpenCGA 1.4 with mongo usable.
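
To compare runs with different thread counts and batch sizes, it helps to count the invalid variants mechanically. The following is a rough sketch of the check, assuming the fid values of one variant (variants.studies.files.fid) have already been read into a list; DuplicateFidFinder and findDuplicates are hypothetical names and nothing here uses the OpenCGA or MongoDB APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicateFidFinder {

    // Returns every fid that occurs more than once, in the order in which
    // the first repetition is seen. An empty result means the variant is valid.
    static Set<Integer> findDuplicates(List<Integer> fids) {
        Set<Integer> seen = new HashSet<>();
        Set<Integer> duplicates = new LinkedHashSet<>();
        for (Integer fid : fids) {
            // Set.add returns false when the element was already present.
            if (!seen.add(fid)) {
                duplicates.add(fid);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical fid list for a single variant; fid 4 is stored twice.
        List<Integer> fids = Arrays.asList(38, 4, 4);
        System.out.println("duplicated fids: " + findDuplicates(fids));
    }
}
```

Running this check over every variant after each load gives the per-run count of invalid variants, which makes the effect of changing the thread count or batch size directly comparable.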