Open jlfrueda opened 4 years ago
The bug seems to be related to the batching code in variant transform and load. I set up a couple of very small VCFs (just a hundred variants) for which a fresh OpenCGA produces 2 invalid variants with duplicated keys. I then reduced the number of threads for transform and load from the default values to just 1. To my surprise (I had suspected a synchronization-related bug), the number of incorrect variants doubled to 4. I then increased the batch size to 50000, and the number of errors went back to 2.
I believe this can help localize the bug, and fixing it is important to make OpenCGA 1.4 with mongo usable.
In https://github.com/opencb/opencga/issues/1523 we reported that, after indexing, OpenCGA leaves variants with duplicated entries for files and samples (duplicated entries in variants.studies.files.fid and variants.studies.gt..), which then cause mongo duplicated key errors when OpenCGA queries sets containing them.
The same error is causing OpenCGA to fail when loading variants, since the variant merger cannot read them either.
In summary, the same error prevents OpenCGA-1.4.2/mongo from indexing VCF files, in addition to making queries fail.
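For anyone who wants to check their own data, here is a minimal sketch of how affected variants could be spotted. It assumes the variant documents have already been fetched from mongo as plain Python dicts, and it only relies on the studies.files.fid path mentioned above. The function name, the sample documents, and the rest of the document shape are hypothetical and are not taken from the OpenCGA code.

```python
from collections import Counter


def duplicated_file_ids(variant):
    """Return {study index: sorted fids that appear more than once}.

    `variant` is assumed to be a dict shaped like a stored variant
    document, with a "studies" list whose items hold a "files" list,
    each file carrying a "fid" (assumed layout, based on the
    variants.studies.files.fid path in the report).
    """
    duplicates = {}
    for idx, study in enumerate(variant.get("studies", [])):
        # Count how often each file id occurs within this study.
        counts = Counter(f["fid"] for f in study.get("files", []) if "fid" in f)
        repeated = sorted(fid for fid, n in counts.items() if n > 1)
        if repeated:
            duplicates[idx] = repeated
    return duplicates


# Hypothetical documents, for illustration only.
good = {"studies": [{"files": [{"fid": 1}, {"fid": 2}]}]}
bad = {"studies": [{"files": [{"fid": 1}, {"fid": 1}, {"fid": 2}]}]}

print(duplicated_file_ids(good))
print(duplicated_file_ids(bad))
```

A variant for which this returns a non-empty dict would be one of the invalid ones described here. Counting them after each run is one way to compare thread and batch size settings.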
A trace showing the problem: