openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Batch job of ClinVar deletion variants failing #509

Closed leicray closed 1 year ago

leicray commented 1 year ago

Describe the bug: A user reports that a batch job comprising just deletion variants taken from ClinVar has failed to complete. The variants are in the attached file clinvar_hgvs2.txt.

Several of the variants span undefined regions and so cannot be validated. Other variants state the deleted nucleotides at the end of the description, which is unnecessary. I requested that these trailing nucleotides be removed from the variant descriptions and that variants spanning undefined regions also be removed from the file.
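As an illustration of the clean-up requested above, here is a minimal Python sketch that strips explicit nucleotides following "del" at the end of a description. The function names and the example description are made up for illustration; this is not part of VariantValidator, and it assumes one variant description per line.

```python
import re

# Matches "del" followed by explicit bases at the very end of a description.
# Descriptions such as "delinsAT" or "delATinsGG" are deliberately left alone.
_TRAILING_BASES = re.compile(r"del[ACGTUN]+$")


def strip_deleted_bases(description: str) -> str:
    """Return the description with any bases after a final 'del' removed."""
    return _TRAILING_BASES.sub("del", description.strip())


def clean_lines(lines):
    """Clean every non-empty line; blank lines are dropped."""
    return [strip_deleted_bases(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical input, not taken from the attached files.
    print(strip_deleted_bases("NC_000017.10:g.100_102delGAT"))
```

Removing variants that span undefined regions is not covered here, since identifying them requires knowledge of the reference sequence.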

A revised file (clinvar_hgvs3.txt) was sent by the user, but that file contained more variants, not fewer. A correctly edited version of the original file has been requested to allow meaningful comparison of batch jobs submitted with the original data and the cleaned-up data.

clinvar_hgvs2.txt clinvar_hgvs3.txt

leicray commented 1 year ago

The user has now provided another version of the data file:

clinvar_hgvs4.txt

leicray commented 1 year ago

For information, the variant LRG_161:g.40565_40708del (line 9 in clinvar_hgvs4.txt) fails to validate and triggers an ERROR message.

In case it might be relevant, Ensembl reports "There is no ungapped mapping of this gene onto the GRCh37 assembly."

leicray commented 1 year ago

The user subsequently subdivided the variants into two groups: those for which the genomic reference sequences were derived from GRCh37 (3909 variants) and those for which the genomic reference sequences were derived from GRCh38 (837 variants). The two files are available below.

The 837 GRCh38 variants validated correctly when submitted as a single job.

The 3909 GRCh37 variants failed to validate when submitted as a single job. However, the submitter confirms that sub-dividing the variant descriptions into smaller groups eventually enabled all variants to be validated.

I have confirmed this but still noted some anomalies with regard to the GRCh37 variants. The findings are summarised in the attached Excel spreadsheet.

In summary, all of the GRCh37 variants could be validated, but a single round of subdivision into smaller groups was not always enough: some of the smaller groups still failed and validated only after being subdivided further. This suggests that no variant description in the starting group is inherently incapable of being validated, so the problem must be related to the internal workflow of the batch tool.
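The subdivide-and-retry procedure described above can be sketched in Python as follows. This is only an illustration of the manual process: `submit` is a hypothetical callable standing in for "submit this group as one batch job and report whether it completed", not a real VariantValidator function.

```python
from typing import Callable, List, Sequence, Tuple


def validate_by_bisection(
    variants: Sequence[str],
    submit: Callable[[Sequence[str]], bool],  # hypothetical batch submitter
    min_size: int = 1,
) -> Tuple[List[str], List[str]]:
    """Submit a group; if it fails, halve it and retry each half.

    Returns (validated, failed). A variant lands in `failed` only if a
    group of `min_size` or fewer variants containing it still fails.
    """
    validated: List[str] = []
    failed: List[str] = []
    stack = [list(variants)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        if submit(group):
            validated.extend(group)
        elif len(group) <= min_size:
            failed.extend(group)
        else:
            mid = len(group) // 2
            # Push the second half first so the first half is processed first,
            # which keeps the output in the original order.
            stack.append(group[mid:])
            stack.append(group[:mid])
    return validated, failed
```

If `failed` comes back empty, as was the case for the GRCh37 file, the failures depend on how the variants are grouped rather than on any single description.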

clinvar_hgvs_GRCh37.txt

clinvar_hgvs_GRCh38.txt

job log.xlsx

leicray commented 1 year ago

The "chunks" of GRCh37 variants that fail are:

When lines 3601-3800 were split into 3601-3700 & 3701-3800, each chunk of 100 variants validated correctly.

When lines 3805-3909 were submitted, all variants validated correctly.
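For anyone repeating this kind of test, the line ranges can be generated rather than worked out by hand. The following is a small Python sketch using made-up helper names; it assumes 1-based, inclusive line numbers, as used in this thread.

```python
from typing import List, Sequence, Tuple


def split_line_range(first: int, last: int, size: int) -> List[Tuple[int, int]]:
    """Split an inclusive, 1-based line range into sub-ranges of at most `size` lines."""
    if size < 1 or first < 1 or first > last:
        raise ValueError("expected 1 <= first <= last and size >= 1")
    return [
        (start, min(start + size - 1, last))
        for start in range(first, last + 1, size)
    ]


def lines_in_range(lines: Sequence[str], first: int, last: int) -> List[str]:
    """Return the lines numbered first..last (1-based, inclusive)."""
    return list(lines[first - 1:last])


if __name__ == "__main__":
    print(split_line_range(3601, 3800, 100))
```

Each resulting sub-range can then be written to its own file and submitted as a separate job.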

Peter-J-Freeman commented 1 year ago

Have run both of these files and debugged. All variants now run. Data committed to servers.