Open vicruiser opened 4 years ago
I wrote this for a colleague years ago. As far as I understand, my program only handled alleles of a single character. Give me a few minutes, I'll update the code for multiple nucleotides.
@vicruiser I pushed https://github.com/lindenb/jvarkit/commit/c81f95bf192fd12ecc81a6b9ea159d419cd91caa , please tell me if that works.
Please note: for this kind of deletion (T/TA), I didn't look at the POS in the VCF. But according to the VCF specification, the position should be shifted: https://samtools.github.io/hts-specs/VCFv4.2.pdf
Strings must include the base before the event (which must be reflected in the POS field),
so tell me if you think that the POS is wrong for an INS or a DEL.
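To make the quoted rule concrete, here is a minimal sketch of how a deletion is written with its padding base. The function name and the toy reference string are hypothetical and are not taken from jvarkit; the sketch only illustrates the convention that REF and ALT start with the base before the event and that POS points at that base.

```python
def padded_deletion(reference: str, first_deleted_pos: int, length: int):
    """Return (POS, REF, ALT) for a deletion, VCF style.

    `reference` is a toy sequence; `first_deleted_pos` is the 1-based
    position of the first deleted base. Hypothetical helper, for illustration.
    """
    if first_deleted_pos < 2:
        raise ValueError("this sketch needs a base before the event")
    pos = first_deleted_pos - 1          # POS is the padding base (1-based)
    start = pos - 1                      # 0-based index of the padding base
    ref = reference[start:start + 1 + length]  # padding base + deleted bases
    alt = reference[start]               # only the padding base remains
    return pos, ref, alt


# Deleting the single "A" at position 3 of the toy sequence "GTACG"
print(padded_deletion("GTACG", 3, 1))
```

The same convention applies to insertions, with the longer string in ALT instead of REF.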
It works now! Thanks. I'm going to check carefully the matter about DEL and INS and I'll get back to you.
Hi again,
I think something is wrong with the generated .vcf, although not everything. This is part of the output.vcf for an insertion:
input.bim
10 10:80432024:I:6 0 80432024 CCTACAG C
output.vcf
10 80432024 10:80432024:I:6 C CCTACAG . . MORGAN=0.0;SVTYPE=DEL
and for a deletion:
input.bim
10 10:80509330:D:12 0 80509330 C CCTGGAGCTGGCT
output.vcf
10 80509330 10:80509330:D:12 C CCTGGAGCTGGCT . . MORGAN=0.0;SVTYPE=INS
I can see variants whose labels mark them as deletions (e.g. 1:69726:D:3) categorized as insertions (;SVTYPE=INS), and the other way around.
The reference allele in my input.bim is the last column. I don't know whether that is the cause or whether it is just a labeling problem.
Additionally, I've run VEP to annotate the .vcf generated with bim2vcf and, indeed, something goes wrong with the positions, but only for deletions.
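One way to check the labels independently of the variant ID is to compare allele lengths: in a VCF record, an ALT longer than REF is an insertion and an ALT shorter than REF is a deletion. The following is a minimal sketch of that check; the function name is hypothetical and this is not how bim2vcf itself assigns SVTYPE.

```python
from typing import Optional


def indel_type(ref: str, alt: str) -> Optional[str]:
    """Classify a simple indel from the REF and ALT allele lengths.

    Returns "INS", "DEL", or None when both alleles have the same length.
    Hypothetical helper for checking SVTYPE labels.
    """
    if len(alt) > len(ref):
        return "INS"
    if len(alt) < len(ref):
        return "DEL"
    return None


# REF/ALT as written in the two output.vcf records above
print(indel_type("C", "CCTACAG"))        # labelled SVTYPE=DEL in the output
print(indel_type("C", "CCTGGAGCTGGCT"))  # ID says D:12
```

By this check both records are insertions as written, so for the `D:12` record the REF and ALT columns themselves look swapped, not only the label.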
Output of vep for insertion (well categorized as insertion):
Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
10:80432024:I:6 10:80432024-80432025 CTACAG ENSG00000122378 ENST00000372181 Transcript inframe_insertion 1085-1086 615-616 205-206 -/LQ -/CTACAG - IMPACT=MODERATE;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
10:80432024:I:6 10:80432024-80432025 CTACAG ENSG00000122378 ENST00000372185 Transcript inframe_insertion 750-751 582-583 194-195 -/LQ -/CTACAG
...
Output of vep for deletion (wrongly categorized as insertion):
Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
10:80509330:D:12 10:80509330-80509331 CTGGAGCTGGCT ENSG00000108219 ENST00000341863 Transcript intron_variant - - -- - - IMPACT=MODIFIER;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
10:80509330:D:12 10:80509330-80509331 CTGGAGCTGGCT ENSG00000108219 ENST00000372156 Transcript inframe_insertion 811-812 309-310 103-104 -/LELA -/CTGGAGCTGGCT - IMPACT=MODERATE;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
...
So basically, yes, if I'm correct, DEL positions should be shifted. Thanks again.
@vicruiser Hi again, Thanks for the report, I'll check this tomorrow (18H43 here)
Handling DEL or INS is quite difficult for me right now. There was a bug because I used just one character to get the reference allele. As a quick fix, DEL and INS records are now simply skipped without error.
Ok, I'll wait for the final update then. Thanks for your time.
In case someone needs it:
Since my knowledge of Java is zero, I used version c81f95b and fixed the deletions after the .vcf file was generated by swapping the REF and ALT columns with the following command:
awk '/D:/{ temp=$4; $4=$5 ; $5=temp } 1' OFS='\t' input.vcf > input_fixed.vcf
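For readers who prefer not to use awk, the same post-processing can be sketched in Python. This is a stricter variant, not an exact port: it assumes tab-separated records, leaves `#` header lines untouched, and only matches `:D:` in the ID column (the awk pattern matches `D:` anywhere on the line). The function name is hypothetical.

```python
def swap_ref_alt_for_deletions(lines):
    """Yield VCF lines, swapping REF and ALT where the ID contains ':D:'.

    Assumes tab-separated records; header lines are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        # Columns: CHROM, POS, ID, REF, ALT, ...
        if len(fields) >= 5 and ":D:" in fields[2]:
            fields[3], fields[4] = fields[4], fields[3]
        yield "\t".join(fields)


record = "10\t80509330\t10:80509330:D:12\tC\tCCTGGAGCTGGCT\t.\t.\tMORGAN=0.0;SVTYPE=INS"
for fixed in swap_ref_alt_for_deletions([record]):
    print(fixed)
```

Like the awk command, this only swaps the two allele columns; it does not change POS or the SVTYPE value in the INFO column.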
Subject of the issue
bim2vcf fails with a severe error.
Your environment
Steps to reproduce
java -jar jvarkit/dist/bim2vcf.jar -R Homo_sapiens.GRCh38.dna.primary_assembly.fa input.bim
input.bim looks like:
It seems that bim2vcf has problems dealing with alleles that are numbers.
Expected behaviour
To run fine.
Actual behaviour