lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
485 stars 133 forks source link

[INFO][Launcher]bim2vcf Exited with failure (-1) #144

Open vicruiser opened 4 years ago

vicruiser commented 4 years ago

Subject of the issue

bim2vcf experiments severe error.

Your environment

Steps to reproduce

java -jar jvarkit/dist/bim2vcf.jar -R Homo_sapiens.GRCh38.dna.primary_assembly.fa input.bim

input.bim looks like:

1   1:69081:G:C 0   69081   C   G
1   1:69149:T:A 0   69149   A   T
1   1:69224:A:T 0   69224   T   A
1   1:69270:A:G 0   69270   G   A
1   1:69281:T:C 0   69281   C   T
1   1:69334:C:T 0   69334   T   C
...
1   1:69610:C:T 0   69610   T   C
1   1:69620:D:1 0   69620   T   TA << here the problem appears
1   1:69634:T:A 0   69634   A   T
1   1:69640:G:C 0   69640   C   G
1   1:69655:G:C 0   69655   C   G
1   1:69697:T:C 0   69697   C   T
1   1:69702:D:1 0   69702   T   TG
1   1:69726:D:3 0   69726   T   TATC

It seems that bim2vcf has problems dealing with alleles that are numbers.

Expected behaviour

To run fine.

Actual behaviour

##fileformat=VCFv4.2
##INFO=<ID=MORGAN,Number=1,Type=Float,Description="Centimorgan">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Variation type">
##contig=<ID=1,length=248956422>
....
##contig=<ID=KI270392.1,length=971>
##contig=<ID=KI270394.1,length=970>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   69081   1:69081:G:C G   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69149   1:69149:T:A T   A   .   .   MORGAN=0.0;SVTYPE=SNV
1   69224   1:69224:A:T A   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69270   1:69270:A:G A   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69281   1:69281:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69334   1:69334:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69351   1:69351:C:G C   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69360   1:69360:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69366   1:69366:T:G T   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69369   1:69369:G:A G   A   .   .   MORGAN=0.0;SVTYPE=SNV
1   69428   1:69428:T:G T   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69438   1:69438:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69462   1:69462:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69486   1:69486:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69489   1:69489:A:C A   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69493   1:69493:G:A G   A   .   .   MORGAN=0.0;SVTYPE=SNV
1   69495   1:69495:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69496   1:69496:G:A G   A   .   .   MORGAN=0.0;SVTYPE=SNV
1   69499   1:69499:A:G A   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69511   1:69511:A:G A   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69511   1:69511:A:T A   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69521   1:69521:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69534   1:69534:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69536   1:69536:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69552   1:69552:G:C G   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69555   1:69555:T:G T   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69559   1:69559:G:A G   A   .   .   MORGAN=0.0;SVTYPE=SNV
1   69561   1:69561:G:T G   T   .   .   MORGAN=0.0;SVTYPE=SNV
1   69584   1:69584:A:G A   G   .   .   MORGAN=0.0;SVTYPE=SNV
1   69594   1:69594:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69603   1:69603:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69604   1:69604:T:C T   C   .   .   MORGAN=0.0;SVTYPE=SNV
1   69610   1:69610:C:T C   T   .   .   MORGAN=0.0;SVTYPE=SNV
[SEVERE][BimToVcf]not handled: 1    1:69620:D:1 0   69620   T   TA
[INFO][Launcher]bim2vcf Exited with failure (-1)
lindenb commented 4 years ago

I wrote this for a colleagues years ago. As far as I understand my program only handled bases with one character only. Give me a few minutes, I'll update the code for multiple nucleotides.

lindenb commented 4 years ago

@vicruiser i pushed https://github.com/lindenb/jvarkit/commit/c81f95bf192fd12ecc81a6b9ea159d419cd91caa , please tell me if that works.

Please not for this kind of deletion T/TA I didn't look at the POS in the VCF. But in the VCF, the position should be shifted https://samtools.github.io/hts-specs/VCFv4.2.pdf

Strings must include the base before the event (which must be reflected in the POS field),

so tell me if you think that the POS is wrong for an INS or a DEL.

vicruiser commented 4 years ago

It works now! Thanks. I'm going to check carefully the matter about DEL and INS and I'll get back to you.

vicruiser commented 4 years ago

Hi again,

I think something is wrong with the generated .vcf but not entirely. This is part of the output.vcf for an insertion:

input.bim
10    10:80432024:I:6    0    80432024    CCTACAG    C
output.vcf
10    80432024    10:80432024:I:6    C    CCTACAG    .    .    MORGAN=0.0;SVTYPE=DEL

and for a deletion:

input.bim
10    10:80509330:D:12    0    80509330    C    CCTGGAGCTGGCT
output.vcf
10    80509330    10:80509330:D:12    C    CCTGGAGCTGGCT    .    .MORGAN=0.0;SVTYPE=INS

I can see labels that are supposed to be deletions (e.g.: 1:69726:D:3) categorized as insertion (;SVTYPE=INS) and the other way around.

The reference allele column in my input.bim is the last one. I don't know if that is the problem or is just a problem with labeling.

Additionally, I've run VEP to annotate the .vcf generated with bim2vcf and indeed, something goes wrong with the positions but only for deletions.

Output of vep for insertion (well categorized as insertion):

Uploaded_variation  Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
10:80432024:I:6    10:80432024-80432025    CTACAG    ENSG00000122378    ENST00000372181    Transcript    inframe_insertion    1085-1086    615-616    205-206    -/LQ    -/CTACAG    -    IMPACT=MODERATE;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
10:80432024:I:6    10:80432024-80432025    CTACAG    ENSG00000122378    ENST00000372185    Transcript    inframe_insertion    750-751    582-583    194-195    -/LQ    -/CTACAG    
...

Output of vep for deletion (wrongly categorized as insertion):

Uploaded_variation  Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
10:80509330:D:12    10:80509330-80509331    CTGGAGCTGGCT    ENSG00000108219    ENST00000341863    Transcript    intron_variant    -    -    --    -    -    IMPACT=MODIFIER;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
10:80509330:D:12    10:80509330-80509331    CTGGAGCTGGCT    ENSG00000108219    ENST00000372156    Transcript    inframe_insertion    811-812    309-310    103-104    -/LELA    -/CTGGAGCTGGCT    -    IMPACT=MODERATE;STRAND=1;SOURCE=Homo_sapiens.GRCh38.98.sorted.gtf.gz
...

So basically, yes, if i'm correct, DEL positions should be shifted. Thanks again.

lindenb commented 4 years ago

@vicruiser Hi again, Thanks for the report, I'll check this tomorrow (18H43 here)

lindenb commented 4 years ago

handling DEL or INS is quite difficult for me now. There was a bug because I just use one character to get the reference allele. As a quick fix, The DEL or INS are just skipped without error now.

vicruiser commented 4 years ago

Ok, I'll wait for the final update then. Thanks for your time.

vicruiser commented 4 years ago

In case someone need it: Since my knowledge of java is zero, using this version c81f95b, I fixed the the positions of the deletions after the .vcf file was generated with the following command awk '/D:/{ temp=$4; $4=$5 ; $5=temp } 1' OFS='\t' input.vcf > input_fixed.vcf