lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
485 stars 133 forks source link

FixVcfMissingGenotypes can't find bams #37

Closed arundurvasula closed 8 years ago

arundurvasula commented 8 years ago

Hi! Thanks for writing this awesome tool, it's exactly what I need. I'm having some issues running it though. Specifically, I get a warning:

[WARNING/FixVcfMissingGenotypes] 2015-11-25 18:08:30 "No bam to fix sample M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"

I created a 2 sample vcf by merging 2 vcfs:

vcf-merge M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf.vcf.gz M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf.vcf.gz > Ait14-Ait9.vcf

Then I made a bam list:

find ./ -name "M-Ait*" > bams.txt

And I reheadered the vcf file to make sure the sample names match:

bcftools reheader -s bams.txt Ait14.Ait9.vcf > reheadered.vcf

Then I run FixVcfMissingGenotypes:

java -jar ~/Documents/Science/software/jvarkit/dist-1.139/fixvcfmissinggenotypes.jar -d 10 -f bams.txt reheadered.vcf > fixed.vcf

But I get the following error:

[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:22 "Starting JOB at Wed Nov 25 18:08:22 CET 2015 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=1518314284a5c8c26ae674daeb32ec3badf3b79c  built=2015-11-24:15-11-48"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:22 "Command Line args : -d 10 -f bams.txt reheadered.vcf"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:22 "Executing as arundurvasula@lyell.cibiv.univie.ac.at on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_79-b15"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:23 "Reading header for M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:24 "Reading header for M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
Nov 25, 2015 6:08:25 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIterator
INFO: reading from reheadered.vcf
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:25 "Adding 'java.io.tmpdir' directory to the list of tmp directories"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:25 "Sample: M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
[WARNING/FixVcfMissingGenotypes] 2015-11-25 18:08:25 "No bam to fix sample M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:30 "done sample M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam fixed=0 not-fixed=0 total=495705 genotypes"
Nov 25, 2015 6:08:30 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIteratorFromFile
INFO: reading vcf from /var/folders/xv/172vhqrn0t3d_6w0by980t8r0000gn/T/fixvcf8957134055780526493.vcf
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:30 "Sample: M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
[WARNING/FixVcfMissingGenotypes] 2015-11-25 18:08:30 "No bam to fix sample M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam"
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:32 "done sample M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam fixed=0 not-fixed=0 total=495705 genotypes"
Nov 25, 2015 6:08:32 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIteratorFromFile
INFO: reading vcf from /var/folders/xv/172vhqrn0t3d_6w0by980t8r0000gn/T/fixvcf3450591832525368561.vcf
[INFO/FixVcfMissingGenotypes] 2015-11-25 18:08:37 "End JOB status=0 [Wed Nov 25 18:08:37 CET 2015] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 0.25 minutes."

The bam files definitely exist (and are in the current directory):

ls -lh
total 10896016
-rw-r--r--  1 arundurvasula  staff    22M Nov 25 17:40 Ait14-Ait9.vcf
-rw-r--r--  1 arundurvasula  staff   3.2G Nov 24 16:23 M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam
-rw-r--r--  1 arundurvasula  staff   1.9G Nov 24 16:24 M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam
-rw-r--r--  1 arundurvasula  staff   173B Nov 25 18:08 bams.txt
-rw-r--r--  1 arundurvasula  staff   177B Nov 25 18:03 bams.txt~
-rw-r--r--  1 arundurvasula  staff    22M Nov 25 18:08 fixed.vcf
-rw-r--r--  1 arundurvasula  staff    22M Nov 25 17:46 refs.removed.vcf
-rw-r--r--  1 arundurvasula  staff    22M Nov 25 18:08 reheadered.vcf

But it looks like none of the genotypes were fixed. Any ideas? I noticed that the error gets called here: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java#L207 after a call from .isEmpty(). Perhaps it isn't reading in the bam files for some reason?

Thanks again for writing this tool!

lindenb commented 8 years ago

I don't see any error in the log you show, just logs/info

But are you sure the BAM contains the name of the sample in the header ?

show me a BAM header please and the CHROM line of the VCF

arundurvasula commented 8 years ago

Actually it's a warning, not an error. But it looks like it doesn't fill in any missing genotypes. Here's the output of samtools view -H:

$ samtools view -H M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam
@HD VN:1.4  SO:coordinate
@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
@SQ SN:chloroplast  LN:154478
@SQ SN:mitochondria LN:366924
@SQ SN:hapAC    LN:24696
@SQ SN:hapB LN:56105
@SQ SN:hapC LN:117359
@SQ SN:Ah03_C1  LN:12888
@SQ SN:Ah03_C2  LN:1223
@SQ SN:Ah03_C4  LN:3955
@SQ SN:Ah03_C5  LN:19265
@SQ SN:Ah03_C6  LN:5565
@SQ SN:Ah03_C8  LN:925
@SQ SN:Ah13_C1  LN:46029
@SQ SN:Ah13_C2  LN:1107
@SQ SN:Ah13_C3  LN:2137
@SQ SN:Ah13_C4  LN:13423
@SQ SN:Ah15_Ca1 LN:36209
@SQ SN:Ah15_Ca2 LN:9297
@SQ SN:Ah15_Ca3 LN:6049
@SQ SN:Ah15_Ca5 LN:26447
@SQ SN:Ah15_Ca6 LN:3026
@SQ SN:Ah15_Ca7 LN:6503
@SQ SN:Ah15_Ca8 LN:4372
@SQ SN:Ah15_Ca9 LN:6828
@SQ SN:Ah15_Cu2 LN:2968
@SQ SN:Ah15_Cu3 LN:26001
@SQ SN:Ah15_Cu4 LN:14428
@SQ SN:Ah20_C2  LN:28216
@SQ SN:Ah20_C3  LN:6401
@SQ SN:Ah20_C4  LN:4409
@SQ SN:Ah32_C1  LN:24522
@SQ SN:Ah32_C3  LN:2507
@SQ SN:Ah32_C4  LN:15146
@SQ SN:CapsellaRubella  LN:72853
@SQ SN:lyrata_S16   LN:99048
@SQ SN:lyrata_S38   LN:85472
@SQ SN:lyrata_S50   LN:115380
@SQ SN:lyrata_Sb    LN:116781
@SQ SN:Ah03C3_embl  LN:12754
@SQ SN:Ah13C5_embl  LN:25196
@SQ SN:Ah15Ca4_embl LN:9812
@SQ SN:Ah15Cu1_embl LN:41660
@SQ SN:Ah20C1_embl  LN:42542
@SQ SN:Ah28U1_embl  LN:68975
@SQ SN:Ah28U2_embl  LN:32554
@SQ SN:Ah28Ca1_embl LN:63904
@SQ SN:Ah28Ca3_embl LN:17474
@SQ SN:Ah32C2_embl  LN:25079
@SQ SN:Ah32C5_embl  LN:47589
@SQ SN:Ah43C3_embl  LN:3150
@SQ SN:Ah43C4_embl  LN:829
@SQ SN:Ah43C5_embl  LN:34958
@SQ SN:Ah43C6_embl  LN:5206
@SQ SN:Ah43C7_embl  LN:36265
@SQ SN:Ah43C8_embl  LN:4856
@SQ SN:Al01C1_embl  LN:50758
@SQ SN:Al01C2_embl  LN:30867
@SQ SN:Al14C1_embl  LN:71234
@SQ SN:Al14C2_embl  LN:23881
@SQ SN:Al18Ca1_embl LN:48782
@SQ SN:Al18Ca2_embl LN:23297
@SQ SN:Al18Ca3_embl LN:18188
@SQ SN:Al18Cu1_embl LN:21997
@SQ SN:Al18Cu2_embl LN:39655
@SQ SN:Al39C1_embl  LN:12578
@SQ SN:Al39C2_embl  LN:3366
@SQ SN:Al39C3_embl  LN:18739
@SQ SN:Al39C4_embl  LN:32763
@SQ SN:Al39C5_embl  LN:22614
@SQ SN:AhSRK12_EU075133.1   LN:260
@SQ SN:AhSRK21_EU075142.1   LN:260
@SQ SN:AhSRK27_EU878012.1   LN:260
@SQ SN:AkSRKD   LN:260
@SQ SN:AlSRK02_AY186763.1   LN:260
@SQ SN:AlSRK07_AY186765.1   LN:260
@SQ SN:AlSRK15_AY186771.1   LN:260
@SQ SN:AlSRK23_AF328997.2   LN:260
@SQ SN:AlSRK35_EU878021.1   LN:260
@SQ SN:AlSRK45_EU878026.1   LN:260
@PG ID:bwa  PN:bwa  VN:0.7.5a-r405
@PG ID:MarkDuplicates   PN:MarkDuplicates   VN:1.100(1571)  CL:net.sf.picard.sam.MarkDuplicates INPUT=[/scratch/myCallsIntermediate/22010/M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam] OUTPUT=/scratch/myCallsIntermediate/22010/M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam METRICS_FILE=pe.dupstat.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false

And the vcf line:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam  M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.sam.sortP.sam.duprem.sam.bam
lindenb commented 8 years ago

There is no @RG group in your BAM. See the SAM spec and http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups