Closed MaoYafei closed 6 years ago
the bam list is a list of bam and must be ended with '--' . I'll update the doc. Can you tell me if it fixes the problem:
java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt -- test.vcf>test_2_new.vcf
or
java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt < test.vcf>test_2_new.vcf
Thanks your reply^^, it seems like working now, but still have some problem in my case.
looks like below:
^C[yafei-mao@sango10513 mapping_new]$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt
Thanks again
your sample is 'bar.Adif'. Do you have a ReadGroup with sample =bar.Adif in the BAM ?
I think so. I do the name for bam file, looking like: java -jar picard-tools-2.1.0/picard.jar AddOrReplaceReadGroups I=Adif_group_recal_reads.bam O=bar.Adif.bam RGID=4 RGLB=library1 RGPL=illumina RGPU=unit1 RGSM=bar.Adif
hum, just checking, can you please change the name of bam_missing.txt
to bam_missing.list
please
looks like working now, but still running. I will update how is going^^ [yafei-mao@sango10513 mapping_new]$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.list -- test.vcf>test_2_new.vcf [main] INFO jvarkit - Starting JOB at Fri Mar 11 17:13:27 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=2b998754ad124603692f68473c4559d8235c3c1c built=2016-03-09:09-03-28 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.list -- test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango10513 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 [main] INFO jvarkit - Reading header for bar.Adif.bam [main] INFO jvarkit - Reading header for bar.Aech.bam [main] INFO jvarkit - Reading header for bar.Agem.bam [main] INFO jvarkit - Reading header for bar.Asub.bam Mar 11, 2016 5:13:27 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIterator INFO: reading from test.vcf [main] INFO jvarkit - Adding 'java.io.tmpdir' directory to the list of tmp directories [main] INFO jvarkit - Sample: bar.Adif [main] INFO jvarkit - Opening bar.Adif.bam
Thanks a lot^^
Yafei
Sorry for bothering again, one more curious, how long usually it takes for whole running? my running still is on 'Opening bar.Adif.bam', how could I check it is running well or not?
it depends of the size of the VCF. There is usually a tmp vcf file that is created (see the LOG), you could see if it is growing or not.
I have 10G VCF file, and i did not log file in my directory and /tmp directory. Also, in my case, test_2_new.vcf has no changed, still nothing.
test_2_new.vcf is created at the end. 10G is large: for each genotype, there is a random access in the BAM file: for a large VCF that will be a very long process.
I've updated the code and added a progress log : https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java#L136
then, you should see the speed of the program.
I got it, but it seems like that it will take long time=-=, java -jar jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.list -- test.vcf >test_3_new.vcf [main] INFO jvarkit - Starting JOB at Fri Mar 11 20:39:53 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=894f5cbfc8e8c137aa41f974497cee3d82227c6b built=2016-03-11:20-03-12 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.list -- test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango11005 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 [main] INFO jvarkit - Reading header for ./bar.Adif.bam [main] INFO jvarkit - Reading header for ./bar.Aech.bam [main] INFO jvarkit - Reading header for ./bar.Agem.bam [main] INFO jvarkit - Reading header for ./bar.Asub.bam Mar 11, 2016 8:39:53 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIterator INFO: reading from test.vcf [main] INFO jvarkit - Adding 'java.io.tmpdir' directory to the list of tmp directories [main] INFO jvarkit - Sample: bar.Adif [main] INFO jvarkit - Opening ./bar.Adif.bam [main] INFO jvarkit - Count: 2379 Elapsed: 10 seconds(0.01%) Remains: 1 day(99.99%) Last: sc0000012:37392 [main] INFO jvarkit - Count: 4483 Elapsed: 20 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:65352 [main] INFO jvarkit - Count: 6795 Elapsed: 30 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:89880 [main] INFO jvarkit - Count: 8538 Elapsed: 40 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:107312
BTW, does it show that reading bam file will take 1 day or whole running will take 1 day ?
Plus, I think you are pretty professional at mapping reads to ref genome filed. Do you mind giving some suggestions for my study?
Actually, I have five very high coverage species, each of them have more than 100X coverage including mate pair data. Because I need to do de novo assembly for each of them. And also, I mapping four species reads to one outgroup species which have already assembled. I follow below pipeline:
bwa index Aten.fa
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 24 Aten.fa Adif_R1_trim.fastq Adif_R2_trim.fastq > Adif.sam [do for each of four respectively]
samtools view -bS Adif.sam -o Adif.bam
samtools fixmate -O bam Adif.sam Adif_fixmate.bam
samtools sort -@ 24 -O bam -o Adif_sorted.bam -T /tmp/Adif_temp Adif_fixmate.bam
samtools faidx Aten.fa
java -jar picard-tools-1.70/CreateSequenceDictionary.jar REFERENCE=Aten.fa OUTPUT=Aten.dict
cp picard-tools-2.1.0/libIntelDeflater.so .
java -jar picard-tools-2.1.0/picard.jar MarkDuplicates INPUT=Adif_sorted.bam OUTPUT=Adif_DM_sorted.bam METRICS_FILE=Adif.bam.metrics
samtools index Adif_DM_sorted.bam
java -jar picard-tools-2.1.0/picard.jar CreateSequenceDictionary R= Aten.fa O= Aten.dict
samtools faidx Aten.fa
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Aten.fa -I Adif_DM_sorted.bam -o Adif_realignment_targets.list
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R Aten.fa -I Adif_DM_sorted.bam -targetIntervals Adif_realignment_targets.list -o Adif_realigned_reads.bam
java -jar GenomeAnalysisTK.jar -R Aten.fa -T UnifiedGenotyper -I Adif_realigned_reads.bam -o Adif_gatk.raw.vcf -stand_call_conf 30.0 -stand_emit_conf 0 -glm BOTH -rf BadCigar
samtools mpileup -ugf Aten.fa Adif_realigned_reads.bam | bcftools call -vmO v -o Adif.samtools.raw.vcf
java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T SelectVariants --variant Adif_gatk.raw.vcf --concordance Adif.samtools.raw.vcf -o Adif.concordance.raw.vcf
java -jar GenomeAnalysisTK.jar -T CombineVariants -R Aten.fa --variant Adif.concordance.raw.vcf --variant Adif.concordance.raw.vcf .... -o output.vcf -genotypeMergeOptions UNIQUIFY
java -jar GenomeAnalysisTK.jar -T SelectVariants -nt 24 -R Aten.fa -V five_combine.vcf -selectType SNP -o five_combine_snp.vcf
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R Aten.fa -V five_combine_snp.vcf --filterExpression "QD < 20.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -4 || ReadPosRankSum < -2.0 || HaplotypeScore > 10.0" --filterName "five_combine_snp_filter" -o five_combine_filtered_snps.vcf
And I use this five_combine_filtered_snps.vcf for fixvcfmissinggenotypes^^
Also, I tried to follow best recommendation in GATK: doing BQSR
java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T VariantFiltration --filterExpression " QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 " --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant Adif.concordance.raw.vcf --logging_level ERROR -o Adif.concordance.flt.vcf grep -v "Filter" Adif.concordance.flt.vcf > Adif.confidence.raw.vcf java -jar picard-tools-2.1.0/picard.jar AddOrReplaceReadGroups I=Adif_realigned_reads.bam O=Adif_group_realigned_reads.bam RGID=4 RGLB=library1 RGPL=illumina RGPU=unit1 RGSM=Adif samtools index Adif_group_realigned_reads.bam java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Adif_group_realigned_reads.bam -knownSites Adif.confidence.raw.vcf -o Adif_recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Adif_group_realigned_reads.bam -BQSR Adif_recal_data.table -o Adif_recal_reads.bam java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct 24 -R Aten.fa -I Adif_recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o Adif_raw_variants.vcf -ERC GVCF java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R Aten.fa -nt 24 -V Adif_raw_variants.g.vcf .... -o teds.vcf
but I follow this recommendation pipeline, I got very low base quality for Adif_recal_reads.bam (no qscore greater than 20). I think I donot have good known SNP data to do BQSR so that I got low base quality bam, and also I found some researchers think that if you had high coverage and illumina data, you do not have to do BQSR.
My purpose for doing mapping is that I could get SNP data (every site should contain all of species, no missing data) to do downstream analysis. And then, do you think my first pipeline (only do realign and choose concordance SNP from gatk and samtool, and filter low quality site, and combine them together) is enough for my purpose?
Sorry for many questions, Thanks again^^
show that reading bam file will take 1 day or whole running will take 1 day ?
one day per SAMPLE.
again: the BAM index is used for each position in the VCF . there is a lot of I/O and It could take a huge time.
I think you are pretty professional at mapping reads to ref genome filed. Do you mind giving some suggestions for my study?
please ask biostars.org
Subject of the issue
I have already compile successfully for fixvcfmissinggenotypes, i think^^. And I try to run it on my linux cluster, but I found it is stuck in reading vcf from stream for two days. But I can not figure out by myself.
Your environment
It is recent release version, directly download from github
/apps/free/java-jdk/1.8.0_20/bin/java
${JAVA_HOME}
/apps/free/java-jdk/1.8.0_20/bin/java
Linux version 2.6.32-431.el6.x86_64
Steps to reproduce
$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt test.vcf>test_2_new.vcf [main] INFO jvarkit - Starting JOB at Thu Mar 10 09:11:40 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=2b998754ad124603692f68473c4559d8235c3c1c built=2016-03-09:09-03-28 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.txt test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango10513 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 [main] INFO jvarkit - Reading header for test.vcf DEBUG 2016-03-10 09:11:40 SRAAccession Checking if SRA module is supported in that environment [main] INFO jvarkit - Reading header for bam_missing.txt Mar 10, 2016 9:11:43 AM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIteratorFromStream INFO: reading vcf from stream
Could you give some idea to help me figure out? Thanks a lot~
Yafei