lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
482 stars 133 forks source link

fixvcfmissinggenotypes stuck at reading vcf from stream #48

Closed MaoYafei closed 6 years ago

MaoYafei commented 8 years ago

Subject of the issue

I have already compile successfully for fixvcfmissinggenotypes, i think^^. And I try to run it on my linux cluster, but I found it is stuck in reading vcf from stream for two days. But I can not figure out by myself.

Your environment

It is recent release version, directly download from github

/apps/free/java-jdk/1.8.0_20/bin/java

/apps/free/java-jdk/1.8.0_20/bin/java

Linux version 2.6.32-431.el6.x86_64

Steps to reproduce

$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt test.vcf>test_2_new.vcf [main] INFO jvarkit - Starting JOB at Thu Mar 10 09:11:40 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=2b998754ad124603692f68473c4559d8235c3c1c built=2016-03-09:09-03-28 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.txt test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango10513 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 [main] INFO jvarkit - Reading header for test.vcf DEBUG 2016-03-10 09:11:40 SRAAccession Checking if SRA module is supported in that environment [main] INFO jvarkit - Reading header for bam_missing.txt Mar 10, 2016 9:11:43 AM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIteratorFromStream INFO: reading vcf from stream

Could you give some idea to help me figure out? Thanks a lot~

Yafei

lindenb commented 8 years ago

the bam list is a list of bam and must be ended with '--' . I'll update the doc. Can you tell me if it fixes the problem:

 java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt -- test.vcf>test_2_new.vcf

or

 java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt  < test.vcf>test_2_new.vcf
MaoYafei commented 8 years ago

Thanks your reply^^, it seems like working now, but still have some problem in my case. looks like below: ^C[yafei-mao@sango10513 mapping_new]$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.txt test_2_new.vcf [main] INFO jvarkit - Starting JOB at Fri Mar 11 16:50:36 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=2b998754ad124603692f68473c4559d8235c3c1c built=2016-03-09:09-03-28 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.txt -- [main] INFO jvarkit - Executing as yafei-mao@sango10513 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 [main] INFO jvarkit - Reading header for bam_missing.txt DEBUG 2016-03-11 16:50:36 SRAAccession Checking if SRA module is supported in that environment Mar 11, 2016 4:50:37 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIteratorFromStream INFO: reading vcf from stream [main] INFO jvarkit - Adding 'java.io.tmpdir' directory to the list of tmp directories [main] INFO jvarkit - Sample: bar.Adif [main] WARN jvarkit - No bam to fix sample bar.Adif ^C[yafei-mao@sango10513 mapping_new]$ ll bar -rw-r--r-- 1 yafei-mao * 58553715363 Mar 8 20:57 bar.Adif.bam -rw-r--r-- 1 yafei-mao * 1890840 Mar 10 08:58 bar.Adif.bam.bai -rw-r--r-- 1 yafei-mao * 38256074088 Mar 8 20:03 bar.Aech.bam -rw-r--r-- 1 yafei-mao * 1592712 Mar 10 08:54 bar.Aech.bam.bai -rw-r--r-- 1 yafei-mao* 35346906206 Mar 8 19:59 bar.Agem.bam -rw-r--r-- 1 yafei-mao * 1606104 Mar 10 08:54 bar.Agem.bam.bai -rw-r--r-- 1 yafei-mao * 34751975488 Mar 8 19:59 bar.Asub.bam -rw-r--r-- 1 yafei-mao \ 1603176 Mar 10 08:54 bar.Asub.bam.bai [yafei-mao@sango10513 mapping_new]$ less bam_missing.txt [yafei-mao@sango10513 mapping_new]$ cat bam_missing.txt ./bar.Agem.bam ./bar.Asub.bam ./bar.Adif.bam ./bar.Aech.bam

Thanks again

lindenb commented 8 years ago

your sample is 'bar.Adif'. Do you have a ReadGroup with sample =bar.Adif in the BAM ?

MaoYafei commented 8 years ago

I think so. I do the name for bam file, looking like: java -jar picard-tools-2.1.0/picard.jar AddOrReplaceReadGroups I=Adif_group_recal_reads.bam O=bar.Adif.bam RGID=4 RGLB=library1 RGPL=illumina RGPU=unit1 RGSM=bar.Adif

lindenb commented 8 years ago

hum, just checking, can you please change the name of bam_missing.txt to bam_missing.list please

MaoYafei commented 8 years ago

looks like working now, but still running. I will update how is going^^ [yafei-mao@sango10513 mapping_new]$ java -jar /work/student/yafei-mao/jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.list -- test.vcf>test_2_new.vcf [main] INFO jvarkit - Starting JOB at Fri Mar 11 17:13:27 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=2b998754ad124603692f68473c4559d8235c3c1c built=2016-03-09:09-03-28 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.list -- test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango10513 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 [main] INFO jvarkit - Reading header for bar.Adif.bam [main] INFO jvarkit - Reading header for bar.Aech.bam [main] INFO jvarkit - Reading header for bar.Agem.bam [main] INFO jvarkit - Reading header for bar.Asub.bam Mar 11, 2016 5:13:27 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIterator INFO: reading from test.vcf [main] INFO jvarkit - Adding 'java.io.tmpdir' directory to the list of tmp directories [main] INFO jvarkit - Sample: bar.Adif [main] INFO jvarkit - Opening bar.Adif.bam

Thanks a lot^^

Yafei

MaoYafei commented 8 years ago

Sorry for bothering again, one more curious, how long usually it takes for whole running? my running still is on 'Opening bar.Adif.bam', how could I check it is running well or not?

lindenb commented 8 years ago

it depends of the size of the VCF. There is usually a tmp vcf file that is created (see the LOG), you could see if it is growing or not.

MaoYafei commented 8 years ago

I have 10G VCF file, and i did not log file in my directory and /tmp directory. Also, in my case, test_2_new.vcf has no changed, still nothing.

lindenb commented 8 years ago

test_2_new.vcf is created at the end. 10G is large: for each genotype, there is a random access in the BAM file: for a large VCF that will be a very long process.

I've updated the code and added a progress log : https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java#L136

then, you should see the speed of the program.

MaoYafei commented 8 years ago

I got it, but it seems like that it will take long time=-=, java -jar jvarkit/dist-2.0.1/fixvcfmissinggenotypes.jar -d 20 -f bam_missing.list -- test.vcf >test_3_new.vcf [main] INFO jvarkit - Starting JOB at Fri Mar 11 20:39:53 JST 2016 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=894f5cbfc8e8c137aa41f974497cee3d82227c6b built=2016-03-11:20-03-12 [main] INFO jvarkit - Command Line args : -d 20 -f bam_missing.list -- test.vcf [main] INFO jvarkit - Executing as yafei-mao@sango11005 on Linux 2.6.32-431.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 [main] INFO jvarkit - Reading header for ./bar.Adif.bam [main] INFO jvarkit - Reading header for ./bar.Aech.bam [main] INFO jvarkit - Reading header for ./bar.Agem.bam [main] INFO jvarkit - Reading header for ./bar.Asub.bam Mar 11, 2016 8:39:53 PM com.github.lindenb.jvarkit.util.vcf.VCFUtils createVcfIterator INFO: reading from test.vcf [main] INFO jvarkit - Adding 'java.io.tmpdir' directory to the list of tmp directories [main] INFO jvarkit - Sample: bar.Adif [main] INFO jvarkit - Opening ./bar.Adif.bam [main] INFO jvarkit - Count: 2379 Elapsed: 10 seconds(0.01%) Remains: 1 day(99.99%) Last: sc0000012:37392 [main] INFO jvarkit - Count: 4483 Elapsed: 20 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:65352 [main] INFO jvarkit - Count: 6795 Elapsed: 30 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:89880 [main] INFO jvarkit - Count: 8538 Elapsed: 40 seconds(0.02%) Remains: 1 day(99.98%) Last: sc0000012:107312

BTW, does it show that reading bam file will take 1 day or whole running will take 1 day ?

Plus, I think you are pretty professional at mapping reads to ref genome filed. Do you mind giving some suggestions for my study?

Actually, I have five very high coverage species, each of them have more than 100X coverage including mate pair data. Because I need to do de novo assembly for each of them. And also, I mapping four species reads to one outgroup species which have already assembled. I follow below pipeline:

bwa index Aten.fa bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 24 Aten.fa Adif_R1_trim.fastq Adif_R2_trim.fastq > Adif.sam [do for each of four respectively] samtools view -bS Adif.sam -o Adif.bam
samtools fixmate -O bam Adif.sam Adif_fixmate.bam samtools sort -@ 24 -O bam -o Adif_sorted.bam -T /tmp/Adif_temp Adif_fixmate.bam samtools faidx Aten.fa java -jar picard-tools-1.70/CreateSequenceDictionary.jar REFERENCE=Aten.fa OUTPUT=Aten.dict cp picard-tools-2.1.0/libIntelDeflater.so . java -jar picard-tools-2.1.0/picard.jar MarkDuplicates INPUT=Adif_sorted.bam OUTPUT=Adif_DM_sorted.bam METRICS_FILE=Adif.bam.metrics samtools index Adif_DM_sorted.bam java -jar picard-tools-2.1.0/picard.jar CreateSequenceDictionary R= Aten.fa O= Aten.dict samtools faidx Aten.fa java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Aten.fa -I Adif_DM_sorted.bam -o Adif_realignment_targets.list java -jar GenomeAnalysisTK.jar -T IndelRealigner -R Aten.fa -I Adif_DM_sorted.bam -targetIntervals Adif_realignment_targets.list -o Adif_realigned_reads.bam java -jar GenomeAnalysisTK.jar -R Aten.fa -T UnifiedGenotyper -I Adif_realigned_reads.bam -o Adif_gatk.raw.vcf -stand_call_conf 30.0 -stand_emit_conf 0 -glm BOTH -rf BadCigar samtools mpileup -ugf Aten.fa Adif_realigned_reads.bam | bcftools call -vmO v -o Adif.samtools.raw.vcf java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T SelectVariants --variant Adif_gatk.raw.vcf --concordance Adif.samtools.raw.vcf -o Adif.concordance.raw.vcf java -jar GenomeAnalysisTK.jar -T CombineVariants -R Aten.fa --variant Adif.concordance.raw.vcf --variant Adif.concordance.raw.vcf .... -o output.vcf -genotypeMergeOptions UNIQUIFY java -jar GenomeAnalysisTK.jar -T SelectVariants -nt 24 -R Aten.fa -V five_combine.vcf -selectType SNP -o five_combine_snp.vcf java -jar GenomeAnalysisTK.jar -T VariantFiltration -R Aten.fa -V five_combine_snp.vcf --filterExpression "QD < 20.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -4 || ReadPosRankSum < -2.0 || HaplotypeScore > 10.0" --filterName "five_combine_snp_filter" -o five_combine_filtered_snps.vcf

And I use this five_combine_filtered_snps.vcf for fixvcfmissinggenotypes^^

Also, I tried to follow best recommendation in GATK: doing BQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T VariantFiltration --filterExpression " QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 " --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant Adif.concordance.raw.vcf --logging_level ERROR -o Adif.concordance.flt.vcf grep -v "Filter" Adif.concordance.flt.vcf > Adif.confidence.raw.vcf java -jar picard-tools-2.1.0/picard.jar AddOrReplaceReadGroups I=Adif_realigned_reads.bam O=Adif_group_realigned_reads.bam RGID=4 RGLB=library1 RGPL=illumina RGPU=unit1 RGSM=Adif samtools index Adif_group_realigned_reads.bam java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Adif_group_realigned_reads.bam -knownSites Adif.confidence.raw.vcf -o Adif_recal_data.table

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Adif_group_realigned_reads.bam -knownSites Adif.confidence.raw.vcf -BQSR Adif_recal_data.table -o Adif_post_recal_data.table

java -jar GenomeAnalysisTK.jar -csv Adif_my_report.csv -T AnalyzeCovariates -l DEBUG -R Aten.fa -before Adif_recal_data.table -after Adif_post_recal_data.table -plots Adif_recalibration_plots.pdf

/usr/bin/Rscript BQSR.R Adif_my_report.csv Adif_recal_data.table Adif_recalibration_plots.pdf

java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Adif_group_realigned_reads.bam -BQSR Adif_recal_data.table -o Adif_recal_reads.bam java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct 24 -R Aten.fa -I Adif_recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o Adif_raw_variants.vcf -ERC GVCF java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R Aten.fa -nt 24 -V Adif_raw_variants.g.vcf .... -o teds.vcf

but I follow this recommendation pipeline, I got very low base quality for Adif_recal_reads.bam (no qscore greater than 20). I think I donot have good known SNP data to do BQSR so that I got low base quality bam, and also I found some researchers think that if you had high coverage and illumina data, you do not have to do BQSR.

My purpose for doing mapping is that I could get SNP data (every site should contain all of species, no missing data) to do downstream analysis. And then, do you think my first pipeline (only do realign and choose concordance SNP from gatk and samtool, and filter low quality site, and combine them together) is enough for my purpose?

Sorry for many questions, Thanks again^^

lindenb commented 8 years ago

show that reading bam file will take 1 day or whole running will take 1 day ?

one day per SAMPLE.

again: the BAM index is used for each position in the VCF . there is a lot of I/O and It could take a huge time.

lindenb commented 8 years ago

I think you are pretty professional at mapping reads to ref genome filed. Do you mind giving some suggestions for my study?

please ask biostars.org