samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
652 stars 240 forks source link

bcftools consensus files only has reference sequence in it #934

Open kdbchau opened 5 years ago

kdbchau commented 5 years ago

After running bwa index on my ref_geneA.fa file and running bwa mem on the ref and different species exome data, I then used samtools to sort and index the output files. I am now trying to get just the consensus sequence from the aln.sorted.bam files for each species, 1 at a time (so I can later align them all using MAFFT to create one MSA).

Here is my command that I am stuck on (the versions of samtools and bcftools are both 1.9).

samtools mpileup -uf ref_geneA.fa species1.aln.sorted.bam | bcftools call -mv -Oz -o species1.calls.vcf.gz tabix species1.calls.vcf.gz cat ref_geneA.fa | bcftools consensus species1.calls.vcf.gz > consensus.fa

When I open the consensus.fa it only has the sequence from ref_geneA.fa, and nothing else. Why is this happening?

bvaldebenitom commented 5 years ago

Hi balasink,

I have noticed the same issue. It seems to be an issue with executing "tabix" immediately after "call". I redid the "tabix" step and that seemed to solve the problem.

Anyhow, I'm trying to get a reproducible small test case to report this.

finswimmer commented 5 years ago

Hello,

samtools mpileupis deprecated. Use bcftools mpileup instead. Also -u is deprecated. Use -Ou instead.

Make sure you didn't receive any error messages. Otherwise your vcf file will just contain the header. Does your vcf file contain any variant?

fin swimmer

kdbchau commented 5 years ago

Hey so I tried it out and nothing worked I am still getting a consensus file that is empty.

When I viewed my species bam file with the ref.fa in samtools tview, I can see all the mapped reads so I know it's there...I just cannot get the consensus sequence.

I have tried so many versions of the mpileup codes but it's not generating.

The ones I've tried (the one below does generate a calls.vcf.gz file but the size seems small, but it could be because I am only looking at one gene)

bcftools mpileup -Ou ref_geneA.fa sp1_geneA_sorted.bam | bcftools call -mv -Oz -o calls.vcf.gz
tabix calls.vcf.gz
cat ref_geneA.fa | bcftools consensus calls.vcf.gz > consensus.fa

This generates a consensus.fa that only has the ref_geneA.fa sequence in it, nothing from calls.vcf.gz.


From: finswimmer notifications@github.com Sent: December 17, 2018 2:11 AM To: samtools/bcftools Cc: Katherine Balasingham; Author Subject: Re: [samtools/bcftools] bcftools consensus files only has reference sequence in it (#934)

Hello,

samtools mpileupis deprecated. Use bcftools mpileup instead. Also -u is deprecated. Use -Ou instead.

Make sure you didn't receive any error messages. Otherwise your vcf file will just contain the header. Does your vcf file contain any variant?

fin swimmer

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsamtools%2Fbcftools%2Fissues%2F934%23issuecomment-447733239&data=02%7C01%7C%7C06f79c454f8b499c8ba708d663e66b2f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636806238618034821&sdata=E9qxTQXg28DH7Ir2000uyRXXoOqSG9eF%2BsUfI8TOvvI%3D&reserved=0, or mute the threadhttps://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAXSIkNoNJ7rMd6-ihkVDeHIzjr57cd6Iks5u5zV0gaJpZM4ZUhb4&data=02%7C01%7C%7C06f79c454f8b499c8ba708d663e66b2f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636806238618034821&sdata=x%2FYUQbnhWPyRP1Ub6wKwv1b8zd0PGYTs3rDPKSrilgY%3D&reserved=0.

kdbchau commented 5 years ago

Also whenever I do run the following command I just get the whole bcftools help output instead of any error message or idea of what to do.

bcftools consensus -f ref_geneA.fa -s calls.vcf.gz -o consensus.fa

About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools 
mpileup in future.
       --sample (and, optionally, --haplotype) option will apply genotype
.
.
.

And the rest of the bcftools tooltip.

pd3 commented 5 years ago

There is a brief howto here http://samtools.github.io/bcftools/howtos/consensus-sequence.html. Check what your file calls.vcf.gz contains.

finswimmer commented 5 years ago

bcftools consensus -f ref_geneA.fa -s calls.vcf.gz -o consensus.fa

The -s parameter expect a sample name as a value and not the vcf file. If you just have one sample in the vcf file, you don't need to specify the sample name, whose variants should be used for building consensus.

fin swimmer

kdbchau commented 5 years ago

But in your example, -s is calling the vcf file. If I had multiple samples in the vcf file would it just be: -s Sp1 calls.vcf.gz ?


From: finswimmer notifications@github.com Sent: December 18, 2018 4:22 AM To: samtools/bcftools Cc: Katherine Balasingham; Author Subject: Re: [samtools/bcftools] bcftools consensus files only has reference sequence in it (#934)

bcftools consensus -f ref_geneA.fa -s calls.vcf.gz -o consensus.fa

The -s parameter expect a sample name as a value and not the vcf file. If you just have one sample in the vcf file, you don't need to specify the sample name, whose variants should be used for building consensus.

fin swimmer

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsamtools%2Fbcftools%2Fissues%2F934%23issuecomment-448135529&data=02%7C01%7C%7C0c3ecc56503547b0a3f708d664c1f051%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636807181447482606&sdata=w67An0HOSGxIe2U8KuKOP1Tbritr7CksX%2Be0inBzF%2Bc%3D&reserved=0, or mute the threadhttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAXSIkIccuQXTYQoPlWT5D4f54js96Gyqks5u6KW_gaJpZM4ZUhb4&data=02%7C01%7C%7C0c3ecc56503547b0a3f708d664c1f051%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636807181447482606&sdata=WLLEtnZ9gfXFXhXN6yO2v8yLTlK%2BlR5ydNEQXjkUEAQ%3D&reserved=0.

finswimmer commented 5 years ago

Hello @balasink ,

this wasn't my example. I copied it from you ;)

Yes, if you have a sample named Sp1 in your multisample vcf, you can use -s Sp1

fin swimmer

kdbchau commented 5 years ago

Ohh omg i just realized that it is creating a file for my sample but it's using the header from the reference sample....lol!

Also, how do you make the consensus sequence for the sample not merge with the reference? Based on the samtools tview I can see that there were areas where my sp1 did not have coverage in some regions but consensus is filling in those gaps with the reference sequence. Is there a way to just get the consensus of sp1 sample merging all the contigs?


From: finswimmer notifications@github.com Sent: December 19, 2018 12:57 AM To: samtools/bcftools Cc: Katherine Balasingham; Mention Subject: Re: [samtools/bcftools] bcftools consensus files only has reference sequence in it (#934)

Hello @balasinkhttps://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fbalasink&data=02%7C01%7C%7Ced6456234dd84d9e792408d6656e8949%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636807922752680880&sdata=3UbMvpeXDdB%2F2tTsJF6GfUAlXppgI3N9K7s27L24EtE%3D&reserved=0 ,

this wasn't my example. I copied it from you ;)

Yes, if you have a sample named Sp1 in your multisample vcf, you can use -s Sp1

fin swimmer

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsamtools%2Fbcftools%2Fissues%2F934%23issuecomment-448469706&data=02%7C01%7C%7Ced6456234dd84d9e792408d6656e8949%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636807922752680880&sdata=LDvsWeegW8et9hCxOcTpTFIMTTE%2FKX52veWe3obbtD0%3D&reserved=0, or mute the threadhttps://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAXSIkAKDJXEHUBfy5LyPBWgDU3m6fKQ-ks5u6cdRgaJpZM4ZUhb4&data=02%7C01%7C%7Ced6456234dd84d9e792408d6656e8949%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636807922752680880&sdata=Wgwd6LAk535xUOU9TY7NWlmhihKWnNjuA03cTEnJ5zY%3D&reserved=0.

pd3 commented 5 years ago

This is currently not possible.

The feature you requested could be in principle supported if gVCF was given on input. Otherwise the program would have no way of knowing if sites absent from the VCF are missing because of lack of coverage or because they are non-variant.

kdbchau commented 5 years ago

This is currently not possible.

The feature you requested could be in principle supported if gVCF was given on input. Otherwise the program would have no way of knowing if sites absent from the VCF are missing because of lack of coverage or because they are non-variant.

Sorry but what is gVCF? My end goal is to have a consensus of just the species1.BAM file, and if it had variants at a site to show it, and if it had no coverage at a site, to then show that as a gap. I don't want any bases from the reference.fasta because in the end I will be creating a phylogeny based on the sequences so I cannot have reference bases incorporated into all the species. It will produce incorrect results.

finswimmer commented 5 years ago

See here.

jonwhit commented 4 years ago

@balasink Did you ever figure out a solution to this problem? I've been looking for the same solution to generate a fasta file for each sample (individual/species) with N's filled in gaps to use in phylogeny. Thanks in advance for any help with your solution. Thanks.