sr320 / course-fish546-2016


Alignment file types #85

Closed sr320 closed 7 years ago

sr320 commented 7 years ago

What is the difference between the SAM and BAM file types? Which one could you use to find variants (i.e., SNPs), and what would an example command look like?

What is one means by which the text indicates this can be visualized?

yaaminiv commented 7 years ago

SAM and BAM files both provide alignment data, with BAM files being the binary analog of SAM files. The difference between the two is that SAM files are plain text, while BAM files are compressed binary; both contain the same header and alignment records.

You can use BAM files to find SNPs, using samtools and bcftools.

$ samtools mpileup -v --region
(Generate genotype likelihoods for specified sites in the genome; -v: generate results in variant call format, --region: specify region to generate likelihoods)

$ bcftools call -v
(Filter results so only variant sites remain)

According to the text, you can visualize SAM or BAM files by using samtools tview.

MeganEDuffy commented 7 years ago

SAM and BAM are both alignment file formats. SAM and BAM are designed to contain the same information, but there are some important differences in their design: SAM is a text format for storing sequence data in a series of tab-delimited ASCII columns. It's pretty human readable and is easier to process by conventional text based processing programs like awk, sed, python, cut, etc.
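Since SAM is tab-delimited text, a tiny hand-written file is enough to show the kind of processing described above. This is only a sketch: the read names, the reference name chr1, and the positions are all made up for illustration.

```shell
# Build a tiny, made-up SAM file: one header line plus two alignment records.
# Read names, reference name, and positions are invented for illustration.
sam_file=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$sam_file"
printf 'read1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> "$sam_file"
printf 'read2\t16\tchr1\t205\t60\t8M\t*\t0\t0\tTTGGCCAA\tIIIIIIII\n' >> "$sam_file"

# Skip header lines (they start with "@") and print QNAME, RNAME, POS
# (columns 1, 3, and 4 of each alignment record).
sam_columns=$(awk -F'\t' '!/^@/ { print $1, $3, $4 }' "$sam_file")
echo "$sam_columns"
```

The same one-liner would not work on a BAM file directly; it would first have to be printed as text, for example with samtools view.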

The BAM format stores the same data in a compressed, indexed, binary form. The two formats also have different coordinate systems: SAM is 1-indexed, BAM is 0-indexed (like the annotation/track format BED).
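In practice the coordinate difference shows up mostly when moving between SAM and a 0-based format like BED, since samtools prints 1-based positions again when it writes a BAM out as SAM. Here is a minimal sketch of that conversion, assuming a made-up record whose CIGAR is a plain match so the aligned length equals the read length.

```shell
# Convert a SAM alignment record to a BED interval.
# Assumption: the CIGAR is a plain match (here 8M), so the aligned length
# equals the length of the SEQ column; indels or clipping would need CIGAR parsing.
# The record itself is made up for illustration.
sam_record=$(printf 'read1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII')

# SAM POS (column 4) is 1-based; BED start is 0-based and BED end is exclusive.
bed_interval=$(printf '%s\n' "$sam_record" |
  awk -F'\t' -v OFS='\t' '{ start = $4 - 1; print $3, start, start + length($10), $1 }')
echo "$bed_interval"
```

A read starting at SAM position 101 therefore starts at BED position 100.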

You would use BAM to identify variants like SNPs using samtools and its companion tool bcftools. First, you use samtools mpileup to generate genotype likelihoods for every site in the genome or for all sites in a specified region. Besides the BAM file, you also input a reference genome in FASTA format through the --fasta-ref (or -f) option so samtools mpileup knows each reference base. The output is either tab-delimited VCF or binary BCF, selected with the -v or -g argument, respectively. Then bcftools call filters the results to leave only variant sites and calls genotypes at these sites. Example:

$ samtools mpileup -v --region 1:madeup \
--fasta-ref pmarinus.fasta pmarinus_sample.bam \
> pmarinus_sample.vcf.gz

Then use bcftools call with the -v option (output variant sites only) and -m (the multiallelic caller, since bcftools call needs a calling model to be chosen):

$ bcftools call -v -m pmarinus_sample.vcf.gz > pmarinus_sample_calls.vcf.gz
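The two steps can be wrapped in a small script. This is a hedged sketch, not a tested pipeline: call_variants, run, and DRY_RUN are hypothetical names, the input file names are placeholders, and -o is used in place of the shell redirect so the whole command can be printed. As far as I know, newer samtools releases have moved VCF/BCF output over to bcftools mpileup, so the flags below follow this thread and may need adjusting for a current install.

```shell
# Hypothetical wrapper around the two commands quoted in this thread.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

call_variants() {
  ref=$1; bam=$2; prefix=$3
  # Step 1: genotype likelihoods for every site, written as VCF (-v).
  run samtools mpileup -v --fasta-ref "$ref" -o "${prefix}.vcf.gz" "$bam"
  # Step 2: keep variant sites only (-v) using the multiallelic caller (-m).
  run bcftools call -v -m -o "${prefix}_calls.vcf" "${prefix}.vcf.gz"
}

# Placeholder file names; nothing is executed in dry-run mode.
DRY_RUN=1
call_variants genome.fasta sample.bam sample
```

Unsetting DRY_RUN runs the commands for real, which requires samtools and bcftools on the PATH.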

To visualize alignment data you can use samtools tview. The input required for this tool is a position-sorted and indexed BAM file. Like IGV, samtools tview loads the reference genome alongside the alignments so the reads can be compared with the reference sequence. Though as we learned Tuesday, IGV is really nice for playing around with alignment data, and seems much more interactive than the command-line based samtools tview.

aspanjer commented 7 years ago

Both SAM and BAM files are produced by short read aligners and contain information on where short reads align to the genome and the quality of the alignment. BAM files differ from SAM files in that they are in binary format and can't be read without software (vs. SAM files, which are plain text). Additionally, SAM files are 1-based and BAM files are 0-based. If using SAMtools, files should be converted to BAM format before calling variants, for efficiency.

First step is to use mpileup to generate genotype likelihoods for all sites in the genome or for a specified range:

$ samtools mpileup -v --fasta-ref genome.fasta sample.bam > output_file.vcf

To call variants (using bcftools):

$ bcftools call -v -m output_file.vcf > output_var.vcf

This can be visualized using "tview" from within SAMtools, which will line up reads alongside the reference genome.

Ellior2 commented 7 years ago

Sequence Alignment/Map (SAM) files are plain text, while BAM files are binary files that are more complex but more space-efficient. You can convert between the two file types using samtools.

To find variants such as SNPs, you could take a sorted and indexed BAM file and load it into a genome viewer such as IGV. You can also find variants by using samtools mpileup and its companion tool bcftools, which is a two-step process:

1) First use samtools mpileup: samtools mpileup -v --fasta-ref genome.fasta oyster.bam > output.vcf.gz
2) Then use bcftools to call variants: bcftools call -v -m output.vcf.gz > sample_calls.vcf.gz

To explore alignment data through the command line you can use samtools tview on a sorted and indexed BAM file. This command loads the reference genome alongside the alignments so you can compare the reads with the reference sequence.
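One quick way to see whether a file claims to be position-sorted is to look at the SO: tag on the @HD header line. The sketch below runs that check on a made-up text header; is_coordinate_sorted is a hypothetical helper name, and the header contents are invented for illustration.

```shell
# Report whether a SAM header declares coordinate sort order.
# The @HD line carries an SO: tag (e.g. SO:coordinate or SO:unsorted).
is_coordinate_sorted() {
  grep '^@HD' "$1" | grep -q 'SO:coordinate'
}

# Made-up header for illustration.
header_file=$(mktemp)
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' > "$header_file"

if is_coordinate_sorted "$header_file"; then
  sort_status="sorted"
else
  sort_status="not sorted"
fi
echo "$sort_status"
```

With samtools installed, the header of a BAM file can be printed with samtools view -H and checked the same way; samtools sort and samtools index are the usual commands for preparing a file for tview.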

nclowell commented 7 years ago

A SAM file is a tab delimited text file of a sequence alignment, and a BAM file is the binary version of a SAM file.

To look for SNPs, you can use pileup formats which stack reads and summarize variants. With samtools and bcftools this looks like:

samtools mpileup -v --fasta-ref genome.fasta sample.bam > output.vcf.gz
bcftools call -v -m output.vcf.gz > output_variants.vcf.gz

And the book suggests viewing with samtools tview.

mfisher5 commented 7 years ago

A SAM file is a tab-delimited text file that is used for storing large nucleotide sequence alignments. A BAM file is the binary version of the SAM file, containing all of the same information. BAM files are best for compressing data into small file sizes, while SAM files are easier to read and process using commands like awk and sed.

(1) Generate genotype likelihoods for every site in the genome

$ samtools mpileup -v --no-BAQ --fasta-ref ACod_genome.fasta sampleID.bam > sampleID_output.vcf.gz

(2) Call true variants and determine what each individual's genotype is:

$ bcftools call -v -m sampleID_output.vcf.gz > sampleID_variants.vcf.gz

You can visualize alignment data using samtools tview, which displays the alignment in the terminal; IGV is a separate graphical viewer. You can also jump to a specific region with the additional argument -p 1:<regionID>.

laurahspencer commented 7 years ago
jldimond commented 7 years ago

SAM and BAM are the current standard file types for storing information on reads aligned to a reference. SAM is a plain text format with header information, while BAM is the binary version of SAM. BAM is the filetype you would use to find sequence variants.

An example of the two-part process to find variants using samtools and bcftools:

samtools mpileup -v --fasta-ref reference.fa alignment.bam > output.vcf.gz
bcftools call -v -m output.vcf.gz > variants.vcf.gz

To visualize results you can use Unix commands like grep, or to get more in depth you can use samtools tview, and even fancier yet you can use IGV.
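As a rough illustration of the grep approach, the sketch below counts records and lists SNP positions in a tiny made-up VCF. The records are invented for illustration, and the file is uncompressed; a .vcf.gz file would need decompressing first (for example with gzip -dc).

```shell
# Tiny made-up VCF: two header lines and three records, one of them a deletion.
vcf_file=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t150\t.\tA\tG\t50\t.\t.\n'
  printf 'chr1\t300\t.\tCT\tC\t40\t.\t.\n'
  printf 'chr1\t420\t.\tT\tC\t60\t.\t.\n'
} > "$vcf_file"

# Count all variant records (every line that is not a "#" header line).
variant_count=$(grep -vc '^#' "$vcf_file")

# Keep only SNPs: single-base REF and single-base ALT.
snp_positions=$(awk -F'\t' '!/^#/ && length($4) == 1 && length($5) == 1 { print $1 ":" $2 }' "$vcf_file")

echo "$variant_count"
echo "$snp_positions"
```

The length test is a simplification: it treats any single-character ALT as a SNP, so a real file with symbolic or missing ALT values would need a stricter filter.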

mmiddleton commented 7 years ago

SAM and BAM are both common alignment formats for sequence data mapped to a reference, but BAM files are binary while SAM files are plain text files containing a header section (with lots of metadata) and an alignment section.

To find variants:

samtools mpileup -v --fasta-ref trout_genome.fasta sample_input.bam > sample_output.vcf.gz

bcftools call -v -m sample_output.vcf.gz > variant_sample_output.vcf.gz

The text suggests that IGV is very useful for looking at variant data.