sr320 / course-fish546-2021

Alignment file types #58

Closed sr320 closed 3 years ago

sr320 commented 3 years ago

What is the difference between the SAM and BAM file types? Which one could you use to find variants (i.e., SNPs), and what would an example command look like?

What is one means by which the textbook indicates this can be visualized?

skreling commented 3 years ago

SAM files are plain text, while BAM files are their binary counterpart. Many samtools subcommands require the input file to be binary (BAM).
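A quick way to see the text-versus-binary difference is to look at the first two bytes of a file. BAM is BGZF-compressed, so it starts with the gzip magic bytes 1f 8b, while a SAM file starts with ordinary text (usually an @ header line). This is only a rough sketch: alignment_type is a made-up helper name, not a samtools subcommand, and a gzipped SAM file would also be reported as BAM-like.

```shell
# alignment_type is a hypothetical helper, not part of samtools.
# It reports whether a file starts with the gzip/BGZF magic bytes (BAM-like)
# or with anything else (treated here as plain-text SAM).
alignment_type() {
    # Read the first two bytes and print them as hex with no spaces.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "BAM-like (compressed binary)"
    else
        echo "SAM-like (plain text)"
    fi
}
```

Usage would be something like alignment_type filename.bam; it is a heuristic only and does not validate the file contents.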

To find SNPs, you first need a position-sorted BAM file, created with: samtools sort inputfilename newfileprefix

We can then index the sorted BAM file (note that a SAM file cannot be indexed): samtools index filename.bam # this creates a .bai file that contains the index

*skipping a few steps

To identify SNPs we'll use mpileup, which requires BAM files:

samtools mpileup --no-BAQ --region 1:regionofintereststart-regionofinterestend \
--fasta-ref referencefiles.fasta samplefile.BAM

As the first step toward calling SNPs, generate genotype likelihoods for every site (here, every site in the region) and write them to a Variant Call Format (.vcf) file, or to BCF by using -g instead of -v:

samtools mpileup -v --no-BAQ --region 1:regstart-regend \
--fasta-ref referencefile.fasta samplefile.BAM \
> sample.vcf.gz

To make sure misalignments in low-complexity areas are not causing erroneous SNP calls, rerun mpileup with BAQ left on (no --no-BAQ) and inspect the output:

samtools mpileup -u -v --region 1:regionofinterest \
--fasta-ref referencefile.fasta samplefile.BAM > \
outputfile.vcf.gz

grep -v "^##" outputfile.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1, $2, $4, $5, $6, a[1], $9, $10}'
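To see what the grep and awk step does without needing samtools, here is a small sketch that runs the same filter on a made-up, uncompressed VCF. The two records are invented purely for illustration and are not real output; the point is that grep -v "^##" drops the meta-information lines, and split($8, a, ";") breaks the INFO column on semicolons so that a[1] is its first field.

```shell
# Made-up two-record VCF for illustration only (not real data).
vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n'
    printf '1\t100\t.\tA\tG,<*>\t0\t.\tDP=12;I16=1,2\tPL\t0,30,200\n'
    printf '1\t101\t.\tC\t<*>\t0\t.\tDP=9;I16=3,4\tPL\t0,27\n'
} > "$vcf"

# Drop the ## meta-information lines, then print selected columns.
# split() stores the semicolon-separated INFO fields in array a,
# so a[1] is the first INFO field (DP in this toy file).
summary=$(grep -v "^##" "$vcf" |
    awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1, $2, $4, $5, $6, a[1], $9, $10}')
printf '%s\n' "$summary"
```

The column header line (starting with a single #) is kept, which makes the trimmed table easier to read.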

aspencoyle commented 3 years ago

SAM (Sequence Alignment/Map) and BAM files are quite similar to each other. Essentially, BAM files are the binary analog of SAM files. This means the two can readily be converted from one to the other using samtools view -b filename.sam > filename.bam or, for BAM to SAM, samtools view -h filename.bam > filename.sam.

To find variants, you need two things - at least one indexed (and usually sorted) BAM file, and a reference sequence, typically in FASTA format. This reference sequence should be the same one used for mapping.

Here's what a command would look like with a single BAM file.

Notes on options:
--no-BAQ turns off base alignment quality, to keep things simple
--region limits the pileup to a specific region, given as chromosome:start-end

samtools mpileup \
--no-BAQ \
--region chromosome:12345-23456 \
--fasta-ref filename.fasta \
filename.bam
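Since the region string is easy to get wrong, here is a small sketch of a check for the name:start-end form. A bare range such as 12345-23456 has no sequence name, so it would not select the positions you intended. valid_region is a hypothetical helper, and it is deliberately stricter than samtools itself (which, as far as I know, also accepts a bare sequence name or name:start); it also assumes the sequence name contains no colon.

```shell
# valid_region is a hypothetical helper, not part of samtools.
# It accepts only the form name:start-end with start <= end.
valid_region() {
    # Require a sequence name, a colon, and two integers separated by a dash.
    printf '%s\n' "$1" | grep -Eq '^[^:]+:[0-9]+-[0-9]+$' || return 1
    range=${1##*:}      # text after the colon, e.g. 12345-23456
    start=${range%-*}   # part before the dash
    end=${range#*-}     # part after the dash
    [ "$start" -le "$end" ]
}

for r in 1:12345-23456 12345-23456 chr3:500-100; do
    if valid_region "$r"; then echo "$r ok"; else echo "$r rejected"; fi
done
```

Only the first of the three test strings passes: the second has no sequence name and the third has its start after its end.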

There are a few options for visualizing variants. If you're examining just a small number of SNPs, you can use samtools tview. If you're looking to do a more in-depth examination, the Java application IGV (Integrative Genomics Viewer) might be a good choice.

laurel-nave-powers commented 3 years ago

SAM and BAM are common formats for high-throughput sequencing alignment data. The main difference between the two is that SAM is plain text and BAM is the binary analog of SAM. To start finding variants like SNPs you need a position-sorted and indexed BAM file. An example command to use with a BAM file:

samtools mpileup --no-BAQ --region chromosome:####-#### --fasta-ref name.fasta name.bam

To visualize variants you can use samtools tview to quickly look at alignments in the terminal. This same command can be used to view the reference sequence.

jdduprey commented 3 years ago

SAM and BAM are sequence alignment data file formats. BAM is the binary version of a SAM file. The samtools commands used in the text all require the input file to be in BAM format for efficiency.

samtools mpileup --no-BAQ --region chr#:base####-#### \
--fasta-ref some-reference-genome.fasta some-input-sequence.bam

The above command would print plain-text pileup output that summarizes the bases at each chromosome position. The input file needs to be an indexed and position-sorted BAM file.
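To give a feel for the pileup text, here is a sketch that scans a few made-up pileup lines for positions where some reads disagree with the reference. The six columns are sequence name, position, reference base, depth, read bases, and base qualities; in the read-bases column, . and , mean a match to the reference and a letter means a mismatch. The data are invented for illustration, and the parsing is simplified: it strips read-start (^ plus one character) and read-end ($) markers but does not handle indel notation.

```shell
# Made-up pileup lines for illustration only (not real data).
pileup=$(mktemp)
{
    printf '1\t100\tA\t4\t.,.,\tIIII\n'
    printf '1\t101\tC\t5\t.,Tt.\tIIIII\n'
    printf '1\t102\tG\t3\t^].,$a\tIII\n'
} > "$pileup"

# Print name, position, reference base, depth, and mismatch count
# for every position with at least one mismatching read base.
mismatches=$(awk 'BEGIN{OFS="\t"} {
    bases = $5
    gsub(/\^./, "", bases)               # drop read-start marker and its mapping-quality character
    gsub(/\$/, "", bases)                # drop read-end markers
    n = gsub(/[ACGTNacgtn]/, "", bases)  # gsub returns how many mismatch letters it removed
    if (n > 0) print $1, $2, $3, $4, n
}' "$pileup")
printf '%s\n' "$mismatches"
```

In this toy input, positions 101 and 102 are reported and position 100 is skipped because all four reads match the reference.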

samtools tview -p brings up a text-based terminal application for viewing a few sequence variants. The text suggests using the Integrative Genomics Viewer (IGV) for more demanding variant analysis.

Brybrio commented 3 years ago

BAM files are the binary analogs of SAM files, and both store large amounts of alignment data. They can be converted to each other using samtools view -b celegans.sam > celegans_copy.bam or samtools view -h celegans.bam > celegans_copy.sam. To find SNP variants, I would use sorted BAM files (sorting helps disk-space efficiency), since most of the necessary samtools subcommands require BAM input for efficiency.

An example of code to identify variants could be:

samtools sort Salmon1_unsorted.bam Salmon1_sorted # sorting
samtools index Salmon1_sorted.bam # indexing
samtools mpileup -v --no-BAQ --region chromosome#:region#-region# --fasta-ref REFERENCE.fasta INPUT.bam > OUTPUT.vcf.gz # generating genotype likelihoods and writing them in Variant Call Format
bcftools call -v -m OUTPUT.vcf.gz > OUTPUT_calls.vcf.gz # checking whether sites are really variants
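The commands in this thread follow the textbook's samtools syntax. As far as I know, newer releases (samtools 1.3 and later) take the sort output through -o instead of a trailing prefix, and genotype-likelihood output has moved from samtools mpileup to bcftools mpileup, so it is worth checking against your installed version. Below is a rough sketch of the same sort, index, mpileup, call sequence in that newer style. call_variants and run are hypothetical helper names, the file names and region are placeholders, and by default the script only prints the commands rather than running them.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# run is a hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# call_variants is a hypothetical wrapper.
# Arguments: unsorted BAM, reference FASTA, region (name:start-end), output prefix.
call_variants() {
    bam=$1; ref=$2; region=$3; prefix=$4
    # Position-sort the alignments, then index the sorted file.
    run samtools sort -o "${prefix}_sorted.bam" "$bam"
    run samtools index "${prefix}_sorted.bam"
    # Compute genotype likelihoods for the region as uncompressed BCF.
    run bcftools mpileup -f "$ref" -r "$region" -O u -o "${prefix}_raw.bcf" "${prefix}_sorted.bam"
    # Call variant sites only (-v) with the multiallelic caller (-m), writing compressed VCF.
    run bcftools call -m -v -O z -o "${prefix}_calls.vcf.gz" "${prefix}_raw.bcf"
}

# The region 1:100-200 is a placeholder, not taken from any real dataset.
call_variants Salmon1_unsorted.bam REFERENCE.fasta 1:100-200 Salmon1
```

Setting DRY_RUN=0 would execute the commands for real, which requires samtools and bcftools to be installed and the input files to exist.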

According to the textbook, variants can be visualized with the samtools tview command, or with the Integrative Genomics Viewer (igv) for further inspection.

meganewing commented 3 years ago

BAM and SAM store the same type of data (alignment mapping data), but BAM is binary and SAM is plain text. BAM takes up less space and is more efficient for processing, but it's relatively easy to convert between the two.

Finding variants can be done with mpileup. For example, we could look for variants in region 123456-124456 of chromosome 3, using the sorted and indexed BAM file variantOutput.bam as input and comparing it against our reference genome FASTA file, referenceGenome.fasta:

$ samtools mpileup --no-BAQ --region 3:123456-124456 --fasta-ref referenceGenome.fasta variantOutput.bam

mpileup prints its results to standard output, so redirect with > to save them to a file. Since sort, index, and mpileup only work with BAM (for the sake of efficiency), we would use a BAM file for this rather than SAM.

This could be visualized with samtools tview for a terminal-based visualization, or with IGV for a visualization better suited to more thorough investigation of your newfound variants (plus it's much easier on the eyes, in my opinion).

dippelmax commented 3 years ago

SAM files are text and BAM files are binary. You can find SNPs by sorting and indexing the alignment file, then generating pileup output:

! samtools sort file_name_unsorted.sam file_name_sorted
! samtools index file_name_sorted.bam
! samtools mpileup --no-BAQ --region 1:215906528-215906567 \
--fasta-ref reference_genome.fasta file_name_sorted.bam

These can be visualized with samtools tview.
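Because mpileup with a region needs both the index and the reference, a small pre-flight check can catch a missing file before the real command fails. This is just a sketch: check_inputs is a hypothetical helper, and it assumes the index sits next to the BAM as file.bam.bai, which is where samtools index puts it by default (other index names or formats are not checked).

```shell
# check_inputs is a hypothetical helper; it only tests that the files exist.
# Arguments: sorted BAM file, reference FASTA.
check_inputs() {
    bam=$1; ref=$2
    [ -f "$bam" ] || { echo "missing BAM: $bam"; return 1; }
    # samtools index writes file.bam.bai next to file.bam by default.
    [ -f "$bam.bai" ] || { echo "missing index: $bam.bai"; return 1; }
    [ -f "$ref" ] || { echo "missing reference: $ref"; return 1; }
    echo "inputs look ready"
}
```

For example, check_inputs file_name_sorted.bam reference_genome.fasta would report a missing index if samtools index had not been run yet.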