Closed sr320 closed 7 years ago
Much of my alignment data is managed through stacks
; the stacks
pipeline uses alignments of short reads to build a catalog of loci, and then determines the location and alleles of SNPs by aligning the reads to the new catalog.
However, stacks
doesn't provide a lot of visuals with the alignment and analysis. I might want to get a closer look at how an individual sample's reads are aligning to the new catalog of loci, which could be achieved with samtools
.
Example code
samtools tview sampleID_01.bam cstacks_catalog.fasta
Outside of the stacks
pipeline, I'll need to align Pacific cod DNA sequences to the Atlantic cod genome, searching for overlaps between the Pacific cod catalog of loci (a "reference database") and a database for the Atlantic cod genome. I could do this using bedtools
.
Example code:
coverageBed -a cstacks_catalog.bam -b ACodGenome* -hist
I'm definitely interested in using annotate
to examine coverage of lncRNAs and miRNAs in my transcriptome. (ex. bedtools annotate -i samplemiRNA.bam -files transcriptome.bam
)
Using samtools tview
or IGV to look at sequence alignments, and where the features I'm annotating above are would be useful for me. It's a lot easier for me to conceptualize what I'm doing analytically if I can see it in front of me! (ex. samtools tview -p [some region] sample.bam OlyO_v6_transcriptome.fa
)
In addition to proteomics, I could also look at epigenetic variation between oysters reared under different hatchery conditions. With genomic data I could use bedtools intersect
to see where reads in one oyster replicate overlap with reads in another replicate or across treatments.
First I would need to sort the files:
samtools sort oysterrep1.bam oysterrep1_sort.bam
samtools sort oysterrep2.bam oysterrep2_sort.bam
Then index the new files:
samtools index oysterrep1_sort.bam
samtools index oysterrep2_sort.bam
Then filter the sorted files and reindex the new filtered files using samtools view
and samtools index
Then convert the filtered files to BED using bamtobed
Then merge the bed files using bedtools merge
Then I can intersect and look for regions of overlap between replicates or across treatments.
bedtools intersect [OPTIONS] -a <FILE> -b <FILE1, FILE2, ..., FILEN>
As Mary mentioned, much of what we could use bedtools
for, we do in Stacks.
When doubting our data, we sometimes want to double-check that our programs are working like we think they are. I could use bedtools
to call SNPs in stacks to verify that that part of the program is working.
bcftools call -v -m stack.vcf.gz > output_variants.vcf.gz
I could use jaccard
as a measure of how different two sets of data are in addition to the other statistical tests our lab usually conducts.
bedtools jacard -a sample1.vcf -b sample2.vcf -f .1
If I had a reference genome I could quantify the amount of coverage my RNAseq reads have for a specific region of interest in the genome:
First create a bam file using samtools for a region of interest from the genome:
samtools view -bh genome.bam chr1: (region of interest) > reference.bam
Then use bedtools to calculate coverage for that region with my reads that were aligned to the genome (perhaps using bowtie):
bedtools coverage -abam reads.bam -b reference.bam > reference_coverages.bed
If I wanted to check the iPyRad .vcf output I could use samtools to make the same file and compare, e .g.:
samtools mpileup -v --fasta-ref reference.fa alignment.bam > output.vcf.gz
bcftools call -v -m output.vcf.gz > variants.vcf.gz
But given that I have a de novo assembly, I am not sure what file I would use for the reference, if any.
Bedtools could be a good tool for summarizing read count info via genomecov
:
bedtools genomecov -i sample.bed -g reference
Again, not sure what the reference would be, but I believe that in de novo assembly mode, iPyRad comes up with a consensus sequence that is used as reference.
The vast majority of marine microbes remain uncultured, so I reference genomes are hard to come by. I may eventually be interested in assembling individual microbial genomes from metagenomes, similar to what was done by Albertson et al., 2013. I got some ideas of how to use samtools from this publication's awesome GitHub Pages guide to recreating their work.
I'd use bowtie2
to map meagenome reads to an assembled scaffolds and store the alignment in SAM format. I'd want to know about the coverage of each scaffold, so I could use the depth
function in samtools (first I'd convert SAM to BAM format and then sort it). Then depth
generates a position-based coverage profile of each scaffold:
samtools view -bS meta-alignment.sam > meta-alignment.bam
samtools sort meta-alignment.bam meta-alignment.sorted
samtools depth meta-alignment.sorted.bam > depth.txt
I could generate coverage values for each scaffold using the BEDtools function coverage
:
bedtools coverage -a a_region.bam -b a_scaffold.bam -bed
For my project I could potentially use bedtools
to look at my sequencing coverage across the genome and determine if I have sufficient coverage to make a call on percentage of methylation at CpG sites in regions of interest:
bedtools genomecov -i trout_sample.bed -g trout_genome.txt -bg | \
awk '$4 > 9' > sufficient_coverage.txt
In this example I am using bedtools genomecov
to inform me of the genome wide coverage but output that information to BedGraph
format. I would then pipe that output to an awk
command which only includes data with more than 9 reads (that information is contained in column 4 of the bedtools genomecov
command output). Finally, I output the information from the awk
command to a text file so that I can look at the ranges with sufficient coverage.
Additionally, I could use samtools tview
to quickly look at some of my alignments since it's nice to be able to visualize data:
samtools tview trout_sample.bam trout_genome.fasta
This code might produce a lot more than I really want to try and look at, so if I want to look at a specific region that I'm interested in I could use:
samtools tview -p <region of interest> trout_sample.bam trout_genome.fasta
bedtools intersect -v
with the geoduck genome scaffolds & transcriptome to extract all non-overlapping ranges (aka ranges not expressed); from there I could try to blast against Uniprot database to see if there are any mapped genes not expressed in the transcriptome that are present in the genome. Similarly, I expect that I could use genomecov
to summarizes the coverage (in percent) of methylation sites along my scaffolds:
$ bedtools intersect -a ranges-qry.bed -b ranges-sbj.bed -v
$ bedtools genomecov -i ranges-cov.bed -g cov.txt
samtools
with RNASeq data (that I recently received) to identify potential SNP locations using mpileup
and bcftools
, then map those to my genome:
samtools mpileup --v --fasta-ref \ [referencegenomefile.fasta] [inputfile.bam] > [ouputfile.vcf.gz] bcftools call -v -m [outputfile.vcf.gz] > [outputfile_SNPS.vcf.gz]
How could you use alignment data (and
samtools
) as well asbedtools
in your own project?Provide a least two examples of what said code might look like.