sr320 / course-fish546-2016

6 stars 5 forks source link

Q3- Bringing it together #87

Closed sr320 closed 7 years ago

sr320 commented 7 years ago

How could you use alignment data (and samtools) as well as bedtools in your own project?

Provide a least two examples of what said code might look like.

mfisher5 commented 7 years ago

Much of my alignment data is managed through stacks; the stacks pipeline uses alignments of short reads to build a catalog of loci, and then determines the location and alleles of SNPs by aligning the reads to the new catalog.

However, stacks doesn't provide a lot of visuals with the alignment and analysis. I might want to get a closer look at how an individual sample's reads are aligning to the new catalog of loci, which could be achieved with samtools. Example code samtools tview sampleID_01.bam cstacks_catalog.fasta

Outside of the stacks pipeline, I'll need to align Pacific cod DNA sequences to the Atlantic cod genome, searching for overlaps between the Pacific cod catalog of loci (a "reference database") and a database for the Atlantic cod genome. I could do this using bedtools. Example code: coverageBed -a cstacks_catalog.bam -b ACodGenome* -hist

yaaminiv commented 7 years ago

I'm definitely interested in using annotate to examine coverage of lncRNAs and miRNAs in my transcriptome. (ex. bedtools annotate -i samplemiRNA.bam -files transcriptome.bam)

Using samtools tview or IGV to look at sequence alignments, and where the features I'm annotating above are would be useful for me. It's a lot easier for me to conceptualize what I'm doing analytically if I can see it in front of me! (ex. samtools tview -p [some region] sample.bam OlyO_v6_transcriptome.fa)

Ellior2 commented 7 years ago

In addition to proteomics, I could also look at epigenetic variation between oysters reared under different hatchery conditions. With genomic data I could use bedtools intersect to see where reads in one oyster replicate overlap with reads in another replicate or across treatments.

First I would need to sort the files: samtools sort oysterrep1.bam oysterrep1_sort.bam samtools sort oysterrep2.bam oysterrep2_sort.bam

Then index the new files: samtools index oysterrep1_sort.bam samtools index oysterrep2_sort.bam

Then filter the sorted files and reindex the new filtered files using samtools view and samtools index

Then convert the filtered files to BED using bamtobed

Then merge the bed files using bedtools merge

Then I can intersect and look for regions of overlap between replicates or across treatments. bedtools intersect [OPTIONS] -a <FILE> -b <FILE1, FILE2, ..., FILEN>

nclowell commented 7 years ago

As Mary mentioned, much of what we could use bedtools for, we do in Stacks.

When doubting our data, we sometimes want to double-check that our programs are working like we think they are. I could use bedtools to call SNPs in stacks to verify that that part of the program is working.

bcftools call -v -m stack.vcf.gz > output_variants.vcf.gz

I could use jaccard as a measure of how different two sets of data are in addition to the other statistical tests our lab usually conducts.

bedtools jacard -a sample1.vcf -b sample2.vcf -f .1

aspanjer commented 7 years ago

If I had a reference genome I could quantify the amount of coverage my RNAseq reads have for a specific region of interest in the genome:

First create a bam file using samtools for a region of interest from the genome:

samtools view -bh genome.bam chr1: (region of interest) > reference.bam

Then use bedtools to calculate coverage for that region with my reads that were aligned to the genome (perhaps using bowtie):

bedtools coverage -abam reads.bam -b reference.bam > reference_coverages.bed

jldimond commented 7 years ago

If I wanted to check the iPyRad .vcf output I could use samtools to make the same file and compare, e .g.:

samtools mpileup -v --fasta-ref reference.fa alignment.bam > output.vcf.gz bcftools call -v -m output.vcf.gz > variants.vcf.gz

But given that I have a de novo assembly, I am not sure what file I would use for the reference, if any.

Bedtools could be a good tool for summarizing read count info via genomecov:

bedtools genomecov -i sample.bed -g reference

Again, not sure what the reference would be, but I believe that in de novo assembly mode, iPyRad comes up with a consensus sequence that is used as reference.

MeganEDuffy commented 7 years ago

The vast majority of marine microbes remain uncultured, so I reference genomes are hard to come by. I may eventually be interested in assembling individual microbial genomes from metagenomes, similar to what was done by Albertson et al., 2013. I got some ideas of how to use samtools from this publication's awesome GitHub Pages guide to recreating their work.

I'd use bowtie2 to map meagenome reads to an assembled scaffolds and store the alignment in SAM format. I'd want to know about the coverage of each scaffold, so I could use the depth function in samtools (first I'd convert SAM to BAM format and then sort it). Then depth generates a position-based coverage profile of each scaffold:

samtools view -bS meta-alignment.sam > meta-alignment.bam
samtools sort meta-alignment.bam meta-alignment.sorted
samtools depth meta-alignment.sorted.bam > depth.txt

I could generate coverage values for each scaffold using the BEDtools function coverage:

bedtools coverage -a a_region.bam -b a_scaffold.bam -bed
mmiddleton commented 7 years ago

For my project I could potentially use bedtools to look at my sequencing coverage across the genome and determine if I have sufficient coverage to make a call on percentage of methylation at CpG sites in regions of interest: bedtools genomecov -i trout_sample.bed -g trout_genome.txt -bg | \

awk '$4 > 9' > sufficient_coverage.txt

In this example I am using bedtools genomecov to inform me of the genome wide coverage but output that information to BedGraph format. I would then pipe that output to an awk command which only includes data with more than 9 reads (that information is contained in column 4 of the bedtools genomecov command output). Finally, I output the information from the awk command to a text file so that I can look at the ranges with sufficient coverage.

Additionally, I could use samtools tview to quickly look at some of my alignments since it's nice to be able to visualize data: samtools tview trout_sample.bam trout_genome.fasta This code might produce a lot more than I really want to try and look at, so if I want to look at a specific region that I'm interested in I could use: samtools tview -p <region of interest> trout_sample.bam trout_genome.fasta

laurahspencer commented 7 years ago
  1. I can see myself using bedtools intersect -v with the geoduck genome scaffolds & transcriptome to extract all non-overlapping ranges (aka ranges not expressed); from there I could try to blast against Uniprot database to see if there are any mapped genes not expressed in the transcriptome that are present in the genome. Similarly, I expect that I could use genomecov to summarizes the coverage (in percent) of methylation sites along my scaffolds: $ bedtools intersect -a ranges-qry.bed -b ranges-sbj.bed -v $ bedtools genomecov -i ranges-cov.bed -g cov.txt
  2. I could anticipate incorporating what we've recently learned, using samtools with RNASeq data (that I recently received) to identify potential SNP locations using mpileup and bcftools, then map those to my genome: samtools mpileup --v --fasta-ref \ [referencegenomefile.fasta] [inputfile.bam] > [ouputfile.vcf.gz] bcftools call -v -m [outputfile.vcf.gz] > [outputfile_SNPS.vcf.gz]