sr320 / course-fish546-2016

Bedtool suite #86

Closed sr320 closed 7 years ago

sr320 commented 7 years ago

What do you consider the five most relevant BedTools commands based on your research interest?

Please list them and indicate what each one does.

What are three common filetypes used with BedTools?

mfisher5 commented 7 years ago

Five most relevant BedTools commands:

genomecov = summarizes the coverage per chromosome sequence, and across the entire genome, as a histogram. It also summarizes several other key statistics, including the number of bases covered at a certain depth, and the total number of bases per chromosome.

coverage = computes the depth and breadth of coverage of features in file B (i.e. a sample individual from one population) on the features in file A (i.e. a sample individual from another population, or from the same population to explore sources of within-population variation)

jaccard = calculates similarity metrics for pairs of datasets to summarize similarities across samples, using the Jaccard Similarity statistic.

merge = combines overlapping features (like short sequence reads) into a single feature (like a longer, contiguous sequence that could then be aligned to a genome)

igv = integrates BEDTOOLS with IGV visualization software. Use igv to create a batch script that will provide an IGV snapshot.

Three common filetypes: BED, GTF/GFF, BAM
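To make the merge description above concrete, here is a minimal Python sketch of the idea. It is not how bedtools is implemented; the function name `merge_intervals` and the example coordinates are made up for illustration, and intervals are assumed to be BED-style (0-based start, exclusive end). Note that it works on coordinates only and does not assemble any sequence.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Hypothetical helper: assumes BED-style 0-based, end-exclusive coordinates.
    """
    merged = []
    # Sorting by chromosome, then start, puts candidates for merging side by side.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


reads = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 10, 50)]
print(merge_intervals(reads))
# [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 10, 50)]
```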

yaaminiv commented 7 years ago

BEDTools allows the user to work directly with data in BED, GTF and GFF filetypes. Based on my research interests, here are the five BEDTools commands most relevant to me:

  1. intersect allows me to find overlapping regions between two separate range files. This will be useful for preliminary data exploration.
  2. merge would take overlapping ranges (possibly identified in intersect) and merge them into a single range.
  3. genomecov summarizes the coverage per chromosome sequence for the full genome. The output is a histogram. Another good exploratory tool.
  4. annotate takes BED files for CpG islands, methylation regions, etc. and annotates how much coverage those files have on another input file. This would be good to characterize how much of my exons are regulated by epigenetics.
  5. getfasta extracts sequences for a given set of ranges and puts them in .fasta files. I could use this to identify specific adapter sequences to trim out with FastQC.
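
The intersect idea from item 1 can be sketched in a few lines of Python. This is only an illustration of what "overlapping regions between two range files" means, not bedtools' own algorithm; `intersect`, `exons`, and `cpg` are hypothetical names, and coordinates are assumed to be BED-style (0-based start, exclusive end).

```python
def intersect(a, b):
    """Return the overlapping portion of every pair of intervals from a and b.

    Hypothetical helper: a simple all-pairs comparison, fine for small inputs.
    """
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap
                hits.append((chrom_a, start, end))
    return hits


exons = [("chr1", 100, 200), ("chr1", 300, 400)]
cpg = [("chr1", 150, 320), ("chr2", 100, 200)]
print(intersect(exons, cpg))
# [('chr1', 150, 200), ('chr1', 300, 320)]
```

The all-pairs loop is quadratic, so it is only meant to show the logic on toy data.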

MeganEDuffy commented 7 years ago

Relevant BEDtools commands for my research

I did a lot of Google searches for "BEDtools" and "metaproteogenomics". Since I haven't done much with metagenomic data, and nothing with alignment data, I found this documentation from a comparative functional analysis workshop offered at UCSD. As a complement to looking 'directly' at proteins, it may be useful to do something like this: 1. classifying found genes with similar function into Clusters of Orthologous Groups (COGs) using WebMGA; 2. comparing the expression of the different COGs by looking at their coverage in different samples (say, in a depth profile). For this, the BEDtools command coverage would be useful for figuring out the coverage for every gene in every contig.

Some others that might be useful:

Common filetypes used with BEDtools

  1. BED
  2. GFF/GTF
  3. BAM

Ellior2 commented 7 years ago

Five most relevant BedTools commands:

1) intersect will screen for overlaps between two genomic features.

2) bamtobed converts BAM files to BED files. Many analyses are easiest to perform using the BED format because BED files are simple tab-delimited files that describe segments of the genome, each defined by a chromosome with start and end coordinates.

3) coverage will tell you how much of the genome your data covers.

4) genomecov summarizes coverage of features along chromosome sequence and across the entire genome and creates a histogram.

5) annotate annotates one BED/VCF/GFF file with the coverage and number of overlaps with multiple other files. This command will allow you to determine how one feature correlates with multiple other feature types.

Three common file types are BED, GTF, and GFF.
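
Since the BED format described in item 2 is just tab-delimited text, reading it takes very little code. The sketch below assumes only the three required columns (chromosome, start, end) plus an optional name column; `BedRecord` and `parse_bed_line` are hypothetical names, and the 0-based, end-exclusive convention is the standard BED one.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int       # 0-based start
    end: int         # exclusive end
    name: str = "."  # optional 4th column


def parse_bed_line(line):
    """Parse one tab-delimited BED line into a BedRecord (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else "."
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name)


rec = parse_bed_line("chr1\t100\t250\texon1\n")
print(rec.chrom, rec.start, rec.end, rec.name)  # chr1 100 250 exon1
print(rec.end - rec.start)                      # 150 (feature length in bp)
```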

nclowell commented 7 years ago

1) jaccard generates a statistic for measuring the similarity between two sets of data based on the intersection of matching base pairs, which could be useful when comparing populations.

2) bamtofastq extracts fastq records from BAM files which may be useful when I'm interested in how the quality of certain reads influenced the alignment (for example, in a stack in the Stacks pipeline)

3) igv combines bedtools with IGV visualization software for pretty visuals

4) fisher is a statistical test for testing whether two sets of intervals are related spatially, which I might want to use when using paired end (reverse files) data to construct contigs

5) genomecov generates histograms and other summary files for describing genome coverage

Common file types in bedtools: BED, GTF, & GFF.
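
As a rough illustration of the Jaccard statistic mentioned in item 1, the sketch below computes base pairs in the intersection divided by base pairs in the union for two interval sets. It expands every interval into individual positions, which is only practical for toy data and is not how bedtools does it; `covered_bases`, `jaccard`, `pop1`, and `pop2` are hypothetical names.

```python
def covered_bases(intervals):
    """Expand (chrom, start, end) intervals into a set of (chrom, position)."""
    return {(chrom, pos) for chrom, start, end in intervals for pos in range(start, end)}


def jaccard(a, b):
    """Base pairs shared by a and b, divided by base pairs covered by either."""
    bases_a, bases_b = covered_bases(a), covered_bases(b)
    union = bases_a | bases_b
    if not union:
        return 0.0  # two empty sets: define similarity as 0 for this sketch
    return len(bases_a & bases_b) / len(union)


pop1 = [("chr1", 0, 100)]
pop2 = [("chr1", 50, 150)]
print(jaccard(pop1, pop2))  # 50 shared bp / 150 total bp
```

A value of 1.0 means the two sets cover exactly the same bases, and 0.0 means they share none.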

aspanjer commented 7 years ago

Seeing that I'm working with a non-model organism and don't have a genome to work with, I don't have an immediate use for working with range data. Though if I did have a genome I think these would be top five relevant BedTool commands for working with and visualizing short read expression data in relationship to a reference genome:

  1. coverage: Could be used to calculate the amount of coverage my RNAseq reads have compared to the reference genome
  2. merge: could be used to construct complete coding sequences by comparing short reads to a genome and merging overlaps into complete sequences.
  3. cluster: pull together overlapping/nearby intervals.
  4. multicov: Counts overlapping reads from multiple bam files. Would be used for counting alignments to a genome from RNAseq reads in BAM format if I was interested in expression from specific regions of the genome.
  5. igv: quick way to bring reads into IGV and visualize where the short reads line up with the genome.

Three common file types: BED, GFF/GTF, and BAM
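
The counting described under multicov can be sketched as follows. This is a toy version that takes alignments as plain (chrom, start, end) tuples rather than reading BAM files, so it only shows the counting logic; `multicov`, `regions`, and the sample names are hypothetical.

```python
def multicov(regions, samples):
    """Count alignments overlapping each region, one count column per sample.

    Hypothetical helper: samples maps a sample name to a list of
    (chrom, start, end) alignments with 0-based, end-exclusive coordinates.
    """
    table = []
    for chrom, start, end in regions:
        counts = [
            sum(1 for c, s, e in reads if c == chrom and s < end and e > start)
            for reads in samples.values()
        ]
        table.append((chrom, start, end, *counts))
    return table


regions = [("chr1", 0, 100), ("chr1", 200, 300)]
samples = {
    "control": [("chr1", 10, 60), ("chr1", 90, 140), ("chr1", 250, 300)],
    "treated": [("chr1", 220, 270)],
}
print(multicov(regions, samples))
# [('chr1', 0, 100, 2, 0), ('chr1', 200, 300, 1, 1)]
```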

laurahspencer commented 7 years ago

Five most relevant BedTools commands:

Three common filetypes used with BedTools:

jldimond commented 7 years ago

I'm also not working with a model species with a genome, and iPyRad is sort of a one stop shop for my analysis, but knowing a bit about how bedtools works helps me understand how iPyRad extracts data for the output files it provides, and how it works with the de novo assembly method. Given that sample coverage is important particularly for the EpiRAD data, there are a few bedtools commands that would be useful in this respect.

genomecov is a tool that can provide coverage for a given chromosome. In my de novo assembly case, "chromosomes" are individual loci.

jaccard can provide a summary of similarities across samples.

igv would be useful to visualize data.

annotate would allow me to compare coverage across samples.

multicov could be useful if, for example, I use a single sample as a reference in the future (e.g. a bleached sample to use as a pseudoreference genome).
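
To show what a genomecov-style coverage histogram looks like when "chromosomes" are individual loci, here is a minimal Python sketch. It tallies how many bases of each locus sit at each depth; it is an illustration of the idea only, and `depth_histogram`, `locus1`, and the locus length are made-up values.

```python
from collections import Counter


def depth_histogram(intervals, chrom_sizes):
    """Return {chrom: {depth: number_of_bases}}, including depth 0.

    Hypothetical helper: intervals are (chrom, start, end) with 0-based,
    end-exclusive coordinates; chrom_sizes maps each chrom (or locus) to its length.
    """
    hist = {}
    for chrom, size in chrom_sizes.items():
        depth = [0] * size
        for c, start, end in intervals:
            if c == chrom:
                # Clip to the locus length so out-of-range ends are ignored.
                for pos in range(start, min(end, size)):
                    depth[pos] += 1
        hist[chrom] = dict(Counter(depth))
    return hist


reads = [("locus1", 0, 6), ("locus1", 4, 10)]
print(depth_histogram(reads, {"locus1": 12}))
# {'locus1': {1: 8, 2: 2, 0: 2}}
```

Here 8 bases are covered once, 2 bases twice, and 2 bases not at all.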

mmiddleton commented 7 years ago

Five most relevant commands for this project or future research:

flank allows you to extract flanking ranges (those on either side of a region of interest), which would be useful if I wanted to look at specific promoter regions and their methylation status, since methylation in promoters is associated with gene silencing

genomecov shows coverage (with a histogram) of features along chromosome sequences which might be useful if I had an interest in the depth of coverage at a specific area of the genome

annotate can take a set of files and show how much coverage each file has over another file which could be useful if I wanted to compare coverage across multiple samples

merge merges overlapping ranges into a single range which could be useful if I wanted to combine some bits of my sequence that overlapped into a single sequence

multicov counts the number of alignments in different BAM files that overlap with a single BED file, which could be useful in an experiment if I wanted to compare treatments to a control

Common file types: BED, GTF, GFF
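
The flank idea above can be sketched like this. The sketch ignores strand, so it returns both sides of each feature rather than only the upstream promoter side, and it clips flanks at the chromosome ends; `flank`, `genes`, and the chromosome size are hypothetical values chosen for illustration.

```python
def flank(intervals, chrom_sizes, width):
    """Return the regions of `width` bp on either side of each interval.

    Hypothetical helper: coordinates are 0-based, end-exclusive; flanks are
    clipped to [0, chromosome length] and empty flanks are dropped.
    """
    flanks = []
    for chrom, start, end in intervals:
        left = (chrom, max(0, start - width), start)
        right = (chrom, end, min(chrom_sizes[chrom], end + width))
        for region in (left, right):
            if region[1] < region[2]:  # keep only non-empty flanks
                flanks.append(region)
    return flanks


genes = [("chr1", 50, 200), ("chr1", 900, 1000)]
print(flank(genes, {"chr1": 1000}, 100))
# [('chr1', 0, 50), ('chr1', 200, 300), ('chr1', 800, 900)]
```

The second gene ends at the chromosome end, so it has no right-hand flank.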