Closed sr320 closed 7 years ago
Five most relevant BedTools commands:
genomecov
= summarizes the coverage per chromosome sequence, and across the entire genome, as a histogram. It also summarizes several other key statistics, including the number of bases covered at a certain depth, and the total number of bases per chromosome.
coverage
= computes the depth and breadth of coverage of features in file B (i.e., a sample individual from one population) on the features in file A (i.e., a sample individual from another population, or from the same population to explore sources of within-population variation)
jaccard
= calculates similarity metrics for pairs of datasets to summarize similarities across samples, using the Jaccard Similarity statistic.
merge
= combines overlapping features (like short sequence reads) into a single feature (like a longer, contiguous sequence that could then be aligned to a genome)
igv
= integrates BEDTools with the IGV visualization software. Use igv to create a batch script that will provide an IGV snapshot.
Three common filetypes: BED, GTF/GFF, BAM
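To make the merge idea concrete, here is a minimal pure-Python sketch of combining overlapping ranges. It is not bedtools itself: the function name `merge_intervals` and the example coordinates are made up for illustration, and it assumes all ranges sit on one chromosome with BED-style coordinates (0-based start, exclusive end).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) ranges on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # This range overlaps (or touches) the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap: start a new merged range.
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Hypothetical read positions, for illustration only.
reads = [(100, 150), (120, 180), (300, 350), (170, 200)]
print(merge_intervals(reads))  # [(100, 200), (300, 350)]
```

Note that the merge works on coordinates only; the underlying sequences are not assembled.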
BEDTools allows the user to work directly with data in BED, GTF and GFF filetypes. Based on my research interests, here are the five BEDTools commands most relevant to me:
intersect
allows me to find overlapping regions between two separate range files. This will be useful for preliminary data exploration.
merge
would take overlapping ranges (possibly identified in intersect) and merge them into a single range.
genomecov
summarizes the coverage per chromosome sequence for the full genome. The output is a histogram. Another good exploratory tool.
annotate
takes BED files for CpG islands, methylation regions, etc. and annotates how much coverage those files have on another input file. This would be good to characterize how much of my exons are regulated by epigenetics.
getfasta
extracts sequences for a given set of ranges and puts them in .fasta files. I could use this to identify specific adapter sequences to trim out with FastQC.
I did a lot of Google searches for "BEDTools" and "metaproteogenomics". Since I haven't done so much with metagenomic data, and nothing with alignment data, I found this documentation from a comparative functional analysis workshop offered at UCSD. As a complement to looking 'directly' at proteins, it may be useful to do something like this: 1. classifying found genes with similar function into Clusters of Orthologous Groups (COGs) using WebMGA; 2. comparing the expression of the different COGs by looking at their coverage in different samples (say, in a depth profile). For this, the BEDTools command coverage would be useful for figuring out the coverage for every gene in every contig.
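The per-gene coverage idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the bedtools implementation: the function name `coverage`, the gene and read coordinates, and the single-contig assumption are all hypothetical. For each gene it reports the number of overlapping reads (depth-like) and the fraction of the gene covered by at least one read (breadth).

```python
def coverage(features, reads):
    """For each (start, end) feature, return
    (number of overlapping reads, bases covered, fraction of feature covered).

    Assumes BED-style coordinates and features of non-zero length.
    """
    results = []
    for f_start, f_end in features:
        covered = [False] * (f_end - f_start)
        n_reads = 0
        for r_start, r_end in reads:
            lo, hi = max(f_start, r_start), min(f_end, r_end)
            if lo < hi:  # the read overlaps this feature
                n_reads += 1
                for pos in range(lo, hi):
                    covered[pos - f_start] = True
        n_covered = sum(covered)
        results.append((n_reads, n_covered, n_covered / (f_end - f_start)))
    return results


# Hypothetical genes and reads on one contig.
genes = [(0, 100), (200, 300)]
reads = [(10, 60), (40, 90), (250, 270)]
for gene, row in zip(genes, coverage(genes, reads)):
    print(gene, row)
```

Running the same calculation per sample would give the kind of table needed to compare COG coverage across a depth profile.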
Some others that might be useful:
getfasta
extracts sequences for a given set of ranges and exports them as a FASTA file.
merge
takes overlapping sequence features and combines them into a single feature which spans all of the combined features.
bamtofastq
extracts FASTQ records from sequence alignments in BAM format.
map
overlaps features in one file onto features in another file and applies statistics and/or summary operations to those features.
Five most relevant BedTools commands:
1) intersect
will screen for overlaps between two sets of genomic features.
2) bamtobed
converts BAM files to BED files. Many analyses are easiest to perform using the BED format because BED files are simple tab-delimited files that describe segments of the genome, each defined by a chromosome with start and end coordinates.
3) coverage
will tell you how much of the genome your data covers.
4) genomecov
summarizes coverage of features along chromosome sequence and across the entire genome and creates a histogram.
5) annotate
annotates one BED/VCF/GFF file with the coverage and number of overlaps with multiple other files. This command will allow you to determine how one feature correlates with multiple other feature types.
Three common file types are BED, GTF, and GFF.
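As a rough illustration of what screening for overlaps means, here is a small Python sketch. It is an assumption-laden toy rather than bedtools: the function names (`overlap`, `intersect`, `no_overlap`) and the coordinates are invented, and everything is assumed to be on one chromosome with BED-style start and end coordinates.

```python
def overlap(a, b):
    """Return the overlapping portion of two (start, end) ranges, or None."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def intersect(a_ranges, b_ranges):
    """Overlapping portion of every A/B pair that overlaps."""
    hits = []
    for a in a_ranges:
        for b in b_ranges:
            ov = overlap(a, b)
            if ov is not None:
                hits.append(ov)
    return hits


def no_overlap(a_ranges, b_ranges):
    """A ranges that overlap nothing in B (the idea behind intersect -v)."""
    return [a for a in a_ranges
            if all(overlap(a, b) is None for b in b_ranges)]


# Hypothetical exons and one CpG island.
exons = [(100, 200), (300, 400), (500, 600)]
cpg = [(150, 320)]
print(intersect(exons, cpg))   # [(150, 200), (300, 320)]
print(no_overlap(exons, cpg))  # [(500, 600)]
```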
1) jaccard
generates a statistic for measuring the similarity between two sets of data based on the intersection of matching base pairs, which could be useful when comparing populations.
2) bamtofastq
extracts FASTQ records from BAM files, which may be useful when I'm interested in how the quality of certain reads influenced the alignment (for example, in a stack in the Stacks pipeline).
3) igv
combines bedtools with IGV visualization software for pretty visuals
4) fisher
is a statistical test of whether two sets of intervals are spatially related, which I might want to use when using paired-end (reverse files) data to construct contigs
5) genomecov
generates histograms and other summary files for describing genome coverage
Common file types in bedtools: BED, GTF, & GFF.
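The Jaccard statistic mentioned above is the number of base pairs shared by two interval sets divided by the number of base pairs in their union. A minimal Python sketch under stated assumptions (one chromosome, BED-style coordinates; the helper names `merged` and `jaccard` and the data are hypothetical and are not bedtools code):

```python
def merged(intervals):
    """Collapse overlapping (start, end) ranges so no base is counted twice."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1]:
            out[-1][1] = max(out[-1][1], end)
        else:
            out.append([start, end])
    return out


def jaccard(a, b):
    """Shared base pairs divided by base pairs in the union of both sets."""
    a, b = merged(a), merged(b)
    shared = sum(max(0, min(a_end, b_end) - max(a_start, b_start))
                 for a_start, a_end in a
                 for b_start, b_end in b)
    total = sum(e - s for s, e in a) + sum(e - s for s, e in b)
    union = total - shared
    return shared / union if union else 0.0


# Hypothetical intervals from two samples.
sample_a = [(0, 100), (50, 150)]   # covers bases 0-150
sample_b = [(100, 200)]            # covers bases 100-200
print(jaccard(sample_a, sample_b))  # 0.25
```

A value of 0 means no shared bases and 1 means the two sets cover exactly the same bases.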
Seeing that I'm working with a non-model organism and don't have a genome to work with, I don't have an immediate use for working with range data. If I did have a genome, though, I think these would be the five most relevant BedTools commands for working with and visualizing short-read expression data in relation to a reference genome:
coverage
: Could be used to calculate the amount of coverage my RNAseq reads have compared to the reference genome.
merge
: Could be used to construct complete coding sequences by comparing short reads to a genome and merging overlaps into complete sequences.
cluster
: Groups overlapping/nearby intervals together.
multicov
: Counts overlapping reads from multiple BAM files. Would be used for counting alignments to a genome from RNAseq reads in BAM format if I were interested in expression from specific regions of the genome.
igv
: Quick way to bring reads into IGV and visualize where the short reads line up with the genome.
Three common file types: BED, GFF/GTF, and BAM
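Since cluster and merge are easy to confuse, here is a small Python sketch of the clustering idea: every interval is kept, but each one is labelled with the ID of the group of overlapping or nearby intervals it belongs to. The function name `cluster_intervals`, the `max_gap` parameter, and the coordinates are assumptions for illustration, not the bedtools interface.

```python
def cluster_intervals(intervals, max_gap=0):
    """Label each (start, end) range with a cluster ID.

    Ranges join the current cluster when they start no more than
    `max_gap` bases past the furthest end seen in that cluster.
    """
    labelled = []
    cluster_id = 0
    cluster_end = None
    for start, end in sorted(intervals):
        if cluster_end is None or start > cluster_end + max_gap:
            # Too far from the current cluster: open a new one.
            cluster_id += 1
            cluster_end = end
        else:
            cluster_end = max(cluster_end, end)
        labelled.append((start, end, cluster_id))
    return labelled


# Hypothetical intervals on one chromosome.
ranges = [(100, 200), (180, 250), (250, 500), (501, 1000)]
for row in cluster_intervals(ranges):
    print(row)
```

Unlike a merge, the original intervals survive, so per-interval information is not lost.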
Five most relevant BedTools commands:
flank
: extracts sequence ranges flanking a region of interest; for example, it can be used to extract promoter regions
merge
: combines overlapping features into a single feature which spans all of the combined features.
intersect -wa or -wb
: extracts overlaps between two sets of ranges, and returns the entire range of A or B (depending on the option)
intersect -v
: returns all ranges in A that do not overlap any range in B
genomecov
: summarizes the coverage (in percent) of features along chromosome sequences
Three common filetypes used with BedTools:
I'm also not working with a model species with a genome, and iPyRad is sort of a one-stop shop for my analysis, but knowing a bit about how bedtools works helps me understand how iPyRad extracts data for the output files it provides, and how it works with the de novo assembly method. Given that sample coverage is particularly important for the EpiRAD data, there are a few bedtools commands that would be useful in this respect.
genomecov
is a tool that can provide coverage for a given chromosome. In my de novo assembly case, "chromosomes" are individual loci.
jaccard
can provide a summary of similarities across samples.
igv
would be useful to visualize data.
annotate
would allow me to compare coverage across samples.
multicov
I could see this being useful if, for example, I use a single sample as a reference in the future (e.g. a bleached sample to use as a pseudoreference genome).
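For the coverage question, the genomecov-style histogram can be sketched as counting how many bases of each "chromosome" (here, each de novo locus) sit at each read depth. This is a toy version under stated assumptions: the function name `depth_histogram`, the locus names, sizes, and read positions are all hypothetical.

```python
from collections import Counter


def depth_histogram(chrom_sizes, intervals):
    """Return {chrom: Counter({depth: number of bases at that depth})}.

    `intervals` holds (chrom, start, end) tuples with BED-style coordinates.
    """
    hist = {}
    for chrom, size in chrom_sizes.items():
        depth = [0] * size
        for c, start, end in intervals:
            if c == chrom:
                for pos in range(max(0, start), min(end, size)):
                    depth[pos] += 1
        hist[chrom] = Counter(depth)
    return hist


# Hypothetical loci and reads.
sizes = {"locus1": 10, "locus2": 5}
reads = [("locus1", 0, 6), ("locus1", 4, 10), ("locus2", 1, 3)]
for chrom, counts in depth_histogram(sizes, reads).items():
    for d in sorted(counts):
        # chrom, depth, bases at that depth, locus size, fraction of locus
        print(chrom, d, counts[d], sizes[chrom], counts[d] / sizes[chrom])
```

Comparing these per-locus tables between samples is one way to spot loci with uneven coverage.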
Five most relevant commands for this project or future research:
flank
allows you to extract flanking ranges (those on either side of a region of interest), which would be useful if I wanted to look at specific promoter regions and their methylation status, since methylation in promoters is associated with gene silencing
genomecov
shows coverage (with a histogram) of features along chromosome sequences which might be useful if I had an interest in the depth of coverage at a specific area of the genome
annotate
can take a set of files and show how much coverage each file has over another file which could be useful if I wanted to compare coverage across multiple samples
merge
merges overlapping ranges into a single range which could be useful if I wanted to combine some bits of my sequence that overlapped into a single sequence
multicov
counts the number of alignments in different BAM files that overlap with a single BED file, which could be useful in an experiment if I wanted to compare treatments to a control
Common file types: BED, GTF, GFF
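As a sketch of the flanking idea used for promoters, the following Python computes the ranges on either side of a gene and clips them to the chromosome bounds. The function names `flank` and `upstream`, the 500 bp window, and the gene coordinates are illustrative assumptions, not the bedtools interface.

```python
def flank(start, end, chrom_size, left=0, right=0):
    """Return the ranges flanking (start, end), clipped to the chromosome."""
    flanks = []
    if left > 0 and start > 0:
        flanks.append((max(0, start - left), start))
    if right > 0 and end < chrom_size:
        flanks.append((end, min(chrom_size, end + right)))
    return flanks


def upstream(start, end, strand, chrom_size, size):
    """Candidate promoter range: the flank on the 5' side of the gene."""
    if strand == "+":
        return flank(start, end, chrom_size, left=size)
    return flank(start, end, chrom_size, right=size)


# Hypothetical gene at 1000-2000 on a 5000 bp chromosome.
print(flank(1000, 2000, 5000, left=500, right=500))  # [(500, 1000), (2000, 2500)]
print(upstream(1000, 2000, "-", 5000, 500))          # [(2000, 2500)]
```

The resulting ranges could then be overlapped with methylation calls to ask whether a promoter is methylated.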
What do you consider the five most relevant BedTools commands based on your research interest?
Please list them and indicate what each one does.
What are three common filetypes used with BedTools?