minisciencegirl / studyGroup

http://minisciencegirl.github.io/studyGroup/

Bioconductor discussion/demo? #17

Closed ksamuk closed 8 years ago

ksamuk commented 9 years ago

"Bioconductor (http://www.bioconductor.org/) is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language."

Would anyone be interested in a session focused on demoing packages from this project? Perhaps several people could give demos of some of the Bioconductor packages they use? I feel a broad overview would be useful given the 1024 :scream: packages they maintain!

bkatiemills commented 9 years ago

Absolutely! Tool demos are awesome.

jennybc commented 9 years ago

I am teaching for bioinformatics.ca in mid-June, so if anyone works up something for this discussion/demo that overlaps with that content, maybe they could help out with the workshop? Roles I would contemplate, based on content, teaching experience, etc., would be helper, TA, or session instructor. This is a paid gig, BTW.

I would definitely welcome more Bioconductor stuff as it is a genomics-focused workshop, whereas I've mostly got good materials and knowledge on the "just" R side.

one day kick-off workshop: http://bioinformatics.ca/workshops/2015/introduction-r-bc-2015

but mostly this two-day one is what I'm talking about: http://bioinformatics.ca/workshops/2015/exploratory-analysis-biological-data-using-r-bc-2015

sjackman commented 9 years ago

I am interested in helping out, but I have a thesis committee meeting (as you know) two days before. I'll be madly writing my thesis proposal, so unfortunately I won't have much (any) time to help prepare material, but I could attend to help out.

jooolia commented 9 years ago

Would love to see some demos of what people use from there. I use phyloseq sometimes ("Handling and analysis of high-throughput microbiome census data." http://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) for quick looks at species-abundance tables via community ecology metrics. I would be interested in knowing/seeing if people are using the flow cytometry workflows on Bioconductor.

ahippman commented 9 years ago

I would definitely come!

The only Bioconductor package I use right now is edgeR, to look into differential expression of genes in an EST library. I could give a short 10-minute intro to that.

Cheers, Anna

jstaf commented 9 years ago

I could do invoking R as part of a command-line script, biomaRt, and DESeq2 (and possibly cummeRbund from the Cufflinks pipeline).

DESeq2 and cummeRbund are pretty heavy-duty transcriptomics tools and have a lot of overlap with edgeR. biomaRt, however, is a pretty fantastic tool that you can use to retrieve practically any amount of data from online databases like Ensembl or InterPro (here's an old tutorial I wrote on it a while back for the Zoology R Club).

I generally chain DESeq2 and biomaRt together so you get both a set of differentially expressed genes and their annotations from an RNA-Seq run (makes analysing your data reallllly easy).

@jennybc - I would totally be interested in helping out with the workshops, given the chance! In particular, I think biomaRt is almost a requirement for any genomics work in R these days.

jennybc commented 9 years ago

@kazi11 OK I will definitely be in touch re: biomaRt. I have not quite turned my mind to that workshop but must do so soon. The mandate is data exploration, so I'm definitely on the lookout for some Bioconductor content that has a very general appeal and is exploration-oriented. I've already put out feelers to, e.g., the STAT 545 TA crew, and there is some interest there as well.

ksamuk commented 9 years ago

Sounds like there is lots of potential here, everyone! I also use biomaRt heavily, but would be excited to learn more. I'd be happy to also demo GenomicRanges/IRanges, which provide general purpose containers for genomic data with known lengths (e.g. NGS reads, genes, chromosomes) and methods to compare them (e.g. find overlaps). GRanges/IRanges also show up a lot in other Bioconductor packages.
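To make the "find overlaps" idea concrete, here is a minimal, language-neutral sketch in Python. It is not the GenomicRanges/IRanges API: the `Interval` type and `find_overlaps` function are hypothetical names made up for illustration, and the brute-force pairwise comparison is an assumption chosen for clarity rather than speed.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A 1-based, closed genomic interval (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


def find_overlaps(query: List[Interval], subject: List[Interval]) -> List[Tuple[int, int]]:
    """Return (query_index, subject_index) for every overlapping pair."""
    hits = []
    for qi, q in enumerate(query):
        for si, s in enumerate(subject):
            # Two closed intervals on the same chromosome overlap when
            # each one starts no later than the other one ends.
            if q.chrom == s.chrom and q.start <= s.end and s.start <= q.end:
                hits.append((qi, si))
    return hits


# Toy data: three reads and two genes (coordinates are invented).
reads = [Interval("chr1", 100, 150), Interval("chr1", 400, 450), Interval("chr2", 100, 150)]
genes = [Interval("chr1", 120, 300), Interval("chr2", 500, 900)]

overlaps = find_overlaps(reads, genes)
print(overlaps)
```

Only the first read lands on a gene here. For real data you would want an indexed approach (sorted intervals or an interval tree) rather than this quadratic loop.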

minisciencegirl commented 9 years ago

Can anyone suggest a decent R package to visualize SNPs from multiple VCF files to compare differences between various strains? And build a phylogeny from SNPs in these multiple VCF files?

ksamuk commented 9 years ago

VariantAnnotation allows reading in VCFs. ape allows you to build various types of trees from a FASTA file.

I don't usually do this in R, but the general approach would be to build a FASTA from the VCF (either by just concatenating the SNPs, or by using the reference to mark gaps), then feed that into a tree-building package (e.g. ape). If you want to build any fancier trees you could use BEAST. In our lab, we've been using SplitsTree for basic visualization of relationships among populations. It builds an unrooted phylogenetic network rather than a tree.
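The "concatenate the SNPs into a FASTA" step could look roughly like the Python sketch below. It rests on several loud assumptions that real data may violate: one VCF per strain, all called against the same reference, haploid calls, only single-base biallelic records, and a site missing from a strain's VCF is treated as the reference allele (which ignores missing coverage). The function names `read_snps` and `snps_to_fasta` are hypothetical and the genotype columns are ignored entirely.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def read_snps(vcf_text: str) -> Dict[Site, Tuple[str, str]]:
    """Map (chrom, pos) -> (ref, alt) for single-base, biallelic records."""
    snps = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            snps[(chrom, int(pos))] = (ref, alt)
    return snps


def snps_to_fasta(vcfs: Dict[str, str]) -> str:
    """Concatenate the alleles at every variable site into one sequence per strain."""
    per_strain = {name: read_snps(text) for name, text in vcfs.items()}
    # Union of variable sites across strains, with the reference base at each.
    ref_base: Dict[Site, str] = {}
    for snps in per_strain.values():
        for site, (ref, _alt) in snps.items():
            ref_base[site] = ref
    sites = sorted(ref_base)
    records = []
    for name, snps in per_strain.items():
        # ASSUMPTION: no record at a site means the strain carries the reference base.
        seq = "".join(snps[s][1] if s in snps else ref_base[s] for s in sites)
        records.append(f">{name}\n{seq}")
    return "\n".join(records) + "\n"


header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n"
vcfs = {
    "strain_a": header + "chr1\t10\t.\tA\tG\nchr1\t25\t.\tC\tT\n",
    "strain_b": header + "chr1\t25\t.\tC\tT\nchr1\t40\t.\tG\tA\n",
}
fasta = snps_to_fasta(vcfs)
print(fasta)
```

The resulting alignment of variable sites is the kind of input a tree-building tool expects; keep in mind that a SNP-only alignment can bias branch lengths unless the tool is told about it.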

minisciencegirl commented 9 years ago

Thanks Kieran!!

I will take a look at these R packages.

I am also looking at SNPRelate as another option.

Cheers,

Amy

jstaf commented 9 years ago

If we ever end up doing this one, I think I just might cover using biomaRt to retrieve gene/protein annotations + DNA sequences, as well as to interconvert between different types of gene annotation IDs.

The reason I mention this is that after the Hadley webinar this morning, people were mentioning how hard it was to use some type of bioinformatics data because of how tough it is to get the annotations to match up (for instance, Drosophila has genes annotated either as FBgn#s, CG#s, or actual gene names). biomaRt makes this (somewhat) painless, and is pretty good at batch converting weird IDs to something useful.
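The batch conversion idea boils down to a lookup table keyed on every known alias. The Python sketch below illustrates that shape only; it does not call biomaRt or any online database, the function names are hypothetical, and the IDs in `ANNOTATION_ROWS` are made up rather than real FlyBase records.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# MADE-UP rows for illustration: (FBgn-style ID, CG-style ID, gene symbol).
ANNOTATION_ROWS: List[Tuple[str, str, str]] = [
    ("FBgn9990001", "CG99001", "geneA"),
    ("FBgn9990002", "CG99002", "geneB"),
    ("FBgn9990003", "CG99003", "geneC"),
]


def build_lookup(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Index every alias (case-insensitively) to the row's gene symbol."""
    lookup = {}
    for row in rows:
        symbol = row[-1]
        for identifier in row:
            lookup[identifier.lower()] = symbol
    return lookup


def convert_ids(ids: Iterable[str], lookup: Dict[str, str]) -> List[Optional[str]]:
    """Convert a mixed batch of IDs; unknown IDs come back as None."""
    return [lookup.get(i.strip().lower()) for i in ids]


lookup = build_lookup(ANNOTATION_ROWS)
converted = convert_ids(["CG99002", "fbgn9990001", "geneC", "CG00000"], lookup)
print(converted)
```

Returning `None` for unknown IDs, instead of silently dropping them, makes it easy to see how much of a gene list failed to map against a given annotation version.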

radaniba commented 9 years ago

If you ever end up doing this one, you should absolutely proceed by extracting annotation from different sources (demoing the packages), like Ensembl and UCSC, for example. The reason is that annotations are sometimes not uniform: there is a lot of discrepancy across databases, and this is what actually makes extracting bioinformatics data difficult.

I had the chance to work on different projects involving gene set enrichment, and let me tell you that the results are different most of the time. This is a problem in itself, regardless of the tool (package) used. The tools that at some point generate p-values or statistics generally rely on a background set, which is itself extracted and curated from somewhere. If this background is faulty, it becomes very hard to tell the real picture from the noisy one.
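As a rough illustration of how much the background matters, here is a small Python sketch of a one-sided hypergeometric enrichment test, which is one common way (not the only one, and not necessarily what any particular package does) to score gene set overlap. The function name and all of the counts are invented for the example.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from a background of N, K of which are in the set."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# INVENTED numbers: a 50-gene hit list, 5 of which fall in a 200-gene set.
hits_in_set, list_size, set_size = 5, 50, 200

# Same overlap, two different choices of background.
p_large_background = hypergeom_upper_tail(hits_in_set, list_size, set_size, 20000)
p_small_background = hypergeom_upper_tail(hits_in_set, list_size, set_size, 5000)

print(f"background of 20000 genes: p = {p_large_background:.3g}")
print(f"background of  5000 genes: p = {p_small_background:.3g}")
```

The overlap is identical in both calls, yet the smaller background makes it look far less surprising, because more hits are expected by chance. Two tools that curate their backgrounds differently can therefore disagree without either having a bug.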

jstaf commented 9 years ago

Yeah, I know what you mean. In particular, I've had a lot of issues with FBgn#s changing extremely rapidly with each new annotation version. This gets made worse by the fact that Ensembl (seems like the only place with FlyBase's data) is often hopelessly out of date, or has a weird version that no one else uses. Pretty sure biomaRt has the ability to pull archived/previous versions of annotation, but that particular workaround only works when the database you're accessing is actually ahead of the annotation you're using. Fun stuff.

radaniba commented 9 years ago

Interesting. Versioning is another layer of the problem, and it makes the whole thing even more complicated. While reading your comment I had an idea flying around in my head:

@BillMills let's write a blog post :bulb: (collaborative blog post with input from everyone dealing with these issues everyday)

bkatiemills commented 9 years ago

@radaniba that blog post sounds amazing, if people pull that together Mozilla will surely print it :)

But back to the Bioconductor lesson: @kazi11, focusing down on biomaRt to solve common problems sounds good to me. Keeping the scope (relatively) small keeps things digestible in an hour. People are clearly interested, so if you're up for it, fire away!

sjackman commented 9 years ago

@kazi11

I've had a lot of issues with FBgn#s changing extremely rapidly with each new annotation version

I wrote the tool UniqTag to tackle this exact issue. It assigns each sequence (gene or whatever) a unique ID based on the alphabetically smallest unique k-mer in each sequence. The tool is here: https://github.com/sjackman/uniqtag (install with brew install uniqtag). There's an R package as well (install.packages("uniqtag")), and the paper was just published in PLOS ONE.
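The idea described above can be sketched in a few lines of Python. This is an illustration of the stated rule, not the uniqtag implementation: the function names are hypothetical, the fallback to the least frequent k-mer when no k-mer is unique is an assumption, and every sequence is assumed to be at least k bases long.

```python
from collections import Counter
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_tags(seqs: Dict[str, str], k: int) -> Dict[str, str]:
    """Tag each sequence with its alphabetically smallest, least shared k-mer."""
    # Count in how many sequences each k-mer appears.
    counts: Counter = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))
    tags = {}
    for name, seq in seqs.items():
        # Prefer k-mers seen in the fewest sequences (1 means unique),
        # then break ties alphabetically.
        tags[name] = min(kmers(seq, k), key=lambda km: (counts[km], km))
    return tags


# Toy sequences that share a prefix but differ at the end.
seqs = {"gene1": "ACGTAC", "gene2": "ACGTTT"}
tags = assign_tags(seqs, k=3)
print(tags)
```

Because the tag is derived from the sequence content rather than from its position in an annotation release, it can stay the same across releases as long as the chosen k-mer survives.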

jstaf commented 9 years ago

@sjackman - Hmmmm, looks useful... I will definitely give that a whirl next time I need to convert IDs!

bkatiemills commented 9 years ago

So - Bioconductor in July? We've got something on the 2nd, but we're wide open after that!

jstaf commented 9 years ago

As a follow-up on this thread, I think I'm going to teach a lesson on biomaRt here