swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

Domain-specific bioinformatics example #532

Closed rbeagrie closed 8 years ago

rbeagrie commented 10 years ago

A couple of people have said over the past month or so that a domain specific example that involves handling next-generation sequencing data would be a very useful thing to have - so it seemed like a good project for the July sprints next month.

I'm starting with the assumption that this will be a 1-2 hour exercise to sit right at the end of a SWC bootcamp. Since the people who are interested seem to be quite diverse, I would really appreciate it if people could leave some comments as to what might be most useful to them. This would really help us focus down on the right direction to be working in come July.

Specifically, if you would be interested in teaching with a domain specific example, it would be great if you could let us know:

1) Who would your target audience be, beginners, intermediates or something else?
2) Are there any tools you would like to be covered (I'm thinking bedtools, SAMtools, HTSeq, DE-seq etc.)
3) Should we aim more towards giving learners something they can do right now (e.g. making a bigwig file and uploading it to UCSC) or something that showcases the things they will be able to do with a bit more independent study (e.g. RNA-seq analysis)
4) How much should we try to tie in with the relevant set of language lessons, and which language? For example, should this be something that learners can follow having covered only the concepts in novice R, or intermediate python etc?
5) Are there any other parts of the core Software Carpentry curriculum we can be trying to reinforce (e.g. SQL)

If you have any other comments/thoughts/opinions please let us know!

ctb commented 10 years ago

@rbeagrie, we do this for a living in the ANGUS course, albeit in a slightly different format (worked examples followed by free time for people to work with their own data) - see http://ged.msu.edu/angus/ for way too much info. Perhaps we can help by providing some fodder!

variant calling:

De novo mRNAseq and metagenome assembly:

More later.

stephenturner commented 10 years ago

I'm working on developing an RNA-seq workshop that can be completed in a day, similar to what you described (alignment with TopHat and counting with featureCounts in the Linux shell; analysis with DESeq in R). This is very much a work in progress at this point, but you can see what I'm doing here:

https://github.com/stephenturner/teaching/tree/master/rna-seq

The basic idea: I downloaded some data from GEO, mapped everything, analyzed with DESeq, picked some interesting regions on chromosome 4, and then extracted FASTQ files from those regions from the bam files. This way participants will have very small fastq files to work with that should map quickly, and they can index a single chromosome for read mapping so as to reduce RAM requirements to the point that this would work on a VM running on an average laptop.

I'm happy to continue developing in this repo, and/or move this over to the SWC repo (with some help from someone with better git-fu than me, to help move material already under VC in a separate repo to the swcarpentry/bc repo while preserving the commit history).

gvwilson commented 10 years ago

See also http://web.archive.org/web/20110413204752/http:/software-carpentry.org/4_0/shell/exercises/

rbeagrie commented 10 years ago

These are all great examples, and I think it will definitely help to have something to work from during the sprints, rather than just starting from a blank slate. I think they nicely illustrate the two ways I think this could go:

  1. IMO, the examples from @ctb and @stephenturner would work well at the end of an intermediate workshop, where people would hopefully have enough experience to deal with installing the extra software they would need. I'm not sure something like this would fit at the end of a novice bootcamp though. Getting a group of 40 novices to install a mapper, plus two or three other downstream tools, could be a bit of a nightmare, and I wouldn't want to end the two days with a straight demo that people couldn't follow along with themselves - that runs the risk of being a little demotivating.
  2. @gvwilson's example would sit really well at the end of a novice bootcamp as it relies on unix tools that people would mostly have already used. On the other hand, I don't think it would work that well for intermediates, where I would prefer to show them tools written for NGS data that they could actually take into their own analyses.

This is why I think it's super important to nail down who and what this example is going to be for. As far as I'm aware, most of our bootcamps are still aimed at novices so I would lean towards something more like option 2. On the other hand, if most of the people who would use a domain specific example are running intermediate workshops then something more like option 1 would make sense.

If we decide that this is going to be most useful sitting at the end of a novice bootcamp, I think it ought to try to tie in as much of the core curriculum as possible. Following the ANGUS example, I like the idea of having novices clone a repository with some analysis already done and adding some extra bits, as you can reinforce and tie together program design, unix shell and version control all at once... the instructor could even have them issue pull requests against the original repo and code review each other's work (which is great as people could submit/comment even after the bootcamp has finished if they run out of time).

ctb commented 10 years ago

Whoops, forgot the key point: you can't run any of the assembly stuff on most people's laptops. The variant calling could be done, but a virtual machine is probably the best way to go. In practice, I would strongly urge people to use a VM if they're doing anything NGS-y. @stephenturner, is this true for the reference-based RNAseq analysis software too?

stephenturner commented 10 years ago

Working to get RAM requirements under 2G by indexing only a single chromosome and mapping only reads to that chromosome. Should be possible. And yes, despite the limitations the only way I'd teach this in a bootcamp is distributing a VM with software pre-installed.
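
A minimal sketch of the "single chromosome" idea, assuming the reference is a plain multi-record FASTA file and the chromosome of interest is named `chr4`. The function name and the toy genome are made up for illustration; this is not taken from the workshop materials.

```python
import io


def extract_sequence(handle, wanted):
    """Return the FASTA record named `wanted` as text, or '' if it is absent."""
    kept = []
    keeping = False
    for line in handle:
        if line.startswith(">"):
            # The record name is the first word after '>'.
            fields = line[1:].split()
            keeping = bool(fields) and fields[0] == wanted
        if keeping:
            kept.append(line)
    return "".join(kept)


# Toy three-record genome standing in for a real reference (hypothetical data).
GENOME = ">chr3 toy\nACGTAC\nGGT\n>chr4 toy\nTTGACA\nCC\n>chr5 toy\nAAAA\n"
print(extract_sequence(io.StringIO(GENOME), "chr4"), end="")
```

The extracted record could then be written to its own file and handed to whichever mapper is in use for indexing, which is what keeps the index small.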

ctb commented 10 years ago

OK, same strategy I use :)

hdashnow commented 10 years ago

There are already some well developed NGS tutorials and the infrastructure to run them. For example we (vlsci.org.au) and others have developed this material https://genome.edu.au/wiki/Learn for our workshops. Andrew Lonie (http://vlsci.org.au/researcher/alonie) might be able to point you towards other resources that you could use or adapt.

rbeagrie commented 10 years ago

Hmm. I still feel quite strongly that a custom VM is not the best way to go in this specific instance, as it would massively cut down on the number of people that could potentially use this at the end of a novice bootcamp.

I propose a 1.5 hour example that could be done by a learner at the end of a novice bootcamp, involving investigating a FastQ file of unknown origin. I would break it up like this:

First half hour: Exploring FastQ files using commands covered in shell lectures (head, tail, wc etc) - based on the old v4 lesson Greg linked to

Second half hour: Parsing FastQ files using biopython, introducing quality strings etc, based on Will Trimble's biopython lesson from last year's Tufts bootcamp

Last half hour: BLASTing the first 50 or so reads from the FastQ file using BioPython's interface to the BLAST web service to find out what organism the data is from - inspired by Titus' zero entry BLAST stuff. Then a 10 minute wrap up with a brainstorming session on what problems learners might apply this sort of stuff to from their own research.
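
To give a feel for the middle half hour, here is a rough sketch of reading FastQ records and decoding quality strings. It deliberately uses only the standard library rather than Biopython, and it assumes four lines per record and Phred+33 quality encoding; the two reads are made up.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line, which this sketch treats the same way)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


# Made-up two-read example standing in for the "unknown origin" file.
EXAMPLE = """@read1
ACGT
+
IIII
@read2
GGCA
+
!!5I
"""

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, sequence, quality in records:
    scores = phred_scores(quality)
    print(name, sequence, scores, f"mean={mean(scores):.1f}")
```

Learners who have just counted lines with `wc` in the shell can check that the number of records is the line count divided by four.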

ctb commented 10 years ago

A few comments, and then I'll leave you alone --

gvwilson commented 10 years ago

@rbeagrie wrote:

Hmm. I still feel quite strongly that a custom VM is not the best way to go in this specific instance, as it would massively cut down on the number of people that could potentially use this at the end of a novice bootcamp.

I've had poor results using VMs in the classroom: they won't run well on older/slower machines, and people get lost in "wait, what's the keyboard shortcut for pasting when I'm in this window?" On the other hand, @ctb has had good luck getting people to run on cloud VMs - Titus, care to weigh in?

ctb commented 10 years ago

Sure -- all my bootcamps either bring up Amazon VMs for people (as with zero-entry workshops) or I teach people how to bring up their own Amazon VM (in workshops that are longer than a few days). The argument, again, is that people will actually be analyzing their NGS data on remote machines, so taking the time to introduce them to logins & remote command line doesn't harm.

stephenturner commented 10 years ago

@ctb have you ever gotten Amazon to give vouchers or anything for AWS usage? Or do you get participants to enter their billing / CC info? How much does this end up costing for a few hours of compute on a small dataset for a 1-2 day workshop?

rbeagrie commented 10 years ago

@ctb please don't feel like I want you to leave me alone! I pretty much agree entirely with your points. Especially this: "bioinformatics, or at least the NGS-y part of it, is a poor fit with the traditional Software Carpentry approach".

My big picture thinking here is that we want something that would allow someone to do:

$ swc get NGS-capstone

And get one short lesson that can round off a novice bootcamp, and show learners how they can apply what they have learned to their own 'NGS-y' research. It's entirely possible that there is nothing you can teach in a couple of hours that is best practice, relies only on the core software we ask people to install as part of a novice bootcamp and that doesn't require a VM. If so, that's fine. I definitely agree that in 99% of cases, if you are iterating over raw sequencing reads in python (or any other language) you are probably "doing it wrong".

One possible compromise would be a set of "work through these yourself" challenges with samtools and bedtools. They are widely used tools, and if you use them correctly they allow you to accomplish a lot on your own laptop without much RAM. The (big) compromise here is that helpers and instructors will likely spend most of the lesson dealing with installation headaches - hence why it would have to be diy challenges. The upside is that everyone goes away with a versatile toolset that will actually help them get stuff done.

ctb commented 10 years ago

Amazon is usually happy to provide $100 vouchers per student. For a 1-2 day workshop things usually cost less than $5; for a semester long course, most students don't go over $100.

ctb commented 10 years ago

@rbeagrie ;). I like the idea of the Web BLAST; maybe use the whole-proteome-vs-whole-proteome bit (BLAST ecoli x salmonella; output CSV of matches) from the zero entry bootcamps? That would be an excellent motivator for biologists to understand how powerful this is.

stephenturner commented 10 years ago

Great discussion. One point @rbeagrie:

The (big) compromise here is that helpers and instructors will likely spend most of the lesson dealing with installation headaches

I'm developing some material for a workshop here that I'll eventually roll into this repo, but this is a compromise that I can't make when teaching without TAs/helpers. I can't see any way out of a desktop or cloud VM with at least a handful of tools pre-installed.

jdblischak commented 10 years ago

Thanks for organizing this, @rbeagrie. Here are my thoughts:

1) Who would your target audience be, beginners, intermediates or something else?

Any bootcamp pitched at biologists, no matter how it is advertised, will likely attract many novices. I think it would be best to just prepare for this.

2) Are there any tools you would like to be covered (I'm thinking bedtools, SAMtools, HTSeq, DE-seq etc.)

For the purpose of doing something interesting while reinforcing skills learned during a bootcamp, I think a strong focus on bedtools is the best option.

4) How much should we try to tie in with the relevant set of language lessons, and which language? For example, should this be something that learners can follow having covered only the concepts in novice R, or intermediate python etc?

I think good prerequisites would be novice shell and then either novice R or novice Python. There should be many bootcamps that cover this material and thus be able to use this lesson at the end.

Bigger picture, what is the goal of this lesson? Seeing as it will only take place for a few hours at the end of an already information crammed SWC bootcamp, I don't think it is feasible for the goal to be 'Teach attendees to perform an RNA-seq analysis from fastq files to list of DE genes.' This is just too far out of scope and would have to be so rushed that it would not be covered in any more depth than if the attendees just read through the basic documentation themselves. I think a better goal would be 'Show students how the basic computing skills learned during the bootcamp can be used for routine bioinformatics tasks.' This could be accomplished by piping together some bioinformatics command line tools with unix utilities, and then reading the result into Python or R and creating a quick visualization.
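
As a sketch of the last step (reading the result of a shell pipeline into Python for a quick look), the following assumes the pipeline produced tab-separated lines of chromosome, start, end and an overlap count. That column layout and the data are assumptions for illustration, and a text bar chart stands in for a real plot.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for the output of a shell pipeline; the column layout
# (chrom, start, end, overlap count) is an assumption for illustration.
PIPELINE_OUTPUT = "chr1\t100\t200\t3\nchr1\t500\t900\t0\nchr2\t50\t150\t5\nchr3\t10\t20\t1\n"


def overlaps_per_chromosome(handle):
    """Sum the last-column counts for each chromosome in a tab-separated stream."""
    totals = Counter()
    for chrom, _start, _end, count in csv.reader(handle, delimiter="\t"):
        totals[chrom] += int(count)
    return totals


def text_bar_chart(totals):
    """Render one line per chromosome, with one '#' per counted overlap."""
    return [f"{chrom:<6}{'#' * total} ({total})" for chrom, total in sorted(totals.items())]


totals = overlaps_per_chromosome(io.StringIO(PIPELINE_OUTPUT))
print("\n".join(text_bar_chart(totals)))
```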

ctb commented 10 years ago

I agree with everything that @jdblischak says with one very important exception: most of the novice biologists I interact with have neither grounding nor specific motivation for learning anything Software Carpentry, and all the feedback I've gotten suggests that starting with a traditional SWC topic set (shell, Python) is total fail for novice biologists. I've heard from others with similar experiences.

rbeagrie commented 10 years ago

@ctb hmm not sure I can agree with you 100% there. I know several novice biologists who've been to "zero entry" bioinformatics workshops that didn't cover the shell, and they were completely lost. I take your point that a completely "off the shelf" SWC bootcamp may not be the best approach with novice biologists. However, given that people are offering these types of workshops, I think it's worthwhile giving the best demonstration we can of how the skills people have learned can be applied to their research. Considering all the caveats we've been discussing, I tend to agree with @jdblischak that bedtools is probably the best option.

stephenturner commented 10 years ago

@jdblischak , @rbeagrie : I also like the idea of teaching bedtools, but are we at risk of thinking "what do we want to teach" instead of "what do students want to learn?" I'd wager that a good number of average "biologists" (however we're defining that here) would be much more interested in some kind of finished product data analysis that's relevant to their field of study - an assembled genome, a list of differentially regulated genes from an RNA-seq experiment, annotated variants, etc. Bedtools is an infinitely useful and indispensable tool in any bioinformatician's toolbox, but I'm not completely convinced wrapping up the day with teaching a biologist how to munge genomic intervals will have a motivating and lasting impact unless that biologist eventually gets more involved in a bioinformatics lab.

rbeagrie commented 10 years ago

So are we saying that we should be discouraging people from running SWC bootcamps aimed at novice biologists, in favour of a proper data analysis (RNA-seq or whatever) workshop?

stephenturner commented 10 years ago

I don't think they're mutually exclusive. A fair number of biologists came to the two SWC bootcamps we had here, and I believe they got a lot out of it. But perhaps a domain-specific bioinformatics exercise might be better if it resulted in the participant reaching some analytical endpoint - assembly, gene list, etc (easier said than done, admittedly).

rbeagrie commented 10 years ago

I guess my opinion is that they are mutually exclusive. I'm not convinced you can teach someone something meaningful about differential expression analysis in only a couple of hours. In fact I can't think of any analytical endpoint that can be fully explored in less than a day...

wking commented 10 years ago

On Thu, Jun 12, 2014 at 01:41:38PM -0700, Rob Beagrie wrote:

I'm not convinced you can teach someone something meaningful about differential expression analysis in only a couple of hours.

Teaching the science behind the analysis (and when that particular analysis makes sense) is probably out of scope for a two-day workshop (and certainly is if you'll be discussing other things like a stock SWC workshop). Just because folks won't be taking it back to their lab unaltered doesn't mean that a short capstone example is a bad idea.

rbeagrie commented 10 years ago

Well I certainly agree that RNA-seq would be a better motivator if we can teach something in 2 hours. @stephenturner how long would you normally set aside to teach the RNA-seq example you posted above?

stephenturner commented 10 years ago

I think we all agree that we can't go into any kind of detail on theory/motivation behind any biological data analysis. But I think we can wrap up with some practical example in 1-2 hrs. Couple ideas:

  1. If the bootcamp covers R instead of Python, we could start with a count matrix and run it through DESeq. There are only a handful of commands, since DESeq2 now wraps the entire pipeline in a single DESeq function.
  2. If the bootcamp didn't cover R but covered Python, we could do something with BEDTools like someone mentioned earlier. One idea: given a BED file of some interesting regions (ChIP peaks, dysregulated genes, etc.) and another set of regions, say, some ENCODE regions of interest, you want to ask whether there is significant over-representation of ENCODE features among your "interesting" features (this is actually a common question, not just some toy example). You can teach a little bedtools intersecting, discuss permutation theory, then set up a Python program to do a few thousand bedtools shuffles, keep a log of your bedtools intersect results, and end up calculating a permutation p-value.
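
A rough sketch of the permutation idea in the second option. To stay self-contained it does not call bedtools at all: `count_overlaps` and `shuffle_features` are pure-Python stand-ins for what `bedtools intersect` and `bedtools shuffle` would do, restricted to one hypothetical chromosome, and all intervals are made up.

```python
import random


def count_overlaps(features, regions):
    """Count features that overlap at least one region (half-open intervals)."""
    return sum(
        any(f_start < r_end and r_start < f_end for r_start, r_end in regions)
        for f_start, f_end in features
    )


def shuffle_features(features, chrom_length, rng):
    """Move each feature to a uniformly random position, keeping its length."""
    shuffled = []
    for start, end in features:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def permutation_p_value(features, regions, chrom_length, n_permutations=1000, seed=0):
    """Return (observed overlap count, permuted counts, one-sided p-value)."""
    rng = random.Random(seed)
    observed = count_overlaps(features, regions)
    permuted = [
        count_overlaps(shuffle_features(features, chrom_length, rng), regions)
        for _ in range(n_permutations)
    ]
    # Add one to numerator and denominator so the estimate is never exactly zero.
    as_extreme = sum(count >= observed for count in permuted)
    p_value = (as_extreme + 1) / (n_permutations + 1)
    return observed, permuted, p_value


# Toy data on one hypothetical 10,000 bp chromosome.
interesting = [(100, 200), (250, 300), (420, 480), (900, 950)]
encode_like = [(50, 500), (880, 1000)]

observed, permuted, p_value = permutation_p_value(interesting, encode_like, 10_000)
print(f"observed={observed}, p={p_value:.4f}")
```

In a real lesson the two stand-in functions would be replaced by calls to the bedtools commands, with the permuted counts logged to a file between steps.
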

ctb commented 10 years ago

The bootcamp could also end with some plotting -- MA plots or other common differential expression plots. See bottom of

https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/8-differential-expression.html

for one of our examples.
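
For readers who have not met an MA plot: each gene gets an A value (mean log2 expression across the two conditions) and an M value (log2 fold change). The sketch below computes those from raw counts, assuming a pseudocount of 1 to avoid taking the log of zero. It is only the textbook definition on made-up counts, not what the linked protocol or any particular package does internally.

```python
import math


def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return (A, M) pairs: A is mean log2 expression, M is log2 fold change (b over a)."""
    points = []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        points.append(((log_a + log_b) / 2, log_b - log_a))
    return points


# Made-up counts for four genes in two conditions.
control = [7, 15, 0, 63]
treated = [31, 15, 3, 15]

for gene, (a_value, m_value) in zip("ABCD", ma_values(control, treated)):
    print(f"gene{gene}: A={a_value:.2f} M={m_value:+.2f}")
```

Plotting M against A for every gene gives the familiar picture, with unchanged genes scattered around M = 0.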

stephenturner commented 10 years ago

Agreed. With R/DESeq, plotting is easy using the functions that come with the DESeq2 package (MA plots, volcano plots, etc.). The vignette also gives code to produce others. I imagine something similar could be done with the hypothetical BEDTools example I gave earlier: something like plotting a histogram of permutation intersect results with the actual result way out on the tail.

rbeagrie commented 10 years ago

OK great, I'm very happy with this plan! We can use @stephenturner's RNA-seq example as a starting point on the R side. I'll have a look for any bedtools examples we can build on from the python side, unless anyone can suggest any?

stephenturner commented 10 years ago

Note: those materials are very much a work in progress. Hoping to make some progress on that front in the next couple of weeks.

rbeagrie commented 10 years ago

I've had a look around the net for a BEDTools tutorial that we can build on, and I like this one from Aaron Quinlan's CSHL course: https://github.com/arq5x/tutorials/blob/master/bedtools.md

I would propose to keep up to "Counting the number of overlapping features.", then add a section explaining bedtools shuffle. We can have learners write or correct a bash script that shuffles 1000 times, then a python script to read the results and plot the distribution of permuted overlaps compared to the real overlap.
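
A minimal sketch of what the Python half might look like, assuming the bash script writes one overlap count per line to a results file. That file format, the observed value and the numbers are all hypothetical, and a text histogram stands in for a proper plot so the example needs nothing beyond the standard library.

```python
import io
from collections import Counter


def read_counts(handle):
    """Read one integer overlap count per non-empty line."""
    return [int(line) for line in handle if line.strip()]


def text_histogram(permuted, observed):
    """One line per count value, from smallest to largest, flagging the observed value."""
    tally = Counter(permuted)
    lowest = min(min(permuted), observed)
    highest = max(max(permuted), observed)
    lines = []
    for value in range(lowest, highest + 1):
        marker = "  <- observed" if value == observed else ""
        lines.append(f"{value:>4} {'#' * tally[value]}{marker}")
    return lines


# Made-up stand-in for the file the bash loop would produce (only 9 shuffles here).
SHUFFLE_RESULTS = "2\n3\n1\n2\n4\n2\n3\n1\n2\n"
permuted = read_counts(io.StringIO(SHUFFLE_RESULTS))
observed = 6  # hypothetical real overlap count

print("\n".join(text_histogram(permuted, observed)))
```

With a real 1000-shuffle file, the same structure would show the permuted overlaps clustered together and the real overlap sitting out on the tail.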

gvwilson commented 8 years ago

Data Carpentry is doing this.