A capstone example for Bioinformatics in R

rbeagrie commented 10 years ago

The idea here is to show learners how to apply novice R lessons to bioinformatics. This is a very hastily prepared pull request for work in progress but contributions and suggestions very welcome!

drlabratory commented 10 years ago

Working from analysis.Rmd, you didn't commit data/counts.txt so the second code chunk fails.

gvwilson commented 10 years ago

Does the Python lesson in #608 have that file?

sritchie73 commented 10 years ago

data/counts.txt doesn't appear to be in #608 either.

rbeagrie commented 10 years ago

Oops sorry about that! Included the counts.txt file now

rbeagrie commented 10 years ago

OK this PR is also up to date with development in London

naupaka commented 10 years ago

Nice exercise!

BernhardKonrad commented 10 years ago

Hi, thanks for this work!

With regards to my comment: I'm not in bioinformatics, so if you assume prior subject knowledge then some of these can be disregarded. In that case, I suggest stating who the lecture is tailored for at the beginning.

It would be nice to have a one-line comment on what the libraries that you load are going to be useful for.
In the Introduction, I don't know the terms FASTQ, quality scores, counts matrix, and in general the motivation is quite brief. It would also be nice to mention what we are trying to achieve in this lecture.
When you mention that you can look at the file using head, why not just do it there and show the result?
We don't need the information on gene position. Why not? What are we doing instead?
We can rename the columns to something a bit more readable. and the we choose ctl1 and uvb? Are the abbreviations common enough, could you please remind the reader what they stand for?
# Using gsub -- reproducible I think this should read robust instead.
In Exercise 1, can you remind me that the each row is a gene (if that's the case)?

BernhardKonrad commented 10 years ago

Comments on Data investigation using base R

Do the students know grep, otherwise a few words about what it is would be helpful.
I would like to see a comment/reminder on apply(countData2[, ctlCols], 1, mean), even though they know apply already it is nice to get a reminder of what we are doing and why, and what the 1 does.
Please mention that we add new columns (ctlMean and uvbMean) to the data.frame and why (for ggplot, I assume).
From the plot, how do you answer your question Are there any outliers??
#Find candidate differentially expressed genes needs a space after # to render properly
Could you please mention that the reason why log2 and 0 produce NA/Inf/-Inf is mathematical (and not related to eg R or the dataset).
countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0)) should this be an AND instead to avoid dividing by zero later?
The outlier analysis is really nice.

sritchie73 commented 10 years ago

@BernhardKonrad lots of good points. A couple of additions and explanations:

It's not clear to me why we create countData2 and add new columns to it. I was going to rewrite that section yesterday, but didn't get time.
Agreed on the robust instead of reproducible. I believe @gvwilson can back me up here, but most scientists don't find reproducibility compelling enough to change their workflow (i forget whether I read it in one of the SWC papers or heard it in @gvwilson's PyCon talk), so it's better to use another motivating argument.

liz-is commented 9 years ago

Thanks for all the comments and suggestions! My responses to you all are mixed together but I've tried to put them in the order of the lesson itself...

You can see my most recent edits here as I don't think it's been merged yet.

First, a note on audience. I believe that as with the Python capstone, the audience is intended to be biologists with an idea of why we'd do an RNAseq experiment but no / very little programming experience.

Introduction - I removed jargon by shortening sentences rather than re-writing, as I realised it was unnecessary :) I think that covering the full workflow of sequencing analysis, including explaining what FASTQ files are, would be too much for a 1.5-2hr lesson so opted to avoid mentioning them instead.

Is there a way to get head output from the shell embedded into an Rmd file? If there is I can add this.

Gene position information - 'what are we doing instead?' I don't really know what you mean here. We don't need to know where the genes are for this analysis. I've expanded that sentence a bit which might make it clearer.

Copying countData - my original idea was to add the mean columns to the data.frame and then select only the necessary columns when later converting to a matrix for DESeq (although all the rows with a mean expression of zero will have been removed...), then creating and working on a copy was suggested. Alternatively we could create a new data frame just for the mean data? I've left this bit as is for now as I'm not sure what would be preferable.

Answer to 'are there any outliers?' is kinda subjective (and hard to tell on these plots!) :) this is why we use DESeq! It's more a point for discussion than supposed to have a definitive answer.

countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0)) No, this was left in to allow for discussion of adding pseudocounts. Maybe adding pseudocounts should be part of the lesson rather than just for discussion?

Explaining Bioconductor - what additional explanation do you think is needed here?

BernhardKonrad commented 9 years ago

Hi and thanks for these changes, this is already much better!

You can get the output of head with system("head data/counts.txt", intern = TRUE)

With "what are we doing instead?" I wanted to ask you to motivate and define the next step in the analysis.

Similar with the answer to "are there outliers?": Your point is that it is hard to tell, so I suggest stating that in the document. This way you drive home the point of motivating DESeq. Ideally you refer back to this plot after your improved analysis and plot.

I don't remember what you mean by pseudocounts, but I suggest either making the point here more clear or drop it altogether.

jdblischak commented 9 years ago

I assume this lesson is in the same state as #608. We'll close for now and this can be revisited after #759.

swcarpentry / DEPRECATED-bc

A capstone example for Bioinformatics in R #609