Closed rbeagrie closed 9 years ago
Working from analysis.Rmd, you didn't commit data/counts.txt
so the second code chunk fails.
Does the Python lesson in #608 have that file?
data/counts.txt
doesn't appear to be in #608 either.
Oops sorry about that! Included the counts.txt file now
OK this PR is also up to date with development in London
Nice exercise!
Hi, thanks for this work!
With regards to my comment: I'm not in bioinformatics, so if you assume prior subject knowledge then some of these can be disregarded. In that case, I suggest stating who the lecture is tailored for at the beginning.
FASTQ
, quality scores
, counts matrix
, and in general the motivation is quite brief. It would also be nice to mention what we are trying to achieve in this lecture.head
, why not just do it there and show the result?We don't need the information on gene position
. Why not? What are we doing instead?We can rename the columns to something a bit more readable.
and the we choose ctl1
and uvb
? Are the abbreviations common enough, could you please remind the reader what they stand for?# Using gsub -- reproducible
I think this should read robust
instead.Comments on Data investigation using base R
grep
, otherwise a few words about what it is would be helpful.apply(countData2[, ctlCols], 1, mean)
, even though they know apply
already it is nice to get a reminder of what we are doing and why, and what the 1
does.Are there any outliers?
?#Find candidate differentially expressed genes
needs a space after #
to render properlylog2
and 0
produce NA/Inf/-Inf is mathematical (and not related to eg R or the dataset).countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0))
should this be an AND instead to avoid dividing by zero later?@BernhardKonrad lots of good points. A couple of additions and explanations:
countData2
and add new columns to it. I was going to rewrite that section yesterday, but didn't get time.Thanks for all the comments and suggestions! My responses to you all are mixed together but I've tried to put them in the order of the lesson itself...
You can see my most recent edits here as I don't think it's been merged yet.
First, a note on audience. I believe that as with the Python capstone, the audience is intended to be biologists with an idea of why we'd do an RNAseq experiment but no / very little programming experience.
Introduction - I removed jargon by shortening sentences rather than re-writing, as I realised it was unnecessary :) I think that covering the full workflow of sequencing analysis, including explaining what FASTQ files are, would be too much for a 1.5-2hr lesson so opted to avoid mentioning them instead.
Is there a way to get head
output from the shell embedded into an Rmd file? If there is I can add this.
Gene position information - 'what are we doing instead?' I don't really know what you mean here. We don't need to know where the genes are for this analysis. I've expanded that sentence a bit which might make it clearer.
Copying countData - my original idea was to add the mean columns to the data.frame and then select only the necessary columns when later converting to a matrix for DESeq (although all the rows with a mean expression of zero will have been removed...), then creating and working on a copy was suggested. Alternatively we could create a new data frame just for the mean data? I've left this bit as is for now as I'm not sure what would be preferable.
Answer to 'are there any outliers?' is kinda subjective (and hard to tell on these plots!) :) this is why we use DESeq! It's more a point for discussion than supposed to have a definitive answer.
countData2 <- subset(countData2, (countData2$ctlMean > 0 | countData2$uvbMean > 0))
No, this was left in to allow for discussion of adding pseudocounts. Maybe adding pseudocounts should be part of the lesson rather than just for discussion?
Explaining Bioconductor - what additional explanation do you think is needed here?
Hi and thanks for these changes, this is already much better!
You can get the output of head
with system("head data/counts.txt", intern = TRUE)
With "what are we doing instead?" I wanted to ask you to motivate and define the next step in the analysis.
Similar with the answer to "are there outliers?": Your point is that it is hard to tell, so I suggest stating that in the document. This way you drive home the point of motivating DESeq. Ideally you refer back to this plot after your improved analysis and plot.
I don't remember what you mean by pseudocounts, but I suggest either making the point here more clear or drop it altogether.
I assume this lesson is in the same state as #608. We'll close for now and this can be revisited after #759.
The idea here is to show learners how to apply novice R lessons to bioinformatics. This is a very hastily prepared pull request for work in progress but contributions and suggestions very welcome!