[joss review] Software paper

jenzopr commented 4 years ago

This issue is part of the JOSS review.

Software paper review

In their paper, Paez et al. identify unfulfilled needs during the translation of findings from computational scRNA-seq analysis to downstream wet-lab methodologies. Especially for marker gene detection, a critical step in the characterization of cellular (sub)populations, there seems to be no applicable method that transfers knowledge from in silico to the bench. They present scTree, an R package for marker gene detection that employs random forests for variable selection and a classification tree that resembles FACS gating strategies, thereby enhancing interpretability and application of detected marker genes in downstream wet-lab experiments. Overall, the manuscript is well written and does not require major editing for structure or language. They benchmark the quality of their method quite elegantly using recall statistics on test data that was held out during training.

Major issues

Target audience: From the introduction, it becomes somewhat clear that scTree is a tool/package, but the anticipated target audience (e.g. (computational) biologists with knowledge of the R programming language and Seurat) does not become apparent. Please improve wording of the last paragraph of the introduction to specify the target audience.
State of the field: Authors describe a range of tools in the field that fail to directly connect findings [...] to be used downstream. In my view, this comparison is problematic, since many of the tools mentioned do not claim to achieve that. For example, short sequence alignment, barcode calling, preprocessing, clustering and differential gene expression are all needed in the first place in order to be able to transfer findings to the bench. Please improve the wording of this paragraph and add a more in-depth analysis of tools that claim to find biologically relevant and interpretable marker genes.
Figure legends could be enhanced to increase understanding of the figures. E.g. it is not clear to me where the predicted identities in Figure 2 come from, what message they convey and how those predictions integrate into the logic of the benchmark. Further, to measure the similarity between a prediction and the real identity, the adjusted Rand index could be used.
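
To illustrate the adjusted Rand index suggestion, here is a minimal sketch in plain Python. It is not taken from scTree or the paper; the function name `adjusted_rand_index` and the example labels are hypothetical, and the code simply follows the standard pair-counting definition of the index.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Pair-counting adjusted Rand index between two labelings of the same cells."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both labelings must cover the same cells")

    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        # Fewer than two cells: the labelings trivially agree.
        return 1.0

    # Contingency table counts and its row/column margins.
    cells = Counter(zip(true_labels, predicted_labels))
    rows = Counter(true_labels)
    cols = Counter(predicted_labels)

    # Number of cell pairs that fall together in each table entry or margin.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # agreement of a perfect match
    if max_index == expected:
        # Degenerate case, e.g. every cell in a single cluster in both labelings.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical example: real identities vs. identities predicted by a classifier.
real = ["T", "T", "B", "B", "NK", "NK"]
predicted = ["T", "T", "B", "NK", "NK", "NK"]
ari = adjusted_rand_index(real, predicted)
```

A value of 1 means the predicted identities reproduce the real ones exactly (up to relabeling), while values near 0 indicate agreement no better than chance.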

Minor issues

Page 1, paragraph 3: A short explanation of what marker genes are is missing
Page 2, paragraph 3: random forests has been previously

mschubert commented 4 years ago

I agree with the issues @jenzopr raised.

In addition, I would like to see improvements for the following parts:

The software paper is quite verbose and takes a while to come to the point (more akin to a "traditional" journal introduction compared to other JOSS papers). As far as I see it, the main novelty is that scTree suggests a (small) set of genes that separate populations. Why not state this right at the beginning?
I do not see how references to "other people did scRNA-seq alignment [etc.]" are relevant for this software paper. Doesn't it make more sense to focus on what this software does? (i.e., find sortable marker genes between populations)
Based on the previous points, the "summary" part of the introduction could be simplified to something like this?:

Single-cell RNA sequencing (scRNA-seq) is now a commonly used technique to measure the transcriptome of many cells. Clusters of these transcriptomes identify cell populations (ref). There are multiple methods available to identify "marker" genes that separate these populations (refs). However, there are usually too many genes in these lists to directly suggest an experimental follow-up strategy for selecting them from a bulk population (e.g. via FACS (ref)). Here we present scTree, a tool that aims to provide a minimal set of genes to separate populations in scRNA-seq in a follow-up experiment.
Implementation, citations for ranger package: only one is the package citation; are the others relevant for this software paper? If I as a reader am particularly interested in how ranger implements its RF, I'd look at the citations in that paper
Implementation, benchmark: I'm confused about what you are looking at in your benchmark. Doesn't scTree use RF? Why is it listed with 6 different ML methods? (are those available but RF is default?) Where does the t-test and the Wilcox come in? This is not clear from the software paper.
Implementation, benchmark: You make the point that previous marker gene methods are not suitable because they return too many genes to be used in a sorting experiment. Hence, wouldn't a better benchmark be a different marker gene method (& then using e.g. its top 5 genes) vs. scTree to underline this point?
Minor: Reference "Satija-Lab" should probably contain a URL, ideally via https://archive.org/web/ to prevent link rot
Minor: typo "clasification"
Minor: typo "ha been previously"

natallah commented 4 years ago

Thank you @jenzopr and @mschubert for your comments! I have addressed the major and minor issues.

jenzopr commented 4 years ago

From my point of view, the paper gained substantially from your edits! Nice work :+1:
