thecodingdoc / neep

A tool to perform high-throughput survival analysis on splice variants
MIT License

JOSS review #3 #5

Closed SiminaB closed 4 years ago

SiminaB commented 4 years ago

My major concerns fall into 3 categories: 1) The summary, statement of need, state of the field, and performance are not very clearly described. 2) Small edits should be made to the README and the test.sh files in order to guarantee a smooth installation and test. 3) More details should be provided for the data example, in terms of the exact processing that should be used or is used by the software.

To detail each of these categories: 1) The summary, statement of need, state of the field, and performance are not very clearly described.

2) Small edits should be made to the README and the test.sh files in order to guarantee a smooth installation and test.

The first 4 rows/columns of correct_test_output.txt are:

ENST00000310858 | 24.70114 | 71 | 0.00E+00
ENST00000264107 | 23.5003 | 367 | 0.00E+00
ENST00000416789 | 20.87558 | 364 | 2.00E-04
ENST00000409512 | 21.00232 | 367 | 2.00E-04

The first 4 rows/columns of test_output.txt are:

ENST00000310858 | 24.70114 | 71 | 0.00E+00
ENST00000264107 | 23.5003 | 367 | 5.00E-05
ENST00000409512 | 21.00232 | 367 | 1.50E-04
ENST00000416789 | 20.87558 | 364 | 1.50E-04

3) More details should be provided for the data example, in terms of the exact processing that should be used or is used by the software.

majensen commented 4 years ago

Mentioning @majensen to follow this thread.

scwest commented 4 years ago

Thank you for such a thorough and thoughtful review!

Here are the changes I made to the paper to address this review (changes to the code/testing/documentation to follow):

The authors mention that a previous approach "could predict the correct p-values" but that "it is not precise enough for p-value correction procedures that are sensitive to very small p-value changes." It is unclear how this was deduced - presumably the Lausen et al approach was tested somewhere and found lacking?

The original publication gives precision estimates of ~0.001 at its best. For high-throughput data, such as RNA-seq with more than 70,000 genes, this is insufficient precision to differentiate the correct order of p-values. Further, in FDR adjustment, small changes to the lowest p-values may drastically alter the number of significant transcripts. To address this confusion, we reworded our sentence to state that the precision estimated in the original publication is insufficient for FDR p-value correction.
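To make the FDR-sensitivity point concrete, here is a small hypothetical illustration (not NEEP code or output; the counts are made up) of how a ~0.001 imprecision in the smallest p-values can change the number of Benjamini-Hochberg discoveries at a 5% FDR:

```python
import numpy as np

def bh_discoveries(pvals, alpha=0.05):
    """Number of Benjamini-Hochberg rejections at FDR level alpha."""
    p = np.sort(np.asarray(pvals))
    m = p.size
    passed = np.nonzero(p <= alpha * np.arange(1, m + 1) / m)[0]
    return 0 if passed.size == 0 else passed[-1] + 1

rng = np.random.default_rng(0)
m = 70_000                       # e.g. transcripts in a high-throughput panel
pvals = rng.uniform(size=m)      # null features
pvals[:50] = 1e-8                # 50 features with genuinely tiny p-values

imprecise = pvals.copy()
imprecise[:50] += 1e-3           # the ~0.001 precision of the approximation

print(bh_discoveries(pvals))     # tens of discoveries with exact p-values
print(bh_discoveries(imprecise)) # typically zero after the small shift
```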

Additionally, they mention that the NEEP approach constructs "a null distribution in parallel" (not sure what "in parallel" means) by re-sampling permutations. Thus, it looks like it's a bootstrap-type approach, but this should be clarified.

The reviewer is absolutely correct! We reworded "in parallel" to "as a bootstrapping procedure implemented in parallel".
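For readers who want the idea in code form, here is a minimal Python sketch of the procedure (not NEEP's C++ implementation; best_split_statistic below is a toy stand-in that ignores censoring, not the actual maximal log-rank statistic):

```python
import numpy as np
from multiprocessing import Pool

def best_split_statistic(expression, surv_time,
                         fractions=np.arange(0.2, 0.81, 0.05)):
    """Toy stand-in for the maximal log-rank statistic: dichotomize patients
    at each candidate expression split and keep the largest absolute
    difference in mean follow-up time (censoring is ignored here)."""
    order = np.argsort(expression)
    best = 0.0
    for f in fractions:
        k = int(round(f * expression.size))
        low, high = order[:k], order[k:]
        best = max(best, abs(surv_time[low].mean() - surv_time[high].mean()))
    return best

def one_null_draw(seed, surv_time):
    """One draw from the null: a random expression vector is equivalent to a
    random permutation of the patients, so shuffle and recompute."""
    rng = np.random.default_rng(seed)
    fake_expression = rng.permutation(surv_time.size).astype(float)
    return best_split_statistic(fake_expression, surv_time)

def build_null(surv_time, n_samples=10_000, workers=4):
    """Sample the null distribution of the best-split statistic across
    worker processes ("implemented in parallel")."""
    with Pool(workers) as pool:
        draws = pool.starmap(one_null_draw,
                             [(seed, surv_time) for seed in range(n_samples)])
    return np.sort(draws)

def empirical_pvalue(observed, null_draws):
    """P(null >= observed), estimated from the sampled null distribution."""
    n_geq = null_draws.size - np.searchsorted(null_draws, observed, side="left")
    return (n_geq + 1) / (null_draws.size + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    surv_time = rng.exponential(365.0, size=200)   # toy follow-up times
    expression = rng.normal(size=200)              # one transcript
    null = build_null(surv_time, n_samples=2_000)
    print(empirical_pvalue(best_split_statistic(expression, surv_time), null))
```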

The presentation of the state of the field is somewhat confused. For one thing, the log-rank test used for KM approaches is actually the same as the score test applied to a CPH model for a binary variable with no tied survival times (see for example Frank Harrell's comment here https://stats.stackexchange.com/questions/175058/reproducing-the-cox-model-score-test-in-r).

We agree with the reviewer: the log-rank test used with KM is a special case of CPH. In an attempt to simplify the explanation, it seems we misrepresented NEEP! NEEP is not necessarily choosing KM over CPH; it is choosing a binary representation of transcript expression over a continuous one. We address this in the next point.

Additionally, it is not clear to me that the NEEP method actually "solves" either the censoring or proportional hazards assumption limitations. Rather, it deals with the issue of obtaining p-values that show good performance despite the selection of the molecular threshold. This should all be explained in more detail and better organized.

Yes and no. NEEP does not solve the censoring or proportional hazards assumptions directly. Rather, by dealing with the issue of obtaining p-values despite a molecular expression threshold, it allows us to use a binary variable instead of a continuous one. Since molecular expression will follow many patterns of survival significance, we do not know whether the hazard function will be monotonic or proportional. Using a binary variable is certainly not a perfect solution and, as the reviewer states, does not solve the proportional hazards assumption. However, a binary variable handles non-monotonic, non-proportional hazard functions better than a continuous variable (see Table 1 of https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-88). The binary variable is not an ideal answer to this, but we need all the benefit we can get. We address this reviewer comment by restructuring our summary to explicitly discuss binary vs. continuous representations instead of KM vs. Cox PH.

For another, the issue with the independent censoring assumption is not that "those who survive to the end of the study are likely to have a higher rate of survival than those who did not" (among other things, studies cannot generally wait until everyone has the event/dies; also, it depends when individuals enter the study, as they generally do not enter at the same time), but rather that sicker individuals may be more likely to drop out over the course of the study (see for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4282781/).

I agree with the second part. We state "rate of survival" to express the probability density function of the lifetime distribution. For studies with an end date, individuals are censored because of that end date and not their death. For these end-date-censored individuals, their days to death > f(study length, diagnosis date). If we assume that the date of diagnosis is unrelated to the hazard function (which seems reasonable), then we could expect that, if given their true days to death (which were censored), they would have a lifetime distribution with a larger mean survival time than those who were not censored. So, I think that what the reviewer states and what we state express very similar concepts (if not the same). Ultimately, the log-rank power is also dependent on censoring assumptions. To address this comment, I removed the statement that "we get around the first assumption by conducting splits," since in retrospect it cannot be validated as true. We also removed discussion of the censorship assumption, since it is not directly addressed by NEEP's methodology.

There is no evidence presented that the NEEP approach is "statistically valid," as stated in the summary. Typically this would be done by performing simulations under the null and seeing whether the type I error/FDR are controlled.

Because the p-values from NEEP are empirically determined, the type I error/FDR are guaranteed to be controlled. For instance, simulating random molecular expression vectors is equivalent to permuting patients; i.e., the null distribution = random simulations. This is why our method is so precise, but also why it is relatively computationally expensive. To address this comment, we explain how the null distribution is equivalent to random expression vectors (p-values are uniform under the null) and that the type I error and FDR are guaranteed to be controlled. We removed the claim that the method is "statistically valid," since this is implied after mentioning type I error and FDR.
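The claim can also be checked with a quick simulation (again with a generic stand-in statistic rather than NEEP's log-rank): empirical p-values computed against a permutation null are approximately Uniform(0,1) for null features, so the fraction falling below any alpha is about alpha.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_null_draws, n_features = 150, 5_000, 400
surv_time = rng.exponential(365.0, size=n_patients)   # toy follow-up times

def stat(expression):
    # Any statistic works for the argument; here, |correlation of expression
    # ranks with follow-up time| is used as a stand-in.
    ranks = np.argsort(np.argsort(expression))
    return abs(np.corrcoef(ranks, surv_time)[0, 1])

# The null distribution: random expression vectors == permuted patient labels.
null = np.sort([stat(rng.permutation(n_patients)) for _ in range(n_null_draws)])

# Empirical p-values for fresh null features should be ~Uniform(0,1).
pvals = np.empty(n_features)
for i in range(n_features):
    s = stat(rng.normal(size=n_patients))
    pvals[i] = (n_null_draws - np.searchsorted(null, s)) / n_null_draws

print((pvals <= 0.05).mean())   # close to 0.05: type I error is controlled
```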

scwest commented 4 years ago

a) Download the entire repository and move it into my home directory. b) Rename it from "neep-master" to "neep".

I added a note that the installation requires git and added "git clone https://github.com/thecodingdoc/neep.git" as the first command line.

c) Open a terminal and change the permissions to allow reading and writing of the directory using sudo chmod -R 775 ~/neep/

Changing the permissions is typically required only if downloading the repository's master branch as a zip instead of using git. So, the "git clone" line added above should handle the permission issue. To handle cases where the user doesn't want to use git, I added to the README: "If git is not used but the repository zip is downloaded instead, the directory name should be changed to neep and chmod should be used to allow reading and writing."

d) Edit the code in ~/neep/test/test.sh so that the second line starts with "../neep" rather than "neep". This is because the neep executable is located one level up from test.sh.

In the README installation instructions, I specify adding the neep executable to the PATH variable. This way, test.sh can be run even if the user removes the executable from its original location. I'm concerned that if the user runs 'mv neep /usr/bin/' to put it on the PATH, then '../neep' will not work. If they move neep onto their PATH, 'neep' will work no matter what.

I did get small differences between the "test_output.txt" and the "correct_test_output.txt" files. Presumably this is due to the fact that NEEP uses re-sampling permutations?

The reviewer is correct! Every time NEEP is run, the results will differ slightly (with a precision dependent on the size of the null distribution). However, the untied order, the minimum p-values, and the thresholds should never change. Upon further inspection, the two lines that switched places had identical adjusted NEEP p-values, so their order is not differentiated. I've also specified in the README which parts of the file should be the same and which will change: "The output listed in "test_output.txt" should be nearly identical to that in "correct_test_output.txt". The ID, the BEST_STATISTIC, the BEST_SPLIT, and the DIRECTION columns should be identical. The remaining columns are subject to changes from the null distribution bootstrapping procedure." I also mentioned that lines with identical adjusted NEEP p-values may appear in a different order.

However, this is not precise nor is it the exact format of the output file, which includes several other columns.

Another great catch! Since the review began, I had added columns to the output of NEEP corresponding to survival analysis effect sizes. To address this, I added the missing columns to the README. I also described the columns in more detail and noted that each line of the output corresponds to one line of the expression matrix input. I also mention that the output is ordered from lowest to highest adjusted NEEP p-value.

The example input file represents an expression matrix. It appears to be TCGA data, but it is unclear what cancer it's from, what technology was used to obtain the data, and what transformations have been applied. For example, it is clear that no log transformation was applied, given that the values are all non-negative and include 0. However, log transformations are generally crucial for this type of analysis. Well-developed pipelines exist for processing microarray, RNAseq, and other types of transcriptomics data.

Yes, log transformations are crucial if parametric analysis is performed. However, NEEP uses non-parametric survival analysis, and the relative magnitudes of the expression values within an expression vector do not impact the results; only their order matters. In other words, a log transformation doesn't change the results of NEEP.
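A quick toy check of this point (not part of NEEP): a log transform is monotone, so it preserves the patient ordering and therefore every candidate split.

```python
import numpy as np

rng = np.random.default_rng(2)
expression = rng.gamma(2.0, 50.0, size=200) + 1.0   # toy non-negative expression

def split(values, fraction=0.5):
    """Patients in the low vs. high group at the given expression cutoff."""
    order = np.argsort(values)
    k = int(round(fraction * values.size))
    return set(order[:k]), set(order[k:])

# Same groups before and after log2, so the log-rank test input is unchanged.
print(split(expression) == split(np.log2(expression)))   # True
```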

Since the data is not used for downstream biological analysis, I don't think details of its generation ought to be included. I think it would be a distraction, since NEEP works on any high-throughput survival analysis data (PPI, gene expression, transcript expression, etc.) that the user might want to use. Any method used to preprocess the data that doesn't alter the order of expression across the patients won't affect NEEP. However, you are right to suggest that we give credit to TCGA for the data source. To address this comment, we explicitly mention that as long as the patient expression order is not affected, NEEP will give the same results. Further, I give credit to TCGA LUAD as the data source and specify that it has been modified to serve as a test example. I also give a warning that the test data should not be used for downstream biological analyses.

To address the need to clarify what kind of data NEEP takes as input, I specify that any kind of expression data can be used, giving some examples.

SiminaB commented 4 years ago

Thank you for the revision! Most of my comments have been satisfactorily addressed. I do have a few more notes, some of which are due to my improved understanding of the software/method:

1) My understanding is that the Lausen approach is used to allow for valid inference when choosing the optimal cutoff, because using the original p-value is based on making a choice after having seen the data. The statistical reason is the fact that the distribution of p-values under the null is not Uniform(0,1) (you should state that it should be Uniform(0,1), not just uniform, in the paper), but you should provide the intuition that this is because if a researcher picks the cutoff after "looking at the data" (cf. Gail and Green https://www.jstor.org/stable/pdf/2529745.pdf, which is cited in the Lausen paper), this introduces a type I error inflation due to essentially "hidden" multiple testing (basically p-hacking) - so instead one has to look at the distribution of the maximum of the test statistic.

2) Essentially your approach speeds up the Lausen method and applies it to many gene/molecular measurements at once, additionally providing the FDR, correct? I think this is much clearer right now, but should be emphasized a bit more, especially in the Statement of Need, which doesn't clearly explain what the software does. Also, for the FDR, you should cite the method you are using (Benjamini-Hochberg?)

3) In the README, it would be good to explain or give a link on how to add ~/neep to the PATH, e.g.:

export PATH=$PATH:~/neep

4) The output files have 10 columns and only 9 are described in the README - you forgot the BEST_SPLIT column.

5) In the README you state 'The output listed in "test_output.txt" should be nearly identical to that in "correct_test_output.txt". The ID, the BEST_STATISTIC, the BEST_SPLIT, and the DIRECTION columns should be identical. The remaining columns are subject to changes from the null distribution bootstrapping procedure. For rows with tied adjusted NEEP p-values, their order might be different.' Actually, I think all columns except for the p-value and the FDR are identical between test_output.txt and correct_test_output.txt, since those are the only two affected by the permutations. In the future, it may be helpful to have an option for the seed so the results are truly reproducible.

6) I assume that the -n option under "Usage" in the README specifies the number of permutations. I'm not sure what "size of null distribution to sample" means exactly. This should be clarified.

scwest commented 4 years ago

Thanks again for your thoughtful review of my response. I have addressed your remaining comments below.

  1. The Lausen approach is used for valid inference when choosing the optimal cutoff because the optimal p-value is chosen after looking at the data; should say Uniform(0,1); type I error inflation due to "hidden" multiple testing.

Yes-ish. The original reason for this inflation was absolutely researchers' freedom to choose the threshold that fit their needs (one which gave them a significant p-value). However, in the light of >100,000 simultaneous log-rank tests, the problem is no longer a researcher extracting their favorite results after "looking at the data," but rather the consequence of making an arbitrary threshold choice. If a researcher chooses and sticks to a single threshold, their >100,000 tests are indeed statistically valid. However, the researcher must somehow justify this threshold. NEEP is for those situations where a single threshold choice is unjustifiable (arbitrary); in those cases, the results are unacceptably sensitive to the threshold choice (the cited sehgal2015 paper).

To address this comment, I replaced mentions of "uniform" with Uniform(0,1). I added a sentence specifying that a single threshold choice cannot be justified for molecular survival analysis based on experimental design; this is in place of a sentence providing the intuition of a researcher choosing a cutoff after looking at the data. We also added a sentence citing our paper, which showed that small changes in the threshold produce proportionally larger changes in the lists of significant log-rank tests. We believe that the lack of justification for a specific threshold, combined with the point that the threshold choice is sensitive to patient re-sampling and produces proportionally large changes in the results, indicates to the reader the need for a statistically valid survival analysis approach in this scenario.
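As a rough illustration of the inflation being guarded against (a simulation sketch that uses a t-test on follow-up times as a crude stand-in for the log-rank test and ignores censoring): if, for each feature, the cutoff with the smallest p-value is kept, the per-feature type I error rises well above the nominal level, which is exactly the "hidden" multiple testing that the empirical null corrects for.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_patients, n_features = 150, 500
surv_time = rng.exponential(365.0, size=n_patients)  # no real signal anywhere

hits = 0
for _ in range(n_features):
    expression = rng.normal(size=n_patients)          # pure-noise feature
    order = np.argsort(expression)
    best_p = 1.0
    for frac in np.arange(0.2, 0.81, 0.05):           # candidate cutoffs
        k = int(round(frac * n_patients))
        p = ttest_ind(surv_time[order[:k]], surv_time[order[k:]]).pvalue
        best_p = min(best_p, p)
    hits += int(best_p <= 0.05)

print(hits / n_features)   # well above 0.05 when the best cutoff is kept
```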

  2. Our approach speeds up the Lausen method and adds FDR correction; should cite the FDR method.

The Lausen paper tries to provide a solution for constructing a single log-rank test without needing to construct the null distribution every single time. In fact, they use MC simulations to provide the gold standard against which they compare their Gaussian-based estimation method, which they deemed a conservative estimate. Since then, computers have developed enough that we believe an estimation method is no longer necessary, and the problem that we address is one of massively multiple log-rank tests. So, in effect, we are 1) filtering out molecular elements which do not share a common null distribution, 2) using a nearly brute-force approach (used by Lausen to create their gold standard) to generate an unbiased estimate of the null distribution, and 3) using that null to correct the p-values. Ultimately, we are not using Lausen's estimation method, but rather the method they considered their gold standard, which they were trying to avoid because it was not pragmatic to run routinely across survival analysis research.

To address this comment, we added two sentences to the research purpose in the statement of need to explain why and when NEEP is beneficial, specifically in the context of choosing a threshold. We also added a citation for BH p-value correction. Then, I reworded the problem-solved section to directly state how NEEP addresses this problem.

  3. In the README, it would be good to explain or give a link on how to add ~/neep to the PATH

Good point! To address this comment, I did as suggested.

  4. Forgot to describe the BEST_SPLIT column

Whoops. I thought I had them all because I started counting at 0 when reading the file. I've added the missing description.

  5. Actually, I think all columns except for the p-value and the FDR are identical between test_output.txt and correct_test_output.txt, since those are the only two affected by the permutations.

You are absolutely correct. I'll update the README to state that all the non-p-value columns should be identical.

  6. I assume that the -n option under "Usage" in the README specifies the number of permutations. I'm not sure what "size of null distribution to sample" means exactly. This should be clarified.

Yes, that is correct! Ultimately, the size of the null distribution being sampled is the same as the number of permutations. However, I can see how this is needlessly confusing, so I changed the description to "number of bootstrap samples". I'm worried that using "permutations" would also be unclear, since permuting is canonically an alternative to bootstrapping (in our case, we are bootstrapping permutations, since the permutations themselves describe the null).