I've previously provided a summary of thoughts on reproducible data analysis and terminology.
In terms of aspects of reproducibility, I wrote about different broad aims:
- Reproducibility
    - Can the analyses easily be re-run to transform the raw data into the final report, with the same results?
- Correctness
    - Is the data analysis consistent with the intentions of the researcher?
    - Are the intentions of the researcher correct?
- Openness
    - Transparency, accountability: can others check and verify the accuracy of the analyses performed?
- Extensibility, modifiability
    - Can others modify, extend, reuse, and mash up the data, the analyses, or both to create new research works?
Are the methods used unambiguously communicated? This often involves the use of code, but mathematics is another relatively unambiguous language relevant to data analysis. Natural language can also be used, but for certain purposes it is often ambiguous (e.g., exactly what cluster analysis algorithm was used; exactly what was done with missing data).
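To illustrate the point about ambiguity, here is a hedged sketch in R (the data object `dat` and all settings are hypothetical): the code states exactly how missing data were handled and exactly which clustering algorithm and settings were used, in a way that a prose description often does not.

```r
# Hypothetical sketch; dat is assumed to be a numeric data frame.
set.seed(1234)             # fixed seed so the clustering is repeatable
complete <- na.omit(dat)   # exactly what was done with missing data:
                           # listwise deletion of incomplete cases
clusters <- kmeans(complete, centers = 3, nstart = 25)
                           # exactly which algorithm and settings were used
```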
A basic principle of quality systems is that processes are in some way documented.
The one-click build really appeals to me.
The quicker it is to reproduce a set of analyses, the easier it is to verify that the final result is consistent with the procedure.
In the above quote I distinguish between whether the analysis was performed as the researcher intended and whether those intentions were themselves correct.
A first level of reproducible data analysis is to ensure that the intended procedure was applied.
Quality assurance is partially about ensuring that analyses were performed as intended. But what are the different types of intended analyses?
One-click builds facilitate both verifying and achieving quality.
However, adopting a one-click build approach (such as that facilitated by R and knitr) does not on its own ensure a high-quality product, and with effort it is possible to produce a high-quality product using more manual approaches.
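To sketch what a one-click build can look like (assuming a hypothetical `report.Rmd` that reads the raw data and contains all of the analysis code), the entire document can be regenerated with a single command:

```r
# One-click build: regenerate the whole report from the raw data.
# Assumes a hypothetical report.Rmd containing all analysis code.
library(knitr)
knit2html("report.Rmd")  # writes report.html with all figures and tables
```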
Consider the least reproducible approach. Data is loaded into GUI statistical software. The data is irreversibly transformed and manipulated in various ways. Graphs, tables, and results at the end of this process are copied and pasted into a document for reporting purposes. Further analyses and transformations of the data are performed, and then also incorporated. The analyst can't remember exactly what they did. The analyst performs a minimum of checks and balances to even see whether what they are doing makes sense. If a fatal transformation was performed, the analyst is unlikely to know.
A second approach involves saving the SPSS syntax used to perform the analyses. Results are then typically copied and pasted into programs like Word, or perhaps first into Excel for formatting and then into Word.
The syntax does permit a degree of reproducibility, although within this approach there is a wide range in the quality of syntax organisation and commentary. In better cases, the syntax will document all transformations of the data (e.g., removing cases, creating new variables, removal of outliers, any imputing of missing data, etc.). In worse cases, it is disorganised and incomplete.
I have observed this approach being applied a lot in psychology.
However, some analyses in SPSS are too complex to perform easily with syntax: for example, you may need to post-process some output, or use other software to compute a value. Some processes require moving between programs and applying multiple manual steps.
A semi-reproducible process involves creating a reproducible report and then manually incorporating the results into a static document. This second step is often performed by another person.
For example, I once did an analysis for another academic where I analysed the data and produced a report using Sweave. I sent this to the academic and he requested additional analyses. Once this iterative process was complete, he then incorporated the analyses into the write-up of a report. The initial phase was reproducible, but there were a few manual steps in incorporating the graphs, tables, and textual results into the final document.
The process is often necessary where a collaborator is driving the project and they wish to use a document preparation system such as Microsoft Word. It can also be necessary where the publishing system requires an extensive set of stylistic elements that are difficult to produce with plain text formats.
It works reasonably well when the move from analysis to write-up is sequential. However, for most projects I find that analysis and write-up iterate extensively. Even once an article is submitted to a journal, reviewers may come back and request changes. Of course, you can try to keep track of whether any changes require previous analyses to be updated, but this can be error prone.
This approach also works reasonably well where analyses play a fairly small part in the overall document.
In a fully reproducible approach, the final product is produced entirely using code: inputting the data, transforming it, running the analyses, and incorporating the figures, tables, and text into the document.
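As a minimal sketch of such a document (the file, variable, and chunk names here are all hypothetical), an R Markdown source scripts every step from raw data to figure:

````rmd
```{r load-data}
# All analysis starts from the raw data file.
dat <- read.csv("data/raw.csv")
```

```{r transform}
# Transformations are scripted rather than done by hand, so they are
# documented and repeatable: drop incomplete cases, standardise the score.
dat <- subset(dat, !is.na(score))
dat$score_z <- as.numeric(scale(dat$score))
```

```{r score-histogram, fig.cap="Distribution of standardised scores"}
# The figure is generated and placed by the build, not pasted in by hand.
hist(dat$score_z, main = "", xlab = "Standardised score")
```
````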
Even in this case, there are degrees of, and limits to, reproducibility. To take one limit: documents that report analyses (e.g., theses, journal articles) are highly interconnected. Much of what is written is either directly or indirectly dependent on the results of the analyses. In a direct sense, there are sentences that summarise the results in a table or a graph, or that summarise the significance or direction of an effect. It would generally be too much work to have such text conditionally displayed based on the results of the study.
However, once you break away from full reproducibility, there is always the risk that results will change as a result of tweaking preliminary analyses, and that the conditional text will need to be updated.
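One low-cost mitigation, shown as a hedged sketch below (the model and variable names are hypothetical placeholders), is to at least report the numbers inline, so the values in such sentences stay in sync with the analyses:

````rmd
```{r model}
# Hypothetical model; group and score_z are placeholder names.
fit <- lm(score_z ~ group, data = dat)
b <- round(coef(fit)[2], 2)
```

The estimated group difference was `r b` standard deviations.
````

Directional and interpretive claims in the surrounding prose still need to be checked by hand, which is the residual risk described above.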
There are several implications of this. It is good to keep the full build quick to re-run, and to minimise, or at least keep track of, text whose accuracy depends on the results of the analyses.
I wanted to conceptualise reproducible data analysis in a broader context.