
raw vs. processed vs. source data #16

Closed mekline closed 5 years ago

mekline commented 5 years ago

Archiving some conversations from the Google Doc here: I think we're settling on a decision to specify raw, source, and processed data like this, but chime in if you disagree!

Below, I'm copying some archived comments from the draft.

mekline commented 5 years ago

@tyarkoni This division makes a lot of sense in the fMRI context, where datasets are usually massive, preprocessing generally produces files as big as the originals, and it's not usually possible to do analysis directly on the "raw" data. For psych, I'm not sure this makes as much sense. Most analyses, including data preprocessing, amount to at most a few hundred lines of code, and few large files are generated. I think we may want to reconsider this and instead introduce a standardized way of structuring results (e.g., put everything under results/ or analyses/). This will also facilitate distribution of datasets that include all results, generated figures, etc., which the current approach would discourage.
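For concreteness, a top-level layout along the lines suggested above might look something like this (directory names are illustrative only, not part of the spec):

```
my-study/
├── data/        # the dataset itself
├── analyses/    # analysis and preprocessing code
└── results/     # generated figures, tables, model outputs, etc.
```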

@mekline The particular workflow I have in mind here is one I encounter fairly often: a researcher has one original dataset, then makes a new Excel sheet/doc where they drop outlier participants, then another with just participants who scored above some threshold, then another with all scores but only the male participants, and so on. This is maybe not a desirable workflow, but it's one that exists, and a good incremental improvement on it is to at least ensure that the original/full version is in a special/protected location.

@jodeleeuw I’d like to see this distinction present. A workflow I often use is to generate intermediate CSV files from R so that I can complete portions of an analysis without rerunning what can be an expensive pre-processing step. I usually stick all my data in /data and then have /data/raw and /data/generated. Sometimes this step matters for computational efficiency, if I am dealing with a very large-N dataset and a moderately expensive computation. (Admittedly, many times this isn't necessary and instead just helps me separate the analysis scripts into manageable chunks.)
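A minimal sketch of that intermediate-CSV workflow in R (file and column names here are hypothetical, for illustration only):

```r
# Read the untouched raw data from data/raw/.
raw <- read.csv(file.path("data", "raw", "trials.csv"))

# Stand-in for the expensive pre-processing step: keep only valid trials
# ("valid" is a hypothetical column name).
cleaned <- raw[raw$valid == TRUE, ]

# Cache the intermediate result in data/generated/ so downstream analysis
# scripts can load it without rerunning the pre-processing.
dir.create(file.path("data", "generated"), recursive = TRUE, showWarnings = FALSE)
write.csv(cleaned, file.path("data", "generated", "trials_cleaned.csv"),
          row.names = FALSE)
```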

@tyarkoni This appears to contradict point (3) above. FWIW, I generally favor allowing inclusion of processed data. But I think we should formalize this a bit more. E.g., one approach would be to reserve data/ for raw data and processed/ for any derivative/aggregated/processed datasets.
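Under that proposal, the top-level split would look roughly like this (directory names are illustrative):

```
my-study/
├── data/          # reserved for raw data
└── processed/     # any derivative / aggregated / processed datasets
```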

@mekline: Terminology is hard and I may be missing the mark, but I'm thinking of 'raw' as data that cannot be converted to spec, e.g. because it is an image rather than a set of variable values. And I'm thinking of 'processed' as any alternate forms of the dataset that get produced after the initial to-spec version. Is this coherent?

@jodeleeuw I think of raw as untouched output from the data-generating source. Sometimes the source generates spec-compliant data, sometimes not.
