Implement Psych-DS Response Downloads

bleonar5 commented 2 months ago

This PR closes #1374

This adds a new response view for the study responses page which downloads the frame data from all responses into a zip file structured according to the rules of psych-DS.

The psych-ds option is listed below the csv frame data options, and the text for the explanatory label is provisional and open to any edits.

A certain intuitive amount of information about the study/lab/user are collected and inserted into global and sidecar metadata files.

The frame data from each individual response are found under the data subdirectory, with keyword labels in the filename for study name, response id, and child id (UUIDs are not used here because of issues with psych-ds filename formatting).

A materials subdirectory is also created, which for now just contains a copy of the study protocol.

Screenshots:

becky-gilbert commented 2 months ago

@bleonar5 looks like there's one failing test:

ERROR: test_can_see_response_views_as_study_admin (exp.tests.test_response_views.ResponseViewsTestCase)

Traceback (most recent call last): File "/home/circleci/project/exp/tests/test_response_views.py", line 176, in setUp reverse( File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/base.py", line 86, in reverse return resolver._reverse_with_prefix(view, prefix, *args, **kwargs) File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/resolvers.py", line 698, in _reverse_with_prefix raise NoReverseMatch(msg) django.urls.exceptions.NoReverseMatch: Reverse for 'study-responses-download-frame-data-zip-psychds' not found. 'study-responses-download-frame-data-zip-psychds' is not a valid view function or pattern name.

I think is just because of capitalization of DS in the name used for the URL path:

path(
        "studies/<int:pk>/responses/all/download_frame_zip_psychDS/",
        StudyResponsesFrameDataPsychDS.as_view(),
        name="study-responses-download-frame-data-zip-psychDS",
    ),

Also looks like there's a linting failure, so I'm guessing you don't have the pre-commit checks set up locally? I can't remember how to set that up (sorry!) but in the meantime I can fix it locally and push the change to this branch, if that's ok with you.

bleonar5 commented 2 months ago

@bleonar5 looks like there's one failing test:

ERROR: test_can_see_response_views_as_study_admin (exp.tests.test_response_views.ResponseViewsTestCase)

Traceback (most recent call last): File "/home/circleci/project/exp/tests/test_response_views.py", line 176, in setUp reverse( File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/base.py", line 86, in reverse return resolver._reverse_with_prefix(view, prefix, *args, **kwargs) File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/resolvers.py", line 698, in _reverse_with_prefix raise NoReverseMatch(msg) django.urls.exceptions.NoReverseMatch: Reverse for 'study-responses-download-frame-data-zip-psychds' not found. 'study-responses-download-frame-data-zip-psychds' is not a valid view function or pattern name.

I think is just because of capitalization of DS in the name used for the URL path:
path(
        "studies/<int:pk>/responses/all/download_frame_zip_psychDS/",
        StudyResponsesFrameDataPsychDS.as_view(),
        name="study-responses-download-frame-data-zip-psychDS",
    ),
Also looks like there's a linting failure, so I'm guessing you don't have the pre-commit checks set up locally? I can't remember how to set that up (sorry!) but in the meantime I can fix it locally and push the change to this branch, if that's ok with you.

I sorted it out and it looks like the tests are passing now. Thanks!!

bleonar5 commented 2 months ago

Thanks Becky for the thorough review and great suggestions! I've made the following changes:

uuids are used in filenames now, but with hyphens removed
eligibility will now take the form of a list to accommodate empty and multi-item values
variableMeasured now removes duplicate values ("key" and "value", btw, are actual column names from the frame data output)
the materials directory now includes both the study protocol and the protocol generator

mekline commented 2 months ago

An initial idea as I'm getting started - would it be possible in materials to also get a json that includes all the study ad details? This is something people occasionally request a way to download.

mekline commented 2 months ago

Okay - I will likely have some things I want to try once this is on staging, but here is my first round of edits/suggestions. Please kick the tires/tell me why I'm wrong about any/all of this! I will not include edits to website copy/instructions on this round, just noting here that I'll want to address that in a subsequent round.

Principles guiding the following suggestions include:

Providing a transparent mapping from the existing data people are used to getting, including some compromises regarding how PII data is presented, but also...
Maintaining good security standards, which mainly impacts being super careful with child_id values, but also clearly labeling the PII we know about
Making the best things for people to do, also the easiest for them to do.

Checkboxes (sorry)

Below the checkboxes for what data elements to include (child birthdate, rounded date, gender, hashed IDs, etc...), and at the start of the Psych-DS section, include the following checkboxes with the following default values:

[ x ] Put personally identifying information (PII) in a separate overview file

(See below for how these files are to be constructed)

There is a big security-usability tradeoff here. Many CHS users are not familiar with merging tables, and therefore they want the child's birth date, name, whatever info they intend to use for any purpose during the study (e.g. making thank you certificates) to all be in the same table right alongside all the other information about that response. For security reasons, we would rather that e.g. exact birthdays and research variables lived in separate places, but mandating this would create a hardship for a fairly large chunk of the researcher user base.

Note: Be careful with implementation here especially around rounded ages. They should be counted as "not PII", but the current data downloads mark the files as "identifiable" when the default checkboxes are selected (which includes rounded age but none of the actual values given in the list: "Files with names, global IDs, birthdates, exact ages at participation, or "additional info" fields are marked as identifiable in the filename.") Something funny is up here.

[ x ] Use shorter filenames

As noted by others, using full UUIDs produces eye-wateringly long filenames. As a compromise, I took a look at a collision calculator (https://devina.io/collision-calculator) and concluded that for most studies (<500 participants), the risk of collision when using just the first 8 digits (short UUID) is probably acceptable. Note to future self: include this link when writing helptext for these check boxes.

Structure of downloads

Metadata

I don't have a local dev environment for the site, so could someone please email me a couple zipped up examples of the resulting data downloads so I can have a look at the metadata? It's okay if the other changes haven't been made yet!

mekline commented 2 months ago

Structure of downloads

(oops, hit send too early)

As proposed by Brian:

dataset_description.json
data/
     study-[id]_response-[id]_child-[id]_data.csv <- equivalent to existing framedata files
     study-[id]_response-[id]_child-[id]_data.json <- a single response object similar to what's in all_responses_identifiable?
materials/
     study_protocol.json

Changes proposed:

Remove child_id from the frame data files (to support secure data sharing practices).
Add "overview" files, which (I'm sorry) work slightly differently from the existing options on the data page.
Add raw data JSON object equivalent to existing "All response data" object (instead of the subject-specific json sidecars- I think minimizing new JSON files is important for usability, but I also haven't seen those sidecars content yet)
Include placeholder spaces for demographic and video data to be put, along with documentation cues.

New structure:

dataset_description.json
README.txt <- Textfile with overview of how this folder is organized and where to learn more about Psych-DS 
data/
     demographic/
          README_demographic.txt <- Textfile saying in short "put the demographic data here and don't share it carelessly"
     framedata-per-response/
          study-[id]_response-[id]_data.csv <- equivalent to existing framedata files
     overview/
          study-[id]_all-responses_data.csv <- (See details below! Also, abusing "all" as a filename key a little bit here, could also call it "overview-responses")
          study-[id]_all-children_pii-identifiable_data.csv
     raw/
          all_responses_identifiable.json <- equivalent to existing file!
          video/
               README_video.txt <- Textfile saying in short "put the video data here and don't share it carelessly"
materials/
     study_ad_info.json 
     study_protocol.json
     study_protocol_generator.js <- (or however this should be presented)

Explanations:

Demographics: Most researchers should leave this data alone and not use it, which is why it is carefully segregated in another tab. The name of this file should eventually be changed to match the Psych-DS pattern we are creating, e.g. study-[id]_all-demographics_pii-identifiable_data.csv, but researchers should continue to have to download them separately and drop into the folder if desired.
Video: Same thing re:dropping in videos if desired, but let's not monkey with anything regarding filenaming or anything else this time. It's in raw/ so should be ignored by the spec anyhow!
Overview files: Right now, we provide two mechanisms for separating/minimizing download of PII, and those mechanisms interact in a sort of confusing way:
- First, the checkboxes at the top give granular control over what to include in the main responses table. I really like the filenaming convention that's used here, btw: If the files are identifiable, they include that in the filename. If those PII columns are not present, we remain silent, because for all we know the participants or researchers have inserted PII into those files some other way. It is up to the researcher to figure that out, we don't label as "deidentified."
- Second, there is a separate data download made available called "Child Data", and if you read the helptext really closely, you'll see that no matter what you checked above, the full scope of PII is included in this file - yikes! This gives the counterintuitive result that you uncheck the birthday box up top, but then download a file that does actually have birthdays in it.

Here is my proposal for how we update this:

(1) Leave the checkbox options alone. See my comment about rounded birthdays above, they are NOT PII and should behave as such.

(2) Provide two alternatives for researchers, one like what they are used to getting from the "All Responses" download, and one with better security qualities (which is set to be the default option)

Option 1: A single file, which you get only if you uncheck "Put personally identifying information (PII) in a separate overview file". This should behave like the current all_responses_identifiable.csv downloads, including the columns checked above and not including those unchecked, all in a single file. Depending on what boxes are checked, this file should be named either study-[id]_all-responses_pii-identifiable_data.csv or study-[id]_all-responses_data.csv

Option 2: Two files with research data and child PII nicely separated. The first file, study-[id]_all-responses_data.csv, for optional columns should include only the checked-by-default, non-PII descriptors. That is, no matter what's checked, birthdays, names, etc. should NOT be inserted into this file. The second file, study-[id]_all-children_pii-identifiable_data.csv, as it does now, has a row per child (not response) and includes just a few columns about that child, not all the consent/study/framedata columns. This file should contain those data elements that are checked in the info table above, and not include whatever columns are unchecked. If the researcher says they don't need child or adult first names, don't go sticking them back in there!

becky-gilbert commented 2 months ago

WRT the checkbox for putting PII in a separate file:

Note: Be careful with implementation here especially around rounded ages. They should be counted as "not PII", but the current data downloads mark the files as "identifiable" when the default checkboxes are selected (which includes rounded age but none of the actual values given in the list: "Files with names, global IDs, birthdates, exact ages at participation, or "additional info" fields are marked as identifiable in the filename.") Something funny is up here.

An issue for the 'identifiable' label being incorrect can be found here: #1424

bleonar5 commented 2 months ago

@mekline What do you think about the possibility of including the full set of default supplementary directories in the psychds download option. By these I mean the ones that the schema recognizes but does not require, such as "products", "analysis", etc? We wouldn't have any downloads provided in these subdirs, just explanatory READMEs, but it could be good for nudging users to understand that they can/should use the psychds folder for their own storage purposes, not just as a download format.

mekline commented 1 month ago

First of all, here are rewrites for the video and main README files. Once we've hashed through the rest of the issues we'll want to do a close reread of the latter to make sure it's totally accurate! Related quibbles/other feedback coming after this...

README_video.md

README_main.md

mekline commented 1 month ago

Other assorted comments on the structure of the download!

Folder name:

How about [STUDYNAME]--psychds/ ? Since it's the project folder I don't think it should include "data"

materials/study_ad.json

This looked much shorter than I expected, but could be because it's a dummy study, or alternately I failed to clarify what was supposed to go here.

Can you confirm that a real study pulls stuff from both the ad and the longer description? (We split this into 2 tabs/areas in researcher view...) Some examples of things I expect to see here include e.g.

the criteria expression
study inclusion/exclusion lists
all text shown to families (compensation statement)
the blurb that shows on the main study search page....
etc.

data/demographic

Hmmm, on reflection, unlike raw/, an MD file here will throw a Psych-DS error, right? Let's leave this folder out and just instruct people to go get that data if they want it. I'll assume that works, and write up the readme file to reflect this, but can revisit if needed.

dataset_description.json comment 1

Can we see it with the more elaborated data dictionary included? (IE full PropertyValue with names and definitions.) The current download system provides options for data dictionaries with definitions filled in for standard columns, and shouty caps where researchers need to provide definitions - let's include by default in Psych-DS.

Use plenty of excessive whitespace/carriage return padding for readability, e.g.:

"variableMeasured": [
  {

    "@type" : "PropertyValue",
    "name": "id",
    
    "description": "Subject identifier, a very very long definition. Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum",
    
  },

  {

    "@type" : "PropertyValue",
    "name": "age",
    
    "description": "Subject age in years.",
  }

]}

dataset_description 2

On reflection, the set of facts we know about the study creator may include more contact info than the researcher intends to share, so we should be careful since people may not open this file up to check!
For creator, include only first and last name (I think our UUID for the researcher is not ever useful outside CHS context, right? And, an ambitious annotator probably wants to put their ORCID here for a unique id)
Maybe leave email but keep as an empty string - this is the email that the researcher used as username/login info, right? Not always going to be an address the researcher wants to share/use for research correspondence.
For affiliation/organization, KEEP the contact info, but I have some questions about what this looks like for non-demo labs:

email - is this the lab's contact email? The study email?

name - this will be the name of the lab, right? Since this often doesn't include the institution, including the parent org is a great idea!

telephone - probably drop this

@id - does this go to the external website link that labs provide? The lab-specific CHS page? Something else?

Can we also add the PI's name somewhere in this file? (We produce the string "contact PI NAME at EMAIL ADDRESS" in various emails, so this info should be present somewhere) Schema.org/Organization suggests we could use "founder" for this (a bit meh, but better than the other 2 options, "sponsor" or "employee")

parentOrganization:name - Can you confirm this property comes up as real-seeming university names for actual labs?

README main:

Is there a reason for the LH padding? I didn't include it in my version, but put it back if you wish :)

mekline commented 1 month ago

Other comments for now!

Let's put the Psych-DS option up top actually!
New proposed text:

"Psych-DS formatted frame data" --> "All frame and overview datasets (Psych-DS)"

"Data(one file per response)" --> "Data (zipped directory)" and just ZIP on the button

"The Psych-DS formatted frame data includes etc. etc..." -->

This download provides a single directory containing both data and materials (i.e. current versions of the study protocol, ad text, etc.). All of the data types listed below are included in this download, though note that the filenames used are slightly different. This directory follows the Psych-DS system for data organization [link].

mekline commented 1 month ago

One more thing to check - we discussed this at various points, but can you confirm that both all_responses and all_children of the data/overview/ files will now produce only those sensitive columns that are checked, and both will be named accordingly with or without identifiable-true strings?

bleonar5 commented 1 month ago

Can you confirm that a real study pulls stuff from both the ad and the longer description? (We split this into 2 tabs/areas in researcher view...) Some examples of things I expect to see here include e.g.

Confirmed that this works for studies with real full study_ad_info

email - is this the lab's contact email? The study email?

this is the "contact_email" field for the studies_lab table. It seems to typically correspond to lab emails rather than PI emails.

@id - does this go to the external website link that labs provide? The lab-specific CHS page? Something else?

this points to the "lab_website" field in the studies_lab table

Can we also add the PI's name somewhere in this file? (We produce the string "contact PI NAME at EMAIL ADDRESS" in various emails, so this info should be present somewhere) Schema.org/Organization suggests we could use "founder" for this (a bit meh, but better than the other 2 options, "sponsor" or "employee")

PI names are available as a field but there is only a lab contact email, no PI specific email. Given this, I think it's best to slot the email under the lab object rather than the PI object.