lookit / lookit-api

Codebase for Lookit v2 and Experimenter v2. Includes an API. Docs: http://lookit.readthedocs.io/
https://lookit.mit.edu/
MIT License
10 stars 18 forks source link

Implement Psych-DS Response Downloads #1422

Closed bleonar5 closed 1 month ago

bleonar5 commented 2 months ago

This PR closes #1374

This adds a new response view for the study responses page which downloads the frame data from all responses into a zip file structured according to the rules of psych-DS.

The psych-ds option is listed below the csv frame data options, and the text for the explanatory label is provisional and open to any edits.

A certain intuitive amount of information about the study/lab/user are collected and inserted into global and sidecar metadata files.

The frame data from each individual response are found under the data subdirectory, with keyword labels in the filename for study name, response id, and child id (UUIDs are not used here because of issues with psych-ds filename formatting).

A materials subdirectory is also created, which for now just contains a copy of the study protocol.

Screenshots:

Screenshot 2024-06-10 at 2 39 33 PM Screenshot 2024-06-10 at 2 39 57 PM
becky-gilbert commented 2 months ago

@bleonar5 looks like there's one failing test:

ERROR: test_can_see_response_views_as_study_admin (exp.tests.test_response_views.ResponseViewsTestCase)

Traceback (most recent call last): File "/home/circleci/project/exp/tests/test_response_views.py", line 176, in setUp reverse( File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/base.py", line 86, in reverse return resolver._reverse_with_prefix(view, prefix, *args, **kwargs) File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/resolvers.py", line 698, in _reverse_with_prefix raise NoReverseMatch(msg) django.urls.exceptions.NoReverseMatch: Reverse for 'study-responses-download-frame-data-zip-psychds' not found. 'study-responses-download-frame-data-zip-psychds' is not a valid view function or pattern name.

I think is just because of capitalization of DS in the name used for the URL path:

path(
        "studies/<int:pk>/responses/all/download_frame_zip_psychDS/",
        StudyResponsesFrameDataPsychDS.as_view(),
        name="study-responses-download-frame-data-zip-psychDS",
    ),

Also looks like there's a linting failure, so I'm guessing you don't have the pre-commit checks set up locally? I can't remember how to set that up (sorry!) but in the meantime I can fix it locally and push the change to this branch, if that's ok with you.

bleonar5 commented 2 months ago

@bleonar5 looks like there's one failing test:

ERROR: test_can_see_response_views_as_study_admin (exp.tests.test_response_views.ResponseViewsTestCase)

Traceback (most recent call last): File "/home/circleci/project/exp/tests/test_response_views.py", line 176, in setUp reverse( File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/base.py", line 86, in reverse return resolver._reverse_with_prefix(view, prefix, *args, **kwargs) File "/home/circleci/.cache/pypoetry/virtualenvs/lookit-api-3aSsmiER-py3.9/lib/python3.9/site-packages/django/urls/resolvers.py", line 698, in _reverse_with_prefix raise NoReverseMatch(msg) django.urls.exceptions.NoReverseMatch: Reverse for 'study-responses-download-frame-data-zip-psychds' not found. 'study-responses-download-frame-data-zip-psychds' is not a valid view function or pattern name.

I think is just because of capitalization of DS in the name used for the URL path:

path(
        "studies/<int:pk>/responses/all/download_frame_zip_psychDS/",
        StudyResponsesFrameDataPsychDS.as_view(),
        name="study-responses-download-frame-data-zip-psychDS",
    ),

Also looks like there's a linting failure, so I'm guessing you don't have the pre-commit checks set up locally? I can't remember how to set that up (sorry!) but in the meantime I can fix it locally and push the change to this branch, if that's ok with you.

I sorted it out and it looks like the tests are passing now. Thanks!!

bleonar5 commented 2 months ago

Thanks Becky for the thorough review and great suggestions! I've made the following changes:

mekline commented 2 months ago

An initial idea as I'm getting started - would it be possible in materials to also get a json that includes all the study ad details? This is something people occasionally request a way to download.

mekline commented 2 months ago

Okay - I will likely have some things I want to try once this is on staging, but here is my first round of edits/suggestions. Please kick the tires/tell me why I'm wrong about any/all of this! I will not include edits to website copy/instructions on this round, just noting here that I'll want to address that in a subsequent round.

Principles guiding the following suggestions include:

Checkboxes (sorry)

Below the checkboxes for what data elements to include (child birthdate, rounded date, gender, hashed IDs, etc...), and at the start of the Psych-DS section, include the following checkboxes with the following default values:

[ x ] Put personally identifying information (PII) in a separate overview file

(See below for how these files are to be constructed)

There is a big security-usability tradeoff here. Many CHS users are not familiar with merging tables, and therefore they want the child's birth date, name, whatever info they intend to use for any purpose during the study (e.g. making thank you certificates) to all be in the same table right alongside all the other information about that response. For security reasons, we would rather that e.g. exact birthdays and research variables lived in separate places, but mandating this would create a hardship for a fairly large chunk of the researcher user base.

Note: Be careful with implementation here especially around rounded ages. They should be counted as "not PII", but the current data downloads mark the files as "identifiable" when the default checkboxes are selected (which includes rounded age but none of the actual values given in the list: "Files with names, global IDs, birthdates, exact ages at participation, or "additional info" fields are marked as identifiable in the filename.") Something funny is up here.

[ x ] Use shorter filenames

As noted by others, using full UUIDs produces eye-wateringly long filenames. As a compromise, I took a look at a collision calculator (https://devina.io/collision-calculator) and concluded that for most studies (<500 participants), the risk of collision when using just the first 8 digits (short UUID) is probably acceptable. Note to future self: include this link when writing helptext for these check boxes.

Structure of downloads

Metadata

I don't have a local dev environment for the site, so could someone please email me a couple zipped up examples of the resulting data downloads so I can have a look at the metadata? It's okay if the other changes haven't been made yet!

mekline commented 2 months ago

Structure of downloads

(oops, hit send too early)

As proposed by Brian:

dataset_description.json
data/
     study-[id]_response-[id]_child-[id]_data.csv <- equivalent to existing framedata files
     study-[id]_response-[id]_child-[id]_data.json <- a single response object similar to what's in all_responses_identifiable?
materials/
     study_protocol.json

Changes proposed:

New structure:

dataset_description.json
README.txt <- Textfile with overview of how this folder is organized and where to learn more about Psych-DS 
data/
     demographic/
          README_demographic.txt <- Textfile saying in short "put the demographic data here and don't share it carelessly"
     framedata-per-response/
          study-[id]_response-[id]_data.csv <- equivalent to existing framedata files
     overview/
          study-[id]_all-responses_data.csv <- (See details below! Also, abusing "all" as a filename key a little bit here, could also call it "overview-responses")
          study-[id]_all-children_pii-identifiable_data.csv
     raw/
          all_responses_identifiable.json <- equivalent to existing file!
          video/
               README_video.txt <- Textfile saying in short "put the video data here and don't share it carelessly"
materials/
     study_ad_info.json 
     study_protocol.json
     study_protocol_generator.js <- (or however this should be presented)

Explanations:

Here is my proposal for how we update this:

(1) Leave the checkbox options alone. See my comment about rounded birthdays above, they are NOT PII and should behave as such.

(2) Provide two alternatives for researchers, one like what they are used to getting from the "All Responses" download, and one with better security qualities (which is set to be the default option)

Option 1: A single file, which you get only if you uncheck "Put personally identifying information (PII) in a separate overview file". This should behave like the current all_responses_identifiable.csv downloads, including the columns checked above and not including those unchecked, all in a single file. Depending on what boxes are checked, this file should be named either study-[id]_all-responses_pii-identifiable_data.csv or study-[id]_all-responses_data.csv

Option 2: Two files with research data and child PII nicely separated. The first file, study-[id]_all-responses_data.csv, for optional columns should include only the checked-by-default, non-PII descriptors. That is, no matter what's checked, birthdays, names, etc. should NOT be inserted into this file. The second file, study-[id]_all-children_pii-identifiable_data.csv, as it does now, has a row per child (not response) and includes just a few columns about that child, not all the consent/study/framedata columns. This file should contain those data elements that are checked in the info table above, and not include whatever columns are unchecked. If the researcher says they don't need child or adult first names, don't go sticking them back in there!

becky-gilbert commented 2 months ago

WRT the checkbox for putting PII in a separate file:

Note: Be careful with implementation here especially around rounded ages. They should be counted as "not PII", but the current data downloads mark the files as "identifiable" when the default checkboxes are selected (which includes rounded age but none of the actual values given in the list: "Files with names, global IDs, birthdates, exact ages at participation, or "additional info" fields are marked as identifiable in the filename.") Something funny is up here.

An issue for the 'identifiable' label being incorrect can be found here: #1424

bleonar5 commented 2 months ago

@mekline What do you think about the possibility of including the full set of default supplementary directories in the psychds download option. By these I mean the ones that the schema recognizes but does not require, such as "products", "analysis", etc? We wouldn't have any downloads provided in these subdirs, just explanatory READMEs, but it could be good for nudging users to understand that they can/should use the psychds folder for their own storage purposes, not just as a download format.

mekline commented 1 month ago

First of all, here are rewrites for the video and main README files. Once we've hashed through the rest of the issues we'll want to do a close reread of the latter to make sure it's totally accurate! Related quibbles/other feedback coming after this...

README_video.md

README_main.md

mekline commented 1 month ago

Other assorted comments on the structure of the download!

Folder name:

How about [STUDYNAME]--psychds/ ? Since it's the project folder I don't think it should include "data"

materials/study_ad.json

This looked much shorter than I expected, but could be because it's a dummy study, or alternately I failed to clarify what was supposed to go here.

Can you confirm that a real study pulls stuff from both the ad and the longer description? (We split this into 2 tabs/areas in researcher view...) Some examples of things I expect to see here include e.g.

data/demographic

Hmmm, on reflection, unlike raw/, an MD file here will throw a Psych-DS error, right? Let's leave this folder out and just instruct people to go get that data if they want it. I'll assume that works, and write up the readme file to reflect this, but can revisit if needed.

dataset_description.json comment 1

Can we see it with the more elaborated data dictionary included? (IE full PropertyValue with names and definitions.) The current download system provides options for data dictionaries with definitions filled in for standard columns, and shouty caps where researchers need to provide definitions - let's include by default in Psych-DS.

Use plenty of excessive whitespace/carriage return padding for readability, e.g.:

"variableMeasured": [
  {

    "@type" : "PropertyValue",
    "name": "id",
    
    "description": "Subject identifier, a very very long definition. Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum",
    
  },

  {

    "@type" : "PropertyValue",
    "name": "age",
    
    "description": "Subject age in years.",
  }

]}

dataset_description 2

email - is this the lab's contact email? The study email?

name - this will be the name of the lab, right? Since this often doesn't include the institution, including the parent org is a great idea!

telephone - probably drop this

@id - does this go to the external website link that labs provide? The lab-specific CHS page? Something else?

Can we also add the PI's name somewhere in this file? (We produce the string "contact PI NAME at EMAIL ADDRESS" in various emails, so this info should be present somewhere) Schema.org/Organization suggests we could use "founder" for this (a bit meh, but better than the other 2 options, "sponsor" or "employee")

parentOrganization:name - Can you confirm this property comes up as real-seeming university names for actual labs?

README main:

Is there a reason for the LH padding? I didn't include it in my version, but put it back if you wish :)

mekline commented 1 month ago

Other comments for now!

"Psych-DS formatted frame data" --> "All frame and overview datasets (Psych-DS)"

"Data(one file per response)" --> "Data (zipped directory)" and just ZIP on the button

"The Psych-DS formatted frame data includes etc. etc..." -->

This download provides a single directory containing both data and materials (i.e. current versions of the study protocol, ad text, etc.). All of the data types listed below are included in this download, though note that the filenames used are slightly different. This directory follows the Psych-DS system for data organization [link].

mekline commented 1 month ago

One more thing to check - we discussed this at various points, but can you confirm that both all_responses and all_children of the data/overview/ files will now produce only those sensitive columns that are checked, and both will be named accordingly with or without identifiable-true strings?

bleonar5 commented 1 month ago

Can you confirm that a real study pulls stuff from both the ad and the longer description? (We split this into 2 tabs/areas in researcher view...) Some examples of things I expect to see here include e.g.

Confirmed that this works for studies with real full study_ad_info

email - is this the lab's contact email? The study email?

this is the "contact_email" field for the studies_lab table. It seems to typically correspond to lab emails rather than PI emails.

@id - does this go to the external website link that labs provide? The lab-specific CHS page? Something else?

this points to the "lab_website" field in the studies_lab table

Can we also add the PI's name somewhere in this file? (We produce the string "contact PI NAME at EMAIL ADDRESS" in various emails, so this info should be present somewhere) Schema.org/Organization suggests we could use "founder" for this (a bit meh, but better than the other 2 options, "sponsor" or "employee")

PI names are available as a field but there is only a lab contact email, no PI specific email. Given this, I think it's best to slot the email under the lab object rather than the PI object.

parentOrganization:name - Can you confirm this property comes up as real-seeming university names for actual labs?

confirmed

sonarcloud[bot] commented 1 month ago

Quality Gate Passed Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud