Scoping the Psych-DS Validator
Context
This is a draft of the requirements/architecture document for the Psych-DS validator.
The Psych-DS team is preparing to begin development on our suite of validation tools, which will include a web app, a Node.js package, a Python package, and an R package. We want the tool to be open source from the ground up, so we would like to encourage community members and collaborators to contribute feedback, suggestions, and discussions in the form of GitHub issues.
Target audience
The target audience for a validator tool for the Psych-DS (Dataset) standard would likely be:
Researchers in psychology and cognitive neuroscience who collect and analyze behavioral data. Having a tool to validate their datasets against the Psych-DS schema standardizes the structure and ensures compatibility with other tools in the ecosystem.
Developers of software libraries and applications for behavioral data analysis like MATLAB, Python pandas, R, etc. They can integrate the validator to check if datasets adhere to the Psych-DS standards before ingesting them into their tools.
Cloud platforms and repositories for sharing behavioral research data. A validator helps ensure datasets uploaded to these repositories are standardized and analysis-ready.
Publishers of behavioral research papers. A Psych-DS validation tool could be integrated into the submission pipeline to verify accompanying datasets meet the standard. This encourages reproducibility and rigor.
Educators teaching behavioral data analysis methods and best practices to students. Having students validate their own datasets with the tool teaches standardization.
Data scientists/analysts responsible for wrangling and cleaning heterogeneous behavioral data into a consistent format for downstream use.
Overall, any researcher, software developer, platform, or institution involved with curating, sharing, or analyzing behavioral research data can benefit from having a simple way to validate against a common standard. The validator makes it easier to ensure quality and interoperability of these valuable scientific datasets.
Scope
Core Scope:
Metadata validation - Confirm that metadata objects are present at the top level of the dataset and as sidecars to raw data files in subdirectories, and that each one is valid JSON-LD using the Schema.org Dataset type (a minimal example follows this list).
File structure validation - Confirm that the data directory is organized according to Psych-DS specifications, containing only allowed subdirectories, file types, and metadata files.
Data file validation - Confirm that variables defined in the metadata have corresponding columns in the data files themselves, and that files are in allowed formats.
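To make the metadata requirement concrete, here is a rough sketch of the kind of top-level metadata object the validator would accept, along with a naive structural check. The specific fields shown (name, description, variableMeasured) are illustrative assumptions; the LinkML schema is the source of truth for what is actually required.

```typescript
// Hypothetical example of a top-level dataset_description.json object.
// Fields beyond @context/@type are illustrative assumptions; the LinkML
// schema governs what is actually required.
const exampleMetadata: Record<string, unknown> = {
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: "Example reaction-time study",
  description: "Trial-level responses from a simple RT task.",
  variableMeasured: ["session_id", "trial", "rt", "response"],
};

// A naive structural check in the spirit of "metadata validation":
// confirm the object declares itself as a Schema.org Dataset.
function looksLikeDatasetMetadata(obj: Record<string, unknown>): boolean {
  return obj["@type"] === "Dataset" && "@context" in obj;
}

console.log(looksLikeDatasetMetadata(exampleMetadata)); // true
```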
Extended Scope:
Dataset conversion - Automatically convert a Psych-DS compliant dataset to a BIDS compliant dataset
Automatic download - Create functionality within platforms/tools like PsychoPy, jsPsych, and Lookit to automatically download response data in valid Psych-DS format
Integration with CEDAR wizard - Create a template within the CEDAR wizard that will allow users to easily create valid metadata JSON files
Optional Scope:
Integrate CEDAR wizard directly into web validator
Integrate dynamic tutorials into web validator
Create pipeline using web validator to upload validated datasets to OSF repository with Psych-DS tag
Contributions
Collaboration on Psych-DS is expected to follow our existing Code of Conduct.
To contribute to the conversation, feel free to add comments to this issue or any of the currently open issues mentioned here. You can also create your own issue by using the "Scoping" template after clicking "add new issue". If you have any questions about the process, feel free to add them as a comment on this issue, and we'll get back to you.
If you contribute in some way other than interacting with this issue, please also leave a comment below so we can add your name to this list of contributors! (Capturing both PRs and non-code contributions to this project is a key goal!!)
Melissa Kline Struhl
Brian Leonard
Martin Seehuus
Russ Poldrack
David Moreau
Josh de Leeuw
Eduard Klapwijk
Codebase
The validator tools themselves will live in this repository; additional repositories may be created at https://github.com/psych-ds/ for modularity, separate validator tools, etc.
The Psych-DS “core” repo (https://github.com/psych-ds/psych-DS) contains project orientation and the initial LinkML schema that Brian is currently working on.
Once migration is complete, (1) the LinkML schema plus (2) the Node.js CLI tool will constitute the reference implementation of Psych-DS.
Documentation
Currently, the ‘gold standard’ record of Psych-DS is the large Google Doc that has been the center of our work for the past several years. This will not be a good or maintainable solution in the long term!
Following BIDS’s model, we plan to import the text of the specification itself into a ReadtheDocs instance, doing this piece by piece in tandem with the LinkML schema implementation.
Once migration is complete, the ReadtheDocs site will serve as the canonical documentation/reference for the specification.
In addition to the spec itself, this ReadtheDocs site should contain links to all validator software along with tools/resources for using & getting started with Psych-DS.
We should follow a defined process for de-accessioning/migrating material that’s currently in the large Google Doc into the new documentation and/or schema files. (See psych-ds/psych-DS#29)
Issues
When significant work is being done outside of the GitHub repos, we should maintain a GitHub issue indicating that this is the case, to avoid losing track of that work. See e.g. https://github.com/psych-ds/psych-DS/issues/30
For this repo, we'll be using the https://github.com/psych-ds/psych-DS/labels/Scoping label to indicate issues where community discussion at this stage should take place, with additional labels for further categorization. These labels, for the time being, are limited to:
Here is a complete list of smaller issues relevant to scoping the validator:
#3
#4
#8
psych-ds/psych-DS#33
Psych-DS Validator Requirements
0. Available resources
What needs to launch with the beta versions of the CLI + web browser tools?
CLI tool
Website
LinkML documentation "catalogue"
Tutorial/step-by-step guide including CEDAR wizard (video??)
See BIDS docs for inspiration on tutorials/beginner guides
Communication plan for the launch - listserv messages, at least initial thoughts about example datasets/user testing sprints
More canonical datasets and more communication around uploading validated datasets to some centralized repository
1. User Requirements
User personas
Non-coder Researchers
This researcher in the behavioral sciences would be interested in producing datasets that conform to Psych-DS criteria, but is mostly accustomed to accessible GUI tools like RStudio, PsychoPy, Qualtrics, Excel, etc. They would require a validator tool that is either simple to use through a publicly hosted web app, or installable as a package within a GUI they are already familiar with, such as RStudio. They would likely not be interested in managing complex custom options, and would want to trust the system to validate their dataset in a comprehensive but default manner. They would be less interested in documentation about how the tool works, and more interested in how it is used. By keeping these non-coder-friendly apps simple and transparent and giving them functional parity with the CLI tools, we hope to enable researchers to access the benefits of both Psych-DS itself and further tools that build on it (e.g. automatic survey scoring, repository data submission).
Coder Researchers
This researcher is more comfortable with scripting, tweaking code, and using command-line tools. They would appreciate easy, low-fuss interfaces too, but they would also want GUI-less CLI options across a number of frameworks, so they can integrate Psych-DS validation into their automated data pipelines. They would be interested in a suite of custom options for tweaking the validation function, as well as extensive documentation of how the tool operates (rather than how it is used).
Managers of Research Support Software
These individuals and organizations would have a vested interest in the success of research support tools like PsychoPy, jsPsych, Pavlovia, ExperimentRunner, etc., and would be interested in a tool that is clearly defined, well made, and modular enough to be integrated into their own tools/platforms.
All users will have a natural interest in maintaining the anonymity and security of their datasets and participants. They will require all tools to be agnostic to the actual contents of the datasets, and not to require any uploading of files in order to work. The tools should all be transparent about this aspect, to reduce any concerns that might come up.
Given the simplicity of the validation function, these tools should all run more or less instantaneously.
2. UI requirements
UI discussed in full within this issue
3. Validation Process
Schema requirements are modeled in this LinkML schema model. The CLI app unpacks a JSON version of this schema and validates by constructing a file tree of the input directory and checking that file tree against the rules and objects it derives from the schema (a minimal sketch follows). The current validation function is located here.
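As a rough illustration of that flow (and not the actual implementation), the sketch below builds a file tree from an input directory and filters it through rules nominally derived from a schema JSON file. All of the names here (FileTree, Rule, buildFileTree, rulesFromSchema, validate) are hypothetical, and the single hard-coded rule stands in for the rules that would really be read out of the schema.

```typescript
// Minimal sketch of the CLI validation flow, with hypothetical names:
// build a file tree of the input directory, derive rules from the schema
// JSON, and report any rules the tree fails.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

interface FileTree {
  name: string;
  isDirectory: boolean;
  children: FileTree[];
}

interface Rule {
  description: string;
  check: (tree: FileTree) => boolean;
}

function buildFileTree(path: string, name = "."): FileTree {
  const isDirectory = statSync(path).isDirectory();
  const children = isDirectory
    ? readdirSync(path).map((child) => buildFileTree(join(path, child), child))
    : [];
  return { name, isDirectory, children };
}

function rulesFromSchema(schemaPath: string): Rule[] {
  const schema = JSON.parse(readFileSync(schemaPath, "utf-8"));
  void schema; // in the real tool, rules would be derived from this object
  return [
    {
      description: "dataset_description.json exists at the top level",
      check: (tree) =>
        tree.children.some((c) => c.name === "dataset_description.json"),
    },
  ];
}

function validate(datasetPath: string, schemaPath: string): string[] {
  const tree = buildFileTree(datasetPath);
  return rulesFromSchema(schemaPath)
    .filter((rule) => !rule.check(tree))
    .map((rule) => `FAILED: ${rule.description}`);
}

// Usage: validate("./my-dataset", "./schema.json") returns [] when the
// dataset passes, or a list of failure messages otherwise.
```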
In-browser validation
A minimal, most likely one-page web app will suffice, using a browserified version of the CLI app to perform the validation function (see the sketch below).
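For illustration, here is a minimal sketch of the browser glue code, assuming the bundled validator exposes a validate() function that accepts an in-memory file listing; that signature, and the element IDs used, are assumptions rather than the real API. Nothing is uploaded anywhere; the file listing stays in the browser.

```typescript
// Hypothetical one-page app wiring: read a directory chosen via an
// <input type="file" webkitdirectory> element and hand the listing to the
// bundled validate() function. File contents never leave the browser.
declare function validate(
  files: { path: string; file: File }[],
): Promise<{ errors: string[] }>; // assumed signature of the bundled validator

const input = document.querySelector<HTMLInputElement>("#dataset-input")!;
const results = document.querySelector<HTMLElement>("#results")!;

input.addEventListener("change", async () => {
  const files = Array.from(input.files ?? []).map((file) => ({
    path: file.webkitRelativePath, // relative path within the chosen folder
    file,
  }));
  const { errors } = await validate(files);
  results.textContent =
    errors.length === 0 ? "Dataset is valid!" : errors.join("\n");
});
```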
Psych-DS validation
Regardless of whether LinkML-runtime or Ajv already provide a ready-made validation function for every aspect of the Psych-DS schema, our aim is to treat the schema models as the source of “ground truth” about the rules of the specification. In other words, even if certain elements of the dataset are not validated by a ready-made third-party process, we will still design our own custom functions that pull the relevant rules from the schema model files.
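One example of a rule that a generic JSON schema validator cannot check on its own is the cross-check between the variables declared in the metadata and the actual CSV column headers (see data file validation under Core Scope). A naive sketch, assuming variableMeasured is a plain list of strings and ignoring quoted CSV fields:

```typescript
// Custom cross-check: every variable listed in variableMeasured should
// appear as a column header in the corresponding data file.
// Deliberately naive CSV handling (no quoted-field support).
function missingColumns(
  metadata: { variableMeasured?: string[] },
  csvText: string,
): string[] {
  const header = csvText.split(/\r?\n/, 1)[0] ?? "";
  const columns = new Set(header.split(",").map((c) => c.trim()));
  return (metadata.variableMeasured ?? []).filter((v) => !columns.has(v));
}

// Example usage:
console.log(
  missingColumns(
    { variableMeasured: ["trial", "rt", "response"] },
    "trial,rt\n1,532\n2,610\n",
  ),
); // ["response"]
```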
/data may include subdirectories either to contain alternate (such as raw) forms of the data, or to sub-categorize data and apply certain metadata elements hierarchically. Data subdirectories can be nested an arbitrary number of times.
Key-value pairs, aka “keywords”, consist of a label and a value separated by a hyphen. Each key must be entirely lowercase, but the same does not apply to values.
Keywords are encouraged to be limited to this fixed set: study, site, subject, session, task, condition, trial, stimulus, and description.
Different keywords in a series are separated by an underscore. The series of keywords and the suffix are also separated by an underscore (a filename-parsing sketch follows this list).
If you have a column that explicitly and uniquely identifies each row of a dataset, it should be named ‘row_id’. A column named row_id must contain unique values in every row.
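The filename rules above lend themselves to a simple parsing check. The sketch below is an approximation for illustration only; in the validator itself, the pattern and the allowed suffix will come from the schema, and the "_data.csv" suffix assumed here is just an example.

```typescript
// Illustrative parse of a keyword-based filename such as
// "study-rt_subject-01_data.csv" into its key-value pairs.
// The "_data.csv" suffix and the exact pattern are assumptions.
function parseKeywords(filename: string): Map<string, string> | null {
  const stem = filename.replace(/_data\.csv$/, "");
  const pairs = new Map<string, string>();
  for (const part of stem.split("_")) {
    const match = /^([a-z]+)-(.+)$/.exec(part); // keys must be all lowercase
    if (!match) return null; // not a valid keyword series
    pairs.set(match[1], match[2]);
  }
  return pairs;
}

// row_id check: a column named row_id must contain unique values.
function rowIdIsUnique(rowIdValues: string[]): boolean {
  return new Set(rowIdValues).size === rowIdValues.length;
}

console.log(parseKeywords("study-rt_subject-01_data.csv")); // study→rt, subject→01
console.log(rowIdIsUnique(["1", "2", "2"])); // false
```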
Encoding
UTF-8 encoding must be used for all metadata and CSV data files.
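A simple way to enforce this in the JavaScript tools, sketched under the assumption that file contents are available as raw bytes, is to decode with TextDecoder in fatal mode, which throws on invalid UTF-8:

```typescript
// Sketch of a UTF-8 check: TextDecoder with fatal:true throws on byte
// sequences that are not valid UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(new TextEncoder().encode("déjà vu"))); // true
console.log(isValidUtf8(new Uint8Array([0xff, 0xfe, 0x00]))); // false
```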
4. Functional Requirements
Application architecture
The validator suite will include a React web application, a Node.js package, a Python package, and an R package. Each tool in the suite will rely on the same central validation schemas, which will be stored in the LinkML schema language.
The front-end architecture is mostly up in the air; since it's just going to be serving a single static web page, there isn't a requirement for anything too specific or specialized. For parity's sake, it may be simplest to re-use the Deno framework for the front-end.
The Node.js package will be the first element to be completed, and will serve as the "canonical" validator function for the time being (a sketch of how the shared core might be exposed follows).
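As a rough sketch of that arrangement (module layout and function names are assumptions, not the final API), the Node.js package could expose one validation core that the CLI, the web app, and the Python/R wrappers all call into. Shown as a single file for brevity:

```typescript
// --- core: shared validation logic, importable by every front end ---
export interface ValidationResult {
  valid: boolean;
  errors: string[];
}

export async function validateDataset(path: string): Promise<ValidationResult> {
  // ...load schema, build file tree, apply rules (see earlier sketches)...
  return { valid: true, errors: [] };
}

// --- CLI: a thin wrapper around the same core ---
import process from "node:process";

async function main(): Promise<void> {
  const [target] = process.argv.slice(2);
  const result = await validateDataset(target ?? ".");
  console.log(
    result.valid ? "Valid Psych-DS dataset" : result.errors.join("\n"),
  );
  process.exitCode = result.valid ? 0 : 1;
}

main();
```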
5. Non-Functional Requirements
6. Testing Requirements
7. Deployment Requirements