Scoping the Psych-DS Validator
Context
This is a draft of the requirements/architecture document for the Psych-DS validator.
The Psych-DS team is preparing to begin development on our suite of validation tools, which will include a web app, a Node.js package, a Python package, and an R package. We want the tool to be open source from the ground up, so we would like to encourage community members and collaborators to contribute feedback, suggestions, and discussions in the form of GitHub issues.
Target audience
The target audience for a validator tool for the Psych-DS (Dataset) standard would likely be:
Researchers in psychology and cognitive neuroscience who collect and analyze behavioral data. Having a tool to validate their datasets against the Psych-DS schema standardizes the structure and ensures compatibility with other tools in the ecosystem.
Developers of software libraries and applications for behavioral data analysis like MATLAB, Python pandas, R, etc. They can integrate the validator to check if datasets adhere to the Psych-DS standards before ingesting them into their tools.
Cloud platforms and repositories for sharing behavioral research data. A validator helps ensure datasets uploaded to these repositories are standardized and analysis-ready.
Publishers of behavioral research papers. A Psych-DS validation tool could be integrated into the submission pipeline to verify accompanying datasets meet the standard. This encourages reproducibility and rigor.
Educators teaching behavioral data analysis methods and best practices to students. Having students validate their own datasets with the tool teaches standardization.
Data scientists/analysts responsible for wrangling and cleaning heterogeneous behavioral data into a consistent format for downstream use.
Overall, any researcher, software developer, platform, or institution involved with curating, sharing, or analyzing behavioral research data can benefit from having a simple way to validate against a common standard. The validator makes it easier to ensure quality and interoperability of these valuable scientific datasets.
Scope
Core Scope:
Metadata validation - Confirm that metadata objects are present at the top level of the dataset and as sidecars to raw data files in subdirectories, and that each one is valid JSON-LD using the Schema.org Dataset type (a minimal example follows this list).
File structure validation - Confirm that the data directory is organized according to Psych-DS specifications, containing only allowed subdirectories, file types, and metadata files.
Data file validation - Confirm that variables defined in the metadata have corresponding columns in the data files themselves, and that files are in allowed formats.
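To make the metadata requirement concrete, here is a rough sketch of the kind of top-level metadata object the validator would accept, along with a naive structural check. The specific fields shown (name, description, variableMeasured) are illustrative assumptions; the LinkML schema is the source of truth for what is actually required.

```typescript
// Hypothetical example of a top-level dataset_description.json object.
// Fields beyond @context/@type are illustrative assumptions; the LinkML
// schema governs what is actually required.
const exampleMetadata: Record<string, unknown> = {
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: "Example reaction-time study",
  description: "Trial-level responses from a simple RT task.",
  variableMeasured: ["session_id", "trial", "rt", "response"],
};

// A naive structural check in the spirit of "metadata validation":
// confirm the object declares itself as a Schema.org Dataset.
function looksLikeDatasetMetadata(obj: Record<string, unknown>): boolean {
  return obj["@type"] === "Dataset" && "@context" in obj;
}

console.log(looksLikeDatasetMetadata(exampleMetadata)); // true
```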
Extended Scope:
Dataset conversion - Automatically convert a Psych-DS compliant dataset to a BIDS compliant dataset
Automatic download - Create functionality within platforms/tools like PsychoPy, jsPsych, and Lookit to automatically download response data in valid Psych-DS format
Integration with CEDAR wizard - Create a template within the CEDAR wizard that will allow users to easily create valid metadata JSON files
Optional Scope:
Integrate CEDAR wizard directly into web validator
Integrate dynamic tutorials into web validator
Create pipeline using web validator to upload validated datasets to OSF repository with Psych-DS tag
Contributions
Collaboration on Psych-DS is expected to follow our existing Code of Conduct.
To contribute to the conversation, feel free to add comments to this issue or any of the currently open issues mentioned here. You can also create your own issue by using the "Scoping" template after clicking "add new issue". If you have any questions about the process, feel free to add them as a comment on this issue, and we'll get back to you.
If you contribute in some way other than interacting with this issue, please also leave a comment below so we can add your name to this list of contributors! (Capturing both PRs and non-code contributions to this project is a key goal!!)
Melissa Kline Struhl
Brian Leonard
Martin Seehuus
Russ Poldrack
David Moreau
Josh de Leeuw
Eduard Klapwijk
Codebase
The validator tools themselves will live in this repository; additional repositories may be created at https://github.com/psych-ds/ for modularity, separate validator tools, etc.
The Psych-DS “core” repo (https://github.com/psych-ds/psych-DS) contains project orientation and the initial LinkML schema that Brian is currently working on.
Once migration is complete, (1) the LinkML schema plus (2) the Node.js CLI tool will constitute the reference implementation of Psych-DS.
Documentation
Currently, the ‘gold standard’ record of Psych-DS is the large Google Doc that has been the center of our work for the past several years. This will not be a good or maintainable solution in the long term!
Following BIDS’s model, we plan to import the text of the specification itself into a ReadtheDocs instance, doing this piece by piece in tandem with the LinkML schema implementation.
Once migration is complete, the ReadtheDocs site will serve as the canonical documentation/reference for the specification.
In addition to the spec itself, this ReadtheDocs site should contain links to all validator software along with tools/resources for using & getting started with Psych-DS.
We should follow a defined process for de-accessioning/migrating material that’s currently in the large Google Doc into the new documentation and/or schema files. (See psych-ds/psych-DS#29)
Issues
When significant work is being done outside of the GitHub repos, we should maintain a GitHub issue indicating that this is the case, to avoid losing track of that work. See e.g. https://github.com/psych-ds/psych-DS/issues/30
For this repo, we'll be using the https://github.com/psych-ds/psych-DS/labels/Scoping label to indicate issues where community discussion at this stage should take place, with additional labels for further categorization. These labels, for the time being, are limited to:
Here is a complete list of smaller issues relevant to scoping the validator:
#3
#4
#8
psych-ds/psych-DS#33
Psych-DS Validator Requirements
0. Available resources
What needs to launch with the beta versions of the CLI + web browser tools?
CLI tool
Website
LinkML documentation "catalogue"
Tutorial/step-by-step guide including CEDAR wizard (video??)
See BIDS docs for inspiration on tutorials/beginner guides
Communication plan for the launch - listserv messages, at least initial thoughts about example datasets/user testing sprints
More canonical datasets and more communication around uploading validated datasets to some centralized repository
1. User Requirements
User personas
Non-coder Researchers
This researcher in the behavioral sciences would be interested in producing datasets that conform to Psych-DS criteria, but is mostly accustomed to accessible GUI tools like RStudio, PsychoPy, Qualtrics, Excel, etc. They would require a validator tool that is either simple to use through a publicly hosted web app, or installable as a package within a GUI they are already familiar with, such as RStudio. They would likely not be interested in managing complex custom options, and would want to trust the system to validate their dataset in a comprehensive but default manner. They would be less interested in documentation about how the tool works, and more interested in how it is used. By keeping these non-coder-friendly apps simple and transparent and giving them functional parity with the CLI tools, we hope to enable researchers to access the benefits of both Psych-DS itself and further tools that build on it (e.g. automatic survey scoring, repository data submission).
Coder Researchers
This researcher is more comfortable with scripting, tweaking code, and using command-line tools. They would appreciate easy, low-fuss interfaces too, but they would also want GUI-less CLI options across a number of frameworks, so they can integrate Psych-DS validation into their automated data pipelines. They would be interested in a suite of custom options for tweaking the validation function, as well as extensive documentation of how the tool operates (rather than how it is used).
Managers of Research Support Software
These individuals and organizations would have a vested interest in the success of research support tools like PsychoPy, jsPsych, Pavlovia, ExperimentRunner, etc., and would be interested in a tool that is clearly defined, well made, and modular enough to be integrated into their own tools/platforms.
All users will have a natural interest in maintaining the anonymity and security of their datasets and participants. They will require all tools to be agnostic to the actual contents of the datasets, and not to require any uploading of files in order to work. The tools should all be transparent about this aspect, to reduce any concerns that might come up.
Given the simplicity of the validation function, these tools should all run more or less instantaneously.
2. UI requirements
UI discussed in full within this issue
3. Validation Process
Schema requirements are modeled in this LinkML schema model. The CLI app unpacks a JSON version of this schema and validates by constructing a file tree of the input directory and checking that file tree against the rules and objects it derives from the schema (a minimal sketch follows). The current validation function is located here.
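As a rough illustration of that flow (and not the actual implementation), the sketch below builds a file tree from an input directory and filters it through rules nominally derived from a schema JSON file. All of the names here (FileTree, Rule, buildFileTree, rulesFromSchema, validate) are hypothetical, and the single hard-coded rule stands in for the rules that would really be read out of the schema.

```typescript
// Minimal sketch of the CLI validation flow, with hypothetical names:
// build a file tree of the input directory, derive rules from the schema
// JSON, and report any rules the tree fails.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

interface FileTree {
  name: string;
  isDirectory: boolean;
  children: FileTree[];
}

interface Rule {
  description: string;
  check: (tree: FileTree) => boolean;
}

function buildFileTree(path: string, name = "."): FileTree {
  const isDirectory = statSync(path).isDirectory();
  const children = isDirectory
    ? readdirSync(path).map((child) => buildFileTree(join(path, child), child))
    : [];
  return { name, isDirectory, children };
}

function rulesFromSchema(schemaPath: string): Rule[] {
  const schema = JSON.parse(readFileSync(schemaPath, "utf-8"));
  void schema; // in the real tool, rules would be derived from this object
  return [
    {
      description: "dataset_description.json exists at the top level",
      check: (tree) =>
        tree.children.some((c) => c.name === "dataset_description.json"),
    },
  ];
}

function validate(datasetPath: string, schemaPath: string): string[] {
  const tree = buildFileTree(datasetPath);
  return rulesFromSchema(schemaPath)
    .filter((rule) => !rule.check(tree))
    .map((rule) => `FAILED: ${rule.description}`);
}

// Usage: validate("./my-dataset", "./schema.json") returns [] when the
// dataset passes, or a list of failure messages otherwise.
```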
In-browser validation
A minimal, most likely one-page web app will suffice, using a browserified version of the CLI app to perform the validation function (see the sketch below).
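For illustration, here is a minimal sketch of the browser glue code, assuming the bundled validator exposes a validate() function that accepts an in-memory file listing; that signature, and the element IDs used, are assumptions rather than the real API. Nothing is uploaded anywhere; the file listing stays in the browser.

```typescript
// Hypothetical one-page app wiring: read a directory chosen via an
// <input type="file" webkitdirectory> element and hand the listing to the
// bundled validate() function. File contents never leave the browser.
declare function validate(
  files: { path: string; file: File }[],
): Promise<{ errors: string[] }>; // assumed signature of the bundled validator

const input = document.querySelector<HTMLInputElement>("#dataset-input")!;
const results = document.querySelector<HTMLElement>("#results")!;

input.addEventListener("change", async () => {
  const files = Array.from(input.files ?? []).map((file) => ({
    path: file.webkitRelativePath, // relative path within the chosen folder
    file,
  }));
  const { errors } = await validate(files);
  results.textContent =
    errors.length === 0 ? "Dataset is valid!" : errors.join("\n");
});
```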
Psych-DS validation
Regardless of whether LinkML-runtime or Ajv already provide a ready-made validation function for every aspect of the Psych-DS schema, our aim is to treat the schema models as the source of “ground truth” about the rules of the specification. In other words, even if certain elements of the dataset are not validated by a ready-made third-party process, we will still design our own custom functions that pull the relevant rules from the schema model files.
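One example of a rule that a generic JSON schema validator cannot check on its own is the cross-check between the variables declared in the metadata and the actual CSV column headers (see data file validation under Core Scope). A naive sketch, assuming variableMeasured is a plain list of strings and ignoring quoted CSV fields:

```typescript
// Custom cross-check: every variable listed in variableMeasured should
// appear as a column header in the corresponding data file.
// Deliberately naive CSV handling (no quoted-field support).
function missingColumns(
  metadata: { variableMeasured?: string[] },
  csvText: string,
): string[] {
  const header = csvText.split(/\r?\n/, 1)[0] ?? "";
  const columns = new Set(header.split(",").map((c) => c.trim()));
  return (metadata.variableMeasured ?? []).filter((v) => !columns.has(v));
}

// Example usage:
console.log(
  missingColumns(
    { variableMeasured: ["trial", "rt", "response"] },
    "trial,rt\n1,532\n2,610\n",
  ),
); // ["response"]
```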
/data may include subdirectories either to contain alternate (such as raw) forms of the data, or to sub-categorize data and apply certain metadata elements hierarchically. Data subdirectories can be nested an arbitrary number of times.
Key-value pairs, aka “keywords”, consist of a label and a value separated by a hyphen. Each key must be entirely lowercase, but the same does not apply to values.
Keywords are encouraged to be limited to this fixed set: study, site, subject, session, task, condition, trial, stimulus, and description.
Different keywords in a series are separated by an underscore. The series of keywords and the suffix are also separated by an underscore (a filename-parsing sketch follows this list).
If you have a column that explicitly and uniquely identifies each row of a dataset, it should be named ‘row_id’. A column named row_id must contain unique values in every row.
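The filename rules above lend themselves to a simple parsing check. The sketch below is an approximation for illustration only; in the validator itself, the pattern and the allowed suffix will come from the schema, and the "_data.csv" suffix assumed here is just an example.

```typescript
// Illustrative parse of a keyword-based filename such as
// "study-rt_subject-01_data.csv" into its key-value pairs.
// The "_data.csv" suffix and the exact pattern are assumptions.
function parseKeywords(filename: string): Map<string, string> | null {
  const stem = filename.replace(/_data\.csv$/, "");
  const pairs = new Map<string, string>();
  for (const part of stem.split("_")) {
    const match = /^([a-z]+)-(.+)$/.exec(part); // keys must be all lowercase
    if (!match) return null; // not a valid keyword series
    pairs.set(match[1], match[2]);
  }
  return pairs;
}

// row_id check: a column named row_id must contain unique values.
function rowIdIsUnique(rowIdValues: string[]): boolean {
  return new Set(rowIdValues).size === rowIdValues.length;
}

console.log(parseKeywords("study-rt_subject-01_data.csv")); // study→rt, subject→01
console.log(rowIdIsUnique(["1", "2", "2"])); // false
```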
Encoding
UTF-8 encoding must be used for all metadata and CSV data files.
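A simple way to enforce this in the JavaScript tools, sketched under the assumption that file contents are available as raw bytes, is to decode with TextDecoder in fatal mode, which throws on invalid UTF-8:

```typescript
// Sketch of a UTF-8 check: TextDecoder with fatal:true throws on byte
// sequences that are not valid UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(new TextEncoder().encode("déjà vu"))); // true
console.log(isValidUtf8(new Uint8Array([0xff, 0xfe, 0x00]))); // false
```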
4. Functional Requirements
Application architecture
The validator suite will include a React web application, a Node.js package, a Python package, and an R package. Each tool in the suite will rely on the same central validation schemas, which will be stored in the LinkML schema language.
The front-end architecture is mostly up in the air; since it's just going to be serving a single static web page, there isn't a requirement for anything too specific or specialized. For parity's sake, it may be simplest to re-use the Deno framework for the front-end.
The Node.js package will be the first element to be completed, and will serve as the "canonical" validator function for the time being (a sketch of how the shared core might be exposed follows).
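As a rough sketch of that arrangement (module layout and function names are assumptions, not the final API), the Node.js package could expose one validation core that the CLI, the web app, and the Python/R wrappers all call into. Shown as a single file for brevity:

```typescript
// --- core: shared validation logic, importable by every front end ---
export interface ValidationResult {
  valid: boolean;
  errors: string[];
}

export async function validateDataset(path: string): Promise<ValidationResult> {
  // ...load schema, build file tree, apply rules (see earlier sketches)...
  return { valid: true, errors: [] };
}

// --- CLI: a thin wrapper around the same core ---
import process from "node:process";

async function main(): Promise<void> {
  const [target] = process.argv.slice(2);
  const result = await validateDataset(target ?? ".");
  console.log(
    result.valid ? "Valid Psych-DS dataset" : result.errors.join("\n"),
  );
  process.exitCode = result.valid ? 0 : 1;
}

main();
```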
5. Non-Functional Requirements
6. Testing Requirements
7. Deployment Requirements