HuBMAP data upload guidelines and instructions for checking that uploads adhere to those guidelines. Assay documentation is on GitHub Pages.
HuBMAP has three distinct metadata processes:
Before we can write code to validate a particular assay type, there are some prerequisites:
When all the parts are finalized,
Once approved, both the CEDAR Metadata Template (metadata schema) and the list of files (directory schema) are fixed in a particular version. The metadata for a particular assay type needs to be consistent for all datasets, as does the set of files which comprise a dataset. Edits to descriptions are welcome, as are improved validations.
If a more significant change is necessary, a new version is required, and when the older form is no longer acceptable, the schema should be deprecated.
HuBMAP HIVE members: For questions about the stability of metadata, contact Nils Gehlenborg (@ngehlenborg), or add him as a reviewer on the PR. For the stability of directory structures, contact Phil Blood (@pdblood).
To validate your metadata TSV files, use the HuBMAP Metadata Spreadsheet Validator. This tool is a web-based application that will categorize any errors in your spreadsheet and provide help fixing those errors. More detailed instructions about using the tool can be found in the Spreadsheet Validator Documentation.
Check out the repo and install dependencies:

```shell
python --version  # Should be Python 3.
git clone https://github.com/hubmapconsortium/ingest-validation-tools.git
cd ingest-validation-tools
# Optionally, set up venv or conda, then:
pip install -r requirements.txt
src/validate_upload.py --help
```
You should see the documentation for `validate_upload.py`.
Now run it against one of the included examples, giving the path to an upload directory:
```shell
src/validate_upload.py \
  --local_directory examples/dataset-examples/bad-tsv-formats/upload \
  --no_url_checks \
  --output as_text
```
Note: URL checking is not supported via `validate_upload.py` at this time, and is disabled with the `--no_url_checks` flag. Please ensure that any fields containing a HuBMAP ID (such as `parent_sample_id`) or an ORCID (`orcid`) are accurate.
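Because URL checks are disabled, ID fields are not verified against any registry, so a quick local format check can still catch typos before upload. The sketch below assumes the common `HBM###.ABCD.###` shape for HuBMAP IDs and the standard 16-character ORCID shape; verify both patterns against current documentation, and note that matching the pattern does not prove the ID exists:

```python
import re

# Assumed formats (verify against current HuBMAP and ORCID documentation):
#   HuBMAP ID, e.g. HBM123.ABCD.456
#   ORCID,     e.g. 0000-0002-1825-0097 (last character may be X)
HUBMAP_ID = re.compile(r"^HBM\d{3}\.[A-Z]{4}\.\d{3}$")
ORCID = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def plausible_hubmap_id(value: str) -> bool:
    """Cheap format check only; does NOT confirm the ID exists."""
    return bool(HUBMAP_ID.match(value.strip()))

def plausible_orcid(value: str) -> bool:
    """Cheap format check only; does NOT confirm the ORCID exists."""
    return bool(ORCID.match(value.strip()))
```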
You should now see this (extensive) error message. This example TSV has been constructed with a mistake in every column, just to demonstrate the checks which are available. Hopefully, more often your experience will be like this:
```shell
src/validate_upload.py \
  --local_directory examples/dataset-examples/good-codex-akoya-metadata-v1/upload \
  --no_url_checks
No errors!
```
Documentation and metadata TSV templates for each assay type are here.
Additional plugin tests can also be run. These additional tests confirm that the files themselves are valid, not just that the directory structures are correct. These additional tests are in a separate repo, and have their own dependencies.
```shell
# Starting from ingest-validation-tools...
cd ..
git clone https://github.com/hubmapconsortium/ingest-validation-tests.git
cd ingest-validation-tests
pip install -r requirements.txt

# Back to ingest-validation-tools...
cd ../ingest-validation-tools

# Failing example, see README.md
src/validate_upload.py \
  --local_directory examples/plugin-tests/expected-failure/upload \
  --run_plugins \
  --no_url_checks \
  --plugin_directory ../ingest-validation-tests/src/ingest_validation_tests/
```
An example of the core error-reporting functionality underlying `validate_upload.py`:

```python
from ingest_validation_tools.error_report import ErrorReport
from ingest_validation_tools.upload import Upload

upload = Upload(directory_path=path)  # path: Path to an upload directory
report = ErrorReport(upload)
print(report.as_text())
```
(If it would be useful for this to be installable with `pip`, please file an issue.)
To make contributions, check out the project, `cd` into it, set up a virtual environment, and then:

```shell
pip install -r requirements.txt
pip install -r requirements-dev.txt
brew install parallel     # On macOS
apt-get install parallel  # On Ubuntu
./test.sh
```
After making tweaks to the schema, you will need to regenerate the docs: The test error message will tell you what to do.
This repo uses GitHub Actions to check formatting and linting of code using black, isort, and flake8. Especially before submitting a PR, make sure your code is compliant. Run the following from the base `ingest-validation-tools` directory:

```shell
black --line-length 99 .
isort --profile black --multi-line 3 .
flake8
```
Integrating black and potentially isort/flake8 with your editor may allow you to skip this step.
For releases we're just using git tags:
```shell
$ git tag v0.0.x
$ git push origin v0.0.x
```
Checking in the built documentation is not the typical approach, but has worked well for this project:
Data upload to HuBMAP is composed of discrete phases:
Uploads are based on directories containing at a minimum `*-metadata.tsv` files. The type of a metadata TSV is determined by reading the first row. The `antibodies_path` (for applicable types), `contributors_path`, and `data_path` are relative to the location of the TSV. The antibodies and contributors TSVs will typically be at the top level of the upload, but if they are applicable to only a single dataset, they can be placed within that dataset's `extras/` directory.
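Since the type of a metadata TSV is determined from its first row, you can inspect a file's header locally with nothing but the standard library. This is only an illustrative sketch (the column names in the demo file are examples, not a complete schema), not the tool's actual type-detection logic:

```python
import csv
from pathlib import Path

def read_header(tsv_path):
    """Return the column names from the first row of a metadata TSV."""
    with open(tsv_path, newline="") as f:
        return next(csv.reader(f, delimiter="\t"))

# Demo with a tiny, made-up metadata TSV:
demo = Path("demo-metadata.tsv")
demo.write_text(
    "parent_sample_id\tcontributors_path\tdata_path\n"
    "HBM123.ABCD.456\tcontributors.tsv\tdataset-1\n"
)
print(read_header(demo))  # ['parent_sample_id', 'contributors_path', 'data_path']
```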
You can validate your upload directory locally, then upload it to Globus, and the same validation will be run there.