hubmapconsortium / ingest-validation-tools

HuBMAP data submission guidelines, and tools which check that submissions adhere to those guidelines.
MIT License

Inquiry about using the tool for validation #999

Closed icaoberg closed 3 years ago

icaoberg commented 3 years ago

I am trying to validate an LC-MS Top-Down submission from Northwestern. (For reference, @jswelling has not ingested this dataset, nor has @cebriggs7135 validated it using Airflow.) When I use the command

python3 src/validate_upload.py --local_directory '7f1fd7b9c8c3745fcab037a2fa37f5b9/' --dataset_ignore_globs 'extras' --dataset_ignore_globs '*metadata.tsv' --dataset_ignore_globs 'validation_report.txt'

I get this message from the tool

/hive/users/hive/ingest-validation-tools/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.11) or chardet (4.0.0) doesn't match a supported version!
  RequestsDependencyWarning)
There are no references from any TSV to Proteomics.
There are no references from any TSV to extras.
There are no references from any TSV to validation_report.txt.
Hint: If validation fails because of extra whitespace in the TSV, try:
src/cleanup_whitespace.py --tsv_in original.tsv --tsv_out clean.tsv.

and I don't know how to interpret it. I ran cleanup_whitespace.py as suggested, just in case, and got the same output.

The directory structure is

$ tree .
.
├── contributors.tsv
├── extras
├── metadata.tsv
├── Proteomics
│   ├── ID_search_results
│   │   └── TDMS_Proteoform_Results.csv
│   └── raw_data
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep01_techrep01.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep01_techrep02.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep02_techrep01.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep02_techrep02.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep03_techrep01.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep03_techrep02.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep04_techrep01.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep04_techrep02.raw
│       ├── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep05_techrep01.raw
│       └── 20200707_rmi049_75umPLRPS_Kidney_GF10pc_VAN0003LK32_biorep05_techrep02.raw
└── validation_report.txt

4 directories, 14 files

mccalluc commented 3 years ago

Upload directories should have this structure:

[Image: Upload directory structure]

In the canonical form, there should only be TSVs and data directories at the top level. (As part of ingest, the metadata TSV is broken up into single lines, and each single line is put in the corresponding dataset, and the validation is invoked with a different set of flags... but I think you want the documented, canonical form.)
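As a rough illustration of the rule above, the sketch below lists top-level entries of an upload directory that are neither TSVs nor referenced by any TSV. It is not the logic of validate_upload.py: the function name `unreferenced_entries` is hypothetical, and the column names `data_path` and `contributors_path` are assumptions about the metadata TSV layout.

```python
import csv
from pathlib import Path


def unreferenced_entries(upload_dir, path_columns=("data_path", "contributors_path")):
    """Return top-level entries that are not TSVs and that no TSV points to.

    Hypothetical helper; the column names are assumed, not taken from the tool.
    """
    upload_dir = Path(upload_dir)
    referenced = set()

    # Collect the first path component of every path-like cell in each TSV.
    for tsv_path in upload_dir.glob("*.tsv"):
        with open(tsv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                for column in path_columns:
                    value = (row.get(column) or "").strip()
                    parts = Path(value).parts if value else ()
                    if parts:
                        # "./Proteomics/raw_data" and "Proteomics" both count
                        # as a reference to the top-level entry "Proteomics".
                        referenced.add(parts[0])

    # Anything at the top level that is not a TSV must be referenced.
    return sorted(
        entry.name
        for entry in upload_dir.iterdir()
        if entry.suffix != ".tsv" and entry.name not in referenced
    )
```

Under these assumptions, a call such as `unreferenced_entries("7f1fd7b9c8c3745fcab037a2fa37f5b9/")` would return the names of any top-level entries, such as a data directory, that no TSV row points to.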

To fix:

icaoberg commented 3 years ago

@mccalluc so the issue still remains even after your suggestions.

python3 src/validate_upload.py --local_directory 7f1fd7b9c8c3745fcab037a2fa37f5b9/ --dataset_ignore_globs extras --dataset_ignore_globs '*metadata.tsv' --dataset_ignore_globs validation_report.txt

/hive/users/hive/ingest-validation-tools/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.11) or chardet (4.0.0) doesn't match a supported version!
  RequestsDependencyWarning)
There are no references from any TSV to Proteomics.
Hint: If validation fails because of extra whitespace in the TSV, try:
src/cleanup_whitespace.py --tsv_in original.tsv --tsv_out clean.tsv.

I don't know whether there is an issue with the metadata, but it seems unlikely: this file was created by the data provider together with @cebriggs7135.

mccalluc commented 3 years ago

I would prefer that ongoing conversations be moved to Slack: I will respond more quickly there, and it's a better place for issues where the line between open and closed may be fuzzy.

cebriggs7135 commented 3 years ago

@mccalluc @icaoberg Moved discussion to Slack, per Chuck's request.

mccalluc commented 3 years ago

See Slack. Please do not reopen.