Closed jma1991 closed 1 year ago
That's a good point. I am not sure we considered the case of the required files being in subsamples.
One tricky thing is, should it require them all or just any specified files to exist in that case?
@stolarczyk have you thought through this?
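The two policies in question, requiring every listed file versus requiring at least one, can be stated concretely. The sketch below is illustrative only: `files_ok` and its `policy` parameter are hypothetical names, not part of eido.

```python
import os


def files_ok(paths, policy="all"):
    """Check a list of file paths under one of two policies.

    policy="all": every path must be an existing file.
    policy="any": at least one path must be an existing file.
    Hypothetical helper for illustration; not eido's API.
    """
    exists = [os.path.isfile(p) for p in paths]
    if policy == "all":
        return all(exists)
    if policy == "any":
        return any(exists)
    raise ValueError("policy must be 'all' or 'any'")
```

Note that an empty list passes under "all" and fails under "any", which is one more reason the choice of policy has to be explicit.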
Hey, thanks for reaching out!
How can I write a schema which will validate the existence of files specified in the subsample table?
From eido's perspective, it doesn't matter where the sample attributes come from. Eido processes a peppy.Project
object that has already been altered by the modifiers and has the attributes from the subsample table added.
passes validation without any FASTQ files present
I think I got the expected result with the set of files listed below. Is that similar to your setup?
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
sample_table.csv
sample_name,attr1,attr2
sample1,val1,val2
sample2,val1,val2
subsample_table.csv (the read1 and read2 attributes are missing for one of the samples, sample2)
sample_name,read1,read2
sample1,read_file1,read_file2
schema.yml (just copied yours)
description: Schema
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
        - read2
      files:
        - read1
        - read2
      required:
        - sample_name
        - read1
        - read2
required:
  - samples
[mstolarczyk@MichalsMBP eido]: eido validate cfg.yml -s schema.yml --exclude-case
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/bin/eido", line 8, in <module>
    sys.exit(main())
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/cli.py", line 62, in main
    validate_project(p, args.schema, args.exclude_case)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/eido.py", line 114, in validate_project
    _validate_object(project_dict, _preprocess_schema(schema_dict), exclude_case)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/eido.py", line 99, in _validate_object
    raise jsonschema.exceptions.ValidationError(e.message)
jsonschema.exceptions.ValidationError: 'read1' is a required property
The point is to error if the attribute is present, but the file is missing. You've shown that it works if the attribute is missing.
We're talking here about the required_files functionality that eido adds to JSONschema.
edit: I think your example actually proves the point of where it's not working... This should also fail because read_file1 is not an existing file, but we've specified required_files: read1, etc.
The point is to error if the attribute is present, but the file is missing. You've shown that it works if the attribute is missing.
We're talking here about the required_files functionality that eido adds to JSONschema.
Yes, exactly. I have created a reproducible example for further testing: https://www.dropbox.com/sh/jp6i915lnbgh1xu/AAD9Heo0mUR7KqVG1SHqSNaha?dl=0
ok, right..
But are you sure Snakemake even uses the file existence checking feature of eido?
I've skimmed through this PR and I only see a call to the eido.validate_project function, which only asserts the existence of the attribute. eido.validate_inputs is the one that uses the required_files schema attribute.
My version of eido (0.1.4) reports no errors if I run through the command line either:
$ eido validate config.yaml -s schema.yaml --exclude-case
Validation successful
I've skimmed through this PR and I only see a call to eido.validate_project function, which only asserts the existence of the attribute. eido.validate_inputs is the one that uses the required_files schema attribute.
Well, this might be something we need to update in Snakemake... but it won't matter if there's no way for eido to assert subsample files exist, right? So, there are two problems, one on each side.
there's no way for eido to assert subsample files exist
That's not true. As I've mentioned above, by using the validate_inputs function it is possible to capture missing files that are sample attributes coming from a subsample_table.
The function behaves differently though, which was dictated by our use case in looper. Instead of raising an exception, it records missing files. Here's an example:
validate_inputs(sample=p.samples[0], schema="schema.yaml")
1 input files missing, job input size was not calculated accurately
Out[5]:
{'missing': ['/Users/mstolarczyk/Desktop/testing/eido/file11A.txt'],
'required_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'all_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'input_file_size': 0.0}
So based on this output it is the responsibility of the client software to decide what to do in case one or more files are missing.
I used the same PEP and schema as above for this example except the subsample_table.csv, which now lists the actual paths:
sample_name,subsample_name,read1,read2
sample1,A,/Users/mstolarczyk/Desktop/testing/eido/file11A.txt,/Users/mstolarczyk/Desktop/testing/eido/file12A.txt
sample1,B,/Users/mstolarczyk/Desktop/testing/eido/file11B.txt,/Users/mstolarczyk/Desktop/testing/eido/file12B.txt
sample2,A,/Users/mstolarczyk/Desktop/testing/eido/file21A.txt,/Users/mstolarczyk/Desktop/testing/eido/file22A.txt
sample2,B,/Users/mstolarczyk/Desktop/testing/eido/file21B.txt,/Users/mstolarczyk/Desktop/testing/eido/file22B.txt
~ ls
file11B.txt file12B.txt file21B.txt file22B.txt schema.yaml
cfg.yaml file12A.txt file21A.txt file22A.txt sample_table.csv subsample_table.csv
Note that file11A.txt is missing from the listing above.
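Since validate_inputs records missing files instead of raising, a client that wants a hard failure has to add that step itself. The following is a minimal sketch, assuming only the dictionary shape shown in the output above (a `missing` list alongside `required_inputs`, `all_inputs`, and `input_file_size`). The helper name `fail_on_missing` is hypothetical and not part of eido, and the paths are shortened for readability.

```python
def fail_on_missing(report):
    """Raise if a validate_inputs-style report lists missing files.

    `report` is assumed to be a dict with a 'missing' list, matching the
    output shown above. Hypothetical helper; not part of eido.
    """
    missing = sorted(report.get("missing", []))
    if missing:
        raise FileNotFoundError(
            "{} required input file(s) missing: {}".format(
                len(missing), ", ".join(missing)
            )
        )
    return report


# Example report with the same keys as the output above (paths shortened).
inputs = {"file11A.txt", "file11B.txt", "file12A.txt", "file12B.txt"}
report = {
    "missing": ["file11A.txt"],
    "required_inputs": inputs,
    "all_inputs": inputs,
    "input_file_size": 0.0,
}

try:
    fail_on_missing(report)
except FileNotFoundError as err:
    message = str(err)  # the client decides whether to log, skip, or abort
```

Keeping the raise in the client rather than in the report function lets a tool like looper skip one sample and continue, while a stricter client can abort on the first missing file.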
I don't quite understand -- You're saying it passes the validation, even when a file listed as required is not present, right? To me, that's clearly a problem. But your PR addresses this, then? So now if the subsample file doesn't exist it will not validate, correct?
Is there any news on this? I ran into the same problem using eido directly, where validation passes even though the files are missing. This would be a really nice feature to have in eido itself.
This should work on eido v0.2.0 to be released today.
How can I write a schema which will validate the existence of files specified in the subsample table? I almost always specify read1 and read2 in the subsample table because I rarely have just a single pair of FASTQ files per sample (they usually come from multiple lanes). The schema below (adapted from the examples page) passes validation without any FASTQ files present, so I assume that when the read1 and read2 attributes are arrays, it doesn't check for the existence of each item in the array?
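To make the behavior being asked about concrete, a per-item existence check over attributes that may be either a single path or an array of paths could look like the sketch below. It is not eido's implementation: `missing_files` is a hypothetical helper, and a plain dict stands in for a peppy sample.

```python
import os


def missing_files(sample, attrs):
    """Return the paths named by `attrs` in `sample` that are not files.

    Each attribute may hold one path (str) or several (list), as happens
    when read1 and read2 come from multiple subsample rows.
    Hypothetical helper; `sample` is a plain dict used for illustration.
    """
    missing = []
    for attr in attrs:
        value = sample.get(attr)
        if value is None:
            continue  # absent attributes are handled by `required`, not here
        paths = [value] if isinstance(value, str) else list(value)
        missing.extend(p for p in paths if not os.path.isfile(p))
    return missing
```

With a check of this kind, a sample whose read1 is a list of two lanes is reported as incomplete if either lane file is absent, which is the behavior the question expects from required_files.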