pepkit / eido

Validator for PEP objects
http://eido.databio.org
BSD 2-Clause "Simplified" License

Check existence of files in subsample_table #26

Closed jma1991 closed 1 year ago

jma1991 commented 3 years ago

How can I write a schema which will validate the existence of files specified in the subsample table? I almost always specify read1 and read2 in the subsample table because I rarely have just a single pair of FASTQ files per sample (usually there are several, from multiple lanes). The schema below (adapted from the examples page) passes validation without any FASTQ files present, so I assume that when the read1 and read2 attributes are arrays, eido doesn't check for the existence of each item in the array?

description: Schema
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
        - read2
      files:
        - read1
        - read2
      required:
        - sample_name
        - read1
        - read2
required:
  - samples

nsheff commented 3 years ago

That's a good point. I am not sure we considered the case of the required files being in subsamples.

One tricky thing is, should it require them all or just any specified files to exist in that case?

@stolarczyk have you thought through this?

stolarczyk commented 3 years ago

Hey, thanks for reaching out!

How can I write a schema which will validate the existence of files specified in the subsample table?

From eido's perspective, it doesn't matter where the sample attributes come from. Eido processes a peppy.Project object that has already been altered by the modifiers and has the attributes from the subsample table added.
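
To make the point above concrete, here is a rough, standard-library-only sketch of how subsample rows could be folded into their parent samples before any validation runs. It is not peppy's implementation: `merge_subsamples` is a hypothetical helper, and whether a single subsample row yields a string or a one-element list in peppy is an assumption made here to match the `anyOf` (string or array) in the schema.

```python
import csv
import io
from collections import defaultdict


def merge_subsamples(samples, subsample_rows, key="sample_name"):
    """Fold subsample rows into their parent samples (illustrative only)."""
    # Collect every subsample value per sample and attribute, in row order.
    collected = defaultdict(lambda: defaultdict(list))
    for row in subsample_rows:
        for attr, value in row.items():
            if attr != key:
                collected[row[key]][attr].append(value)

    merged = []
    for sample in samples:
        sample = dict(sample)  # copy so the caller's data is left untouched
        for attr, values in collected.get(sample[key], {}).items():
            # Assumption: one value stays a string, several become a list.
            sample[attr] = values[0] if len(values) == 1 else values
        merged.append(sample)
    return merged


# Same tables as in the example that follows in this comment.
samples = list(csv.DictReader(io.StringIO(
    "sample_name,attr1,attr2\nsample1,val1,val2\nsample2,val1,val2\n")))
subsamples = list(csv.DictReader(io.StringIO(
    "sample_name,read1,read2\nsample1,read_file1,read_file2\n")))

merged = merge_subsamples(samples, subsamples)
for sample in merged:
    print(sample)
```

In this sketch sample1 gains read1 and read2 while sample2 gains neither, so a schema that lists read1 under `required` would reject sample2 regardless of whether any file exists on disk.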

passes validation without any FASTQ files present

I think I got the expected result with the set of files listed below. Is that similar to your setup?

pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv

sample_table.csv

sample_name,attr1,attr2
sample1,val1,val2
sample2,val1,val2

subsample_table.csv (the read1 and read2 attributes are missing for sample2)

sample_name,read1,read2
sample1,read_file1,read_file2

schema.yml (just copied yours)

description: Schema
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
        - read2
      files:
        - read1
        - read2
      required:
        - sample_name
        - read1
        - read2
required:
  - samples

stolarczyk commented 3 years ago

[mstolarczyk@MichalsMBP eido]: eido validate cfg.yml -s schema.yml --exclude-case

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/bin/eido", line 8, in <module>
    sys.exit(main())
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/cli.py", line 62, in main
    validate_project(p, args.schema, args.exclude_case)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/eido.py", line 114, in validate_project
    _validate_object(project_dict, _preprocess_schema(schema_dict), exclude_case)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/eido/eido.py", line 99, in _validate_object
    raise jsonschema.exceptions.ValidationError(e.message)
jsonschema.exceptions.ValidationError: 'read1' is a required property

nsheff commented 3 years ago

The point is to error if the attribute is present, but the file is missing. You've shown that it works if the attribute is missing.

We're talking here about the required_files functionality that eido adds to JSONschema.

edit: I think your example actually proves the point of where it's not working... This should also fail because read_file1 is not an existing file, but we've specified required_files: read1, etc.
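
As a minimal sketch of the behaviour being asked for, the check below reports every path that a `required_files` attribute names but that is not on disk, whether the attribute holds one path or a list of paths. This is not eido's code: `missing_required_files` is a hypothetical helper, the file names are made up, and it assumes the strict policy that every listed path must exist.

```python
import os
import tempfile


def missing_required_files(sample, required_files):
    """Return paths named by `required_files` attributes that do not exist."""
    missing = []
    for attr in required_files:
        value = sample.get(attr)
        if value is None:
            # An absent attribute is the business of `required`, not this check.
            continue
        # An attribute may hold one path or a list of paths (e.g. subsamples).
        paths = [value] if isinstance(value, str) else list(value)
        missing.extend(path for path in paths if not os.path.isfile(path))
    return missing


# Hypothetical data: one lane's file exists, the other lane's file does not.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "lane1_R1.fastq")
    open(present, "w").close()
    absent = os.path.join(tmp, "lane2_R1.fastq")

    sample = {"sample_name": "sample1", "read1": [present, absent]}
    result = missing_required_files(sample, ["read1", "read2"])
    print(result)
```

A looser policy (at least one path per attribute must exist) would only change the `extend` line, which is why the all-or-any question raised earlier in the thread matters.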

jma1991 commented 3 years ago

The point is to error if the attribute is present, but the file is missing. You've shown that it works if the attribute is missing.

We're talking here about the required_files functionality that eido adds to JSONschema.

Yes, exactly. I have created a reproducible example for further testing: https://www.dropbox.com/sh/jp6i915lnbgh1xu/AAD9Heo0mUR7KqVG1SHqSNaha?dl=0

stolarczyk commented 3 years ago

ok, right..

But are you sure Snakemake even uses the file existence checking feature of eido?

I've skimmed through this PR and I only see a call to eido.validate_project function, which only asserts the existence of the attribute. eido.validate_inputs is the one that uses the required_files schema attribute.

jma1991 commented 3 years ago

My version of eido (0.1.4) also reports no errors when I run it from the command line:

$ eido validate config.yaml -s schema.yaml --exclude-case
Validation successful

nsheff commented 3 years ago

I've skimmed through this PR and I only see a call to eido.validate_project function, which only asserts the existence of the attribute. eido.validate_inputs is the one that uses the required_files schema attribute.

Well, this might be something we need to update in Snakemake... but it won't matter if there's no way for eido to assert subsample files exist, right? So, there are two problems, one on each side.

stolarczyk commented 2 years ago

there's no way for eido to assert subsample files exist

That's not true. As I've mentioned above, by using the validate_inputs function it is possible to capture missing files that are sample attributes coming from a subsample_table.

The function behaves differently though, which was dictated by our use case in looper. Instead of raising an exception, it records missing files. Here's an example:

# p is the peppy.Project built from the PEP config; validate_inputs comes from eido
validate_inputs(sample=p.samples[0], schema="schema.yaml")

1 input files missing, job input size was not calculated accurately

Out[5]: 
{'missing': ['/Users/mstolarczyk/Desktop/testing/eido/file11A.txt'],
 'required_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
 'all_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
  '/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
 'input_file_size': 0.0}

So based on this output it is the responsibility of the client software to decide what to do in case one or more files are missing.
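
For instance, a client that wants a hard failure could wrap the returned dictionary as sketched below. Only the `missing` key is taken from the output shown above; `raise_if_missing`, the exception type, and the example path are hypothetical choices for illustration.

```python
def raise_if_missing(report):
    """Turn a missing-files report into an exception (client-side policy)."""
    missing = sorted(report.get("missing", []))
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} required input file(s) missing: " + ", ".join(missing)
        )
    return report


# A report shaped like the output above, with a made-up path.
report = {
    "missing": ["/path/to/file11A.txt"],
    "required_inputs": {"/path/to/file11A.txt", "/path/to/file12A.txt"},
    "all_inputs": {"/path/to/file11A.txt", "/path/to/file12A.txt"},
    "input_file_size": 0.0,
}

try:
    raise_if_missing(report)
except FileNotFoundError as err:
    print(f"Validation failed: {err}")
```

A client that prefers to warn and continue, as looper does according to the comment above, would log the `missing` list instead of raising.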


I used the same PEP and schema as above for this example except the subsample_table.csv, which now lists the actual paths:

sample_name,subsample_name,read1,read2
sample1,A,/Users/mstolarczyk/Desktop/testing/eido/file11A.txt,/Users/mstolarczyk/Desktop/testing/eido/file12A.txt
sample1,B,/Users/mstolarczyk/Desktop/testing/eido/file11B.txt,/Users/mstolarczyk/Desktop/testing/eido/file12B.txt
sample2,A,/Users/mstolarczyk/Desktop/testing/eido/file21A.txt,/Users/mstolarczyk/Desktop/testing/eido/file22A.txt
sample2,B,/Users/mstolarczyk/Desktop/testing/eido/file21B.txt,/Users/mstolarczyk/Desktop/testing/eido/file22B.txt

~ ls

file11B.txt         file12B.txt         file21B.txt         file22B.txt         schema.yaml
cfg.yaml            file12A.txt         file21A.txt         file22A.txt         sample_table.csv    subsample_table.csv

Note that file11A.txt is missing from the listing above.

nsheff commented 2 years ago

I don't quite understand -- You're saying it passes the validation, even when a file listed as required is not present, right? To me, that's clearly a problem. But your PR addresses this, then? So now if the subsample file doesn't exist it will not validate, correct?

Redmar-van-den-Berg commented 1 year ago

Is there any news on this? I run into the same problem using eido directly, where the validation passes even though the files are missing. This would be a really nice feature to have in eido itself.

nsheff commented 1 year ago

This should work on eido v0.2.0 to be released today.