nextflow-io / nf-schema

Functionality for working with pipeline and sample sheet schema files in Nextflow pipelines
https://nextflow-io.github.io/nf-schema/
Apache License 2.0

Deprecation and suggested replacement(s) of `unique:` property is not a one-to-one functionality #60

Closed jfy133 closed 1 month ago

jfy133 commented 1 month ago

We noticed that the functionality provided by the removed `unique:` property is not actually replaced by the uniqueItems and uniqueEntries fields.

We can almost get around this by placing an allOf field at the top level of the schema and listing each column that should be independently validated with uniqueEntries, such as:

    "allOf": [
        { "uniqueEntries": ["id"] },
        { "uniqueEntries": ["fastq_dna"] },
        { "uniqueEntries": ["fastq_aa"] }
    ]
However, if _any_ of those columns violates uniqueness, the error message reports every listed column as non-unique in separate errors (and still refers to 'combinations'):

```
* --input (test_duplicate_id.csv): Validation of file failed:
    -> Entry 2: Detected non-unique combination of the following fields: [id]
    -> Entry 2: Detected non-unique combination of the following fields: [fastq_dna]
    -> Entry 2: Detected non-unique combination of the following fields: [fastq_aa]
    -> Value does not match against the schemas at indexes [0, 1, 2]
```

Example schema:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://raw.githubusercontent.com/nf-core/createtaxdb/master/assets/schema_input.json",
    "title": "nf-core/createtaxdb pipeline - params.input schema",
    "description": "Schema for the file provided with params.input",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": ["string", "integer"],
                "pattern": "^\\S+$",
                "errorMessage": "Sequence reference name must be provided and cannot contain spaces",
                "meta": ["id"]
            },
            "taxid": {
                "type": "integer",
                "errorMessage": "Please provide a valid taxonomic ID in integer format",
                "meta": ["taxid"]
            },
            "fasta_dna": {
                "anyOf": [
                    {
                        "type": "string",
                        "format": "file-path",
                        "pattern": "^\\S+\\.(fasta|fas|fa|fna)(\\.gz)?$"
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "uniqueItems": true,
                "exists": true,
                "format": "file-path"
            },
            "fasta_aa": {
                "anyOf": [
                    {
                        "type": "string",
                        "format": "file-path",
                        "pattern": "^\\S+\\.(fasta|fas|fa|faa)(\\.gz)?$"
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "uniqueItems": true,
                "exists": true,
                "format": "file-path"
            }
        },
        "required": ["id", "taxid"],
        "anyOf": [{ "required": ["fasta_dna"] }, { "required": ["fasta_aa"] }]
    },
    "allOf": [{ "uniqueEntries": ["id"] }, { "uniqueEntries": ["fastq_dna"] }, { "uniqueEntries": ["fastq_aa"] }]
}

Example broken csv:

id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Severe_acute_respiratory_syndrome_coronavirus_2,26970499,https://github.com/nf-core/test-datasets/blob/modules/data/genomics/prokaryotes/bacteroides_fragilis/genome/genome.fna.gz,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,

Here there is one duplicate in `id` and one duplicate in `fasta_aa`.
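To make the expected behaviour concrete, here is a rough Python sketch of what an independent per-column uniqueness check could report for a sheet shaped like the one above. This is not how nf-schema implements `uniqueEntries`. The function name `duplicated_columns`, the shortened sample values, and the choice to skip empty cells are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def duplicated_columns(csv_text, columns):
    """Hypothetical helper: map each column that contains repeated,
    non-empty values to {value: [1-based entry numbers]}."""
    seen = {col: defaultdict(list) for col in columns}
    reader = csv.DictReader(io.StringIO(csv_text))
    for entry, row in enumerate(reader, start=1):
        for col in columns:
            value = (row.get(col) or "").strip()
            if value:  # assumption: empty cells are not compared
                seen[col][value].append(entry)
    # Keep only the columns (and values) that actually repeat.
    return {
        col: {v: rows for v, rows in values.items() if len(rows) > 1}
        for col, values in seen.items()
        if any(len(rows) > 1 for rows in values.values())
    }


# Shortened stand-in for the broken CSV: duplicate `id` and duplicate `fasta_aa`.
sample = """id,taxid,fasta_dna,fasta_aa
virus_a,1,a.fasta,a.faa
virus_a,2,b.fna.gz,a.faa
bacterium_b,3,c.fna.gz,
"""

report = duplicated_columns(sample, ["id", "fasta_dna", "fasta_aa"])
for col, values in report.items():
    for value, entries in values.items():
        print(f"Entries {entries}: non-unique value in field [{col}]: {value}")
```

With this kind of check, only `id` and `fasta_aa` would be flagged, and `fasta_dna` would not appear in the report at all.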

nvnieuwk commented 1 month ago

Duplicate of #61