pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

`peppy.Project.to_dict()` creates unwanted NaN values #417

Closed by rafalstepien 1 year ago

rafalstepien commented 1 year ago

I ran `eido validate --st-index sample -f samplesheet.csv -s samplesheet_schema.yaml -e` with the following schema:

description: A schema for validation of samplesheet.csv for taxprofiler pipeline.
imports:
  - https://schema.databio.org/pep/2.1.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample:
          type: string
          description: "Sample identifier."
          pattern: "^\\S*$"
        run_accession:
          type: string
          description: "Run accession number."
        instrument_platform:
          type: string
          description: "Name of the platform that sequenced the samples."
          enum: ["ABI_SOLID", "BGISEQ", "CAPILLARY", "COMPLETE_GENOMICS", "DNBSEQ", "HELICOS", "ILLUMINA", "ION_TORRENT", "LS454", "OXFORD_NANOPORE", "PACBIO_SMRT"]
        fastq1:
          type: string
          description: "FASTQ file for read 1."
          pattern: "^[\\S]+.(fq\\.gz|fastq\\.gz)$"
        fastq2:
          type: string
          description: "FASTQ file for read 2."
          pattern: "^[\\S]+.(fq\\.gz|fastq\\.gz)$"
        fasta:
          type: string
          description: "Path to FASTA file."
          pattern: "^[\\S]+.(fa\\.gz|fasta\\.gz)$"
      required:
        - sample
        - run_accession
        - instrument_platform
      files:
        - fastq1
        - fastq2
        - fasta
required:
  - samples

For the attached samplesheet (samplesheet.csv), `peppy.Project.to_dict()` then creates the following structure:

{
    'pep_version': '2.1.0',
    '_samples': [
        {
            'sample': '2611',
            'instrument_platform': 'ILLUMINA',
            'run_accession': 'ERR5766174',
            'fastq_1': 'NaN',
            'fastq_2': 'NaN',
            'fasta': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fasta/ERX5474930_ERR5766174_1.fa.gz'
        },
        {
            'sample': '2614',
            'instrument_platform': 'ILLUMINA',
            'run_accession': 'ERR5766176',
            'fastq_1': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_1.fastq.gz',
            'fastq_2': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_2.fastq.gz',
            'fasta': 'NaN'
        },
        {
            'sample': '2615',
            'instrument_platform': 'ILLUMINA',
            'run_accession': 'ERR5766176_B',
            'fastq_1': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_B_1.fastq.gz',
            'fastq_2': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474932_ERR5766176_B_2.fastq.gz',
            'fasta': 'NaN'
        },
        {
            'sample': '2612',
            'instrument_platform': 'ILLUMINA',
            'run_accession': 'ERR5766180',
            'fastq_1': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474936_ERR5766180_1.fastq.gz',
            'fastq_2': 'NaN',
            'fasta': 'NaN'
        },
        {
            'sample': '2613',
            'instrument_platform': 'ILLUMINA',
            'run_accession': 'ERR5766181',
            'fastq_1': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474937_ERR5766181_1.fastq.gz',
            'fastq_2': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERX5474937_ERR5766181_2.fastq.gz',
            'fasta': 'NaN'
        },
        {
            'sample': 'ERR3201952',
            'instrument_platform': 'OXFORD_NANOPORE',
            'run_accession': 'ERR3201952',
            'fastq_1': 'https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fastq/ERR3201952.fastq.gz',
            'fastq_2': 'NaN',
            'fasta': 'NaN'
        }
    ]
}

This is not correct: we don't want NaN values in our data structures.
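To illustrate the problem and one possible workaround, here is a minimal sketch in plain Python. The `drop_nan` helper is hypothetical and is not part of peppy or eido; it post-processes a dict shaped like the one above and replaces NaN placeholders with `None`. Whether `None` or omitting the key entirely is the right replacement is an assumption left open here. The last lines check the schema's fastq pattern against the string 'NaN' using Python's `re` module, which may differ in details from the regex engine used during validation.

```python
import math
import re


def drop_nan(value):
    """Hypothetical helper (not a peppy/eido API).

    Recursively returns a copy of `value` in which float NaN and the
    string 'NaN' are replaced with None.
    """
    if isinstance(value, dict):
        return {key: drop_nan(item) for key, item in value.items()}
    if isinstance(value, list):
        return [drop_nan(item) for item in value]
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value == "NaN":
        return None
    return value


# Trimmed copy of the structure reported in this issue (first sample only).
project_dict = {
    "pep_version": "2.1.0",
    "_samples": [
        {
            "sample": "2611",
            "instrument_platform": "ILLUMINA",
            "run_accession": "ERR5766174",
            "fastq_1": "NaN",
            "fastq_2": "NaN",
            "fasta": "https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/data/fasta/ERX5474930_ERR5766174_1.fa.gz",
        }
    ],
}

cleaned = drop_nan(project_dict)

# The fastq pattern from the schema, written as a Python raw string.
FASTQ_PATTERN = re.compile(r"^[\S]+.(fq\.gz|fastq\.gz)$")

# The literal string 'NaN' does not satisfy the pattern.
nan_matches = FASTQ_PATTERN.match("NaN") is not None
print(cleaned["_samples"][0]["fastq_1"], nan_matches)
```

The helper builds a new structure instead of mutating its input, so the original dict stays available for comparison.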