loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Unify casing of config values across preprocessing and ingest #2933

Open fengelniederhammer opened 1 month ago

fengelniederhammer commented 1 month ago

TLDR

I accidentally misconfigured the ingest due to inconsistent casing of config values, and it took way too long to find out what was wrong.

What happened

I was setting up a Loculus instance with H5N1, i.e. a segmented organism. Some parts of the config repeat themselves, so I copied them. Turns out: it didn't work. My config looks something like this:

  h5n1:
    <<: *defaultOrganismConfig
    schema:
      <<: *schema
      organismName: "Influenza A/H5N1"
      nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
    preprocessing:
      - <<: *preprocessing
        configFile:
          <<: *preprocessingConfigFile
          log_level: DEBUG
          nextclade_dataset_name: "community/genspectrum/flu/h5n1"
          nextclade_dataset_server: "https://raw.githubusercontent.com/anna-parker/nextclade_data/h5n1/data_output"
          nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
          genes: ["PB2", "PB1", "PA", "PAX", "HA", "NA", "NP", "M1", "M2", "NS1", "NS2"]
    ingest:
      <<: *ingest
      configFile:
        taxon_id: 197911
        filter_fasta_headers: "(H5N1)"
        nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
        nextclade_dataset_server: "https://raw.githubusercontent.com/anna-parker/nextclade_data/h5n1/data_output"
        nextclade_dataset_name: "community/genspectrum/flu/h5n1"
    referenceGenomes:
      nucleotideSequences:
        - name: "seg1"
          sequence: "..."
        - ...
      genes: [...]

"Nucleotide sequences" has to be configured in four places, but ingest expects the key in snake case, whereas the other three require camel case.
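To illustrate how such a mix-up could be caught early, here is a minimal Python sketch that flags keys in a snake_case config section that are not written in snake case. The helper names (`to_snake_case`, `find_non_snake_case_keys`) are hypothetical and not part of Loculus, and the sample dict is a shortened copy of the ingest `configFile` above; the sketch also assumes that the snake_case form of the key is `nucleotide_sequences`.

```python
import re

# Matches keys such as "taxon_id" or "nextclade_dataset_name".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def to_snake_case(key: str) -> str:
    """Convert a camelCase key to its snake_case spelling."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def find_non_snake_case_keys(config: dict) -> dict[str, str]:
    """Map every top-level key that is not snake_case to a suggested spelling."""
    return {key: to_snake_case(key) for key in config if not SNAKE_CASE.match(key)}


# Shortened copy of the ingest configFile from the example above (assumption:
# ingest expects every key in snake case).
ingest_config = {
    "taxon_id": 197911,
    "filter_fasta_headers": "(H5N1)",
    "nucleotideSequences": ["seg1", "seg2"],
    "nextclade_dataset_name": "community/genspectrum/flu/h5n1",
}

suspicious = find_non_snake_case_keys(ingest_config)
for key, suggestion in suspicious.items():
    print(f"Unexpected key '{key}' in ingest config, did you mean '{suggestion}'?")
```

A check like this only helps for sections that are consistently snake_case; it would not replace proper schema validation of the config.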

This has several issues / possible improvements:

anna-parker commented 1 month ago

The whole config is a mess and needs to be redone

fengelniederhammer commented 1 month ago

I also just noticed that many (or all?) Snakemake steps already log their config, which is quite useful. The whole config is not logged, but maybe it's already enough as it is.
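For reference, logging the whole config once at startup could look roughly like the sketch below. This is not how Loculus does it; `format_config` and `log_effective_config` are hypothetical helpers, and the logger name is an assumption.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest")  # logger name chosen for illustration


def format_config(config: dict) -> str:
    """Render the config as sorted, indented JSON so diffs between runs are easy to read."""
    return json.dumps(config, indent=2, sort_keys=True, default=str)


def log_effective_config(config: dict) -> None:
    """Log the complete config once, e.g. right after it has been loaded."""
    logger.info("Effective config:\n%s", format_config(config))


log_effective_config({"taxon_id": 197911, "nucleotideSequences": ["seg1", "seg2"]})
```

Sorting the keys keeps the output stable across runs, which makes a misspelled or unexpected key easier to spot.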