Unify casing of config values across preprocessing and ingest

fengelniederhammer commented 1 month ago

TLDR

I accidentally misconfigured the ingest due to inconsistent casing of config values, and it took way too long to find out what was wrong.

What happened

I was setting up a Loculus instance with H5N1, i.e. a segmented organism. Some parts of the config repeat themselves, so I copied them. Turns out: it didn't work. My config looks something like this:

  h5n1:
    <<: *defaultOrganismConfig
    schema:
      <<: *schema
      organismName: "Influenza A/H5N1"
      nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
    preprocessing:
      - <<: *preprocessing
        configFile:
          <<: *preprocessingConfigFile
          log_level: DEBUG
          nextclade_dataset_name: "community/genspectrum/flu/h5n1"
          nextclade_dataset_server: "https://raw.githubusercontent.com/anna-parker/nextclade_data/h5n1/data_output"
          nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
          genes: ["PB2", "PB1", "PA", "PAX", "HA", "NA", "NP", "M1", "M2", "NS1", "NS2"]
    ingest:
      <<: *ingest
      configFile:
        taxon_id: 197911
        filter_fasta_headers: "(H5N1)"
        nucleotideSequences: ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6", "seg7", "seg8"]
        nextclade_dataset_server: "https://raw.githubusercontent.com/anna-parker/nextclade_data/h5n1/data_output"
        nextclade_dataset_name: "community/genspectrum/flu/h5n1"
    referenceGenomes:
      nucleotideSequences:
        - name: "seg1"
          sequence: "..."
        - ...
      genes: [...]

"Nucleotide sequences" have to be configured in 4 places, but ingest needs it in snake case whereas the others require camel case.

This has several issues / possible improvements:

Casing is inconsistent:
- Some config values are camel case (organismName), some are snake case (nextclade_dataset_name)
- It's even inconsistent within a single config key (nucleotide_sequences vs nucleotideSequences)
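To illustrate the mismatch, here is a minimal sketch of how keys could be normalized to one casing before a pipeline reads them. The helper names `camel_to_snake` and `normalize_keys` are hypothetical and not part of Loculus; this only shows the idea, not how the pipelines actually parse their config.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as 'nucleotideSequences' to snake_case.

    Keys that are already snake_case are returned unchanged.
    """
    # Insert an underscore before every uppercase letter that follows
    # a lowercase letter or digit, then lowercase the whole key.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_keys(config: dict) -> dict:
    """Return a copy of config with every top-level key in snake_case.

    If both casings of a key are present, the one that comes later wins.
    """
    return {camel_to_snake(key): value for key, value in config.items()}
```

With something like this, `normalize_keys({"nucleotideSequences": ["seg1"]})` would yield a `nucleotide_sequences` key, so either casing would reach the ingest under the same name.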
The ingest pipeline doesn't report that I misconfigured it. Nucleotide sequences are necessary, but it has a default set (nucleotide_sequences: ["main"]). I had to read the code to see that the resulting config is written to a file, and debug the running pod to check the file which was
```
[...]
nucleotideSequences:
- seg1
- seg2
- seg3
- seg4
- seg5
- seg6
- seg7
- seg8
nucleotide_sequences:
- main
[...]
segmented: false
[...]
```
Possible improvements on this end:
- It would be good to throw errors or at least write warnings to the log when encountering unknown config values.
- It would be good to log the resulting config. I think it's relevant enough for maintainers to see it. Maybe DEBUG log, so that it doesn't clutter the log once the instance is running?
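Both suggestions could look roughly like the sketch below. It assumes the ingest merges user values over a set of defaults; `DEFAULTS`, `KNOWN_KEYS`, and `merge_config` are hypothetical names, and the list of known keys is only taken from the config excerpt above, not from the actual ingest code.

```python
import logging

logger = logging.getLogger("ingest_config")

# Assumed defaults: only these two values are mentioned in this issue.
DEFAULTS = {"nucleotide_sequences": ["main"], "segmented": False}

# Assumed set of accepted keys, taken from the config excerpt in this issue.
KNOWN_KEYS = set(DEFAULTS) | {
    "taxon_id",
    "filter_fasta_headers",
    "nextclade_dataset_server",
    "nextclade_dataset_name",
}


def merge_config(user_config: dict, strict: bool = False) -> dict:
    """Merge user values over the defaults and report unknown keys."""
    unknown = sorted(set(user_config) - KNOWN_KEYS)
    if unknown:
        message = f"Unknown config keys: {unknown}"
        if strict:
            raise ValueError(message)
        logger.warning(message)
    # User values override defaults; unknown keys are kept, as they are today.
    merged = {**DEFAULTS, **user_config}
    # DEBUG level keeps the log uncluttered once the instance is running.
    logger.debug("Resulting config: %s", merged)
    return merged
```

With the misconfigured key from above, `merge_config({"nucleotideSequences": [...]})` would log a warning naming `nucleotideSequences`, and `strict=True` would turn that into an error instead of silently falling back to `["main"]`.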
There is quite a bit of repetition. In principle one could generate the nucleotideSequences from the reference genomes (which must be provided anyway). It should always be in sync anyway. I can't think of a use case where the preprocessing pipeline should have a different reference genome than SILO or the ingest.
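As a rough sketch of that idea, the segment names could be read straight from the reference genomes. This assumes the parsed `referenceGenomes` section has the shape shown in the config above (a `nucleotideSequences` list of entries with a `name`); `segment_names` is a hypothetical helper, not existing Loculus code.

```python
def segment_names(reference_genomes: dict) -> list[str]:
    """Return the segment names in the order the reference genomes list them."""
    return [entry["name"] for entry in reference_genomes["nucleotideSequences"]]


# Example input mirroring the referenceGenomes section of the config;
# the sequences are placeholders.
reference_genomes = {
    "nucleotideSequences": [
        {"name": "seg1", "sequence": "ACGT"},
        {"name": "seg2", "sequence": "ACGT"},
    ],
}
names = segment_names(reference_genomes)
```

The resulting list could then be passed to preprocessing, ingest, and the schema, so that the names only need to be written once.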

anna-parker commented 1 month ago

The whole config is a mess and needs to be redone

fengelniederhammer commented 1 month ago

I also just noticed that many (or all?) Snakemake steps already log their config, which is quite useful. The whole config is not logged, but maybe it's already enough as it is.
