modularize references

Currently, it's annoying to have to dig through the references: section of the config, especially if you only want one or a handful of genomes.

I propose an "include" mechanism, where each genome's references config is stored and maintained in its own file, and those files are then included into the main config.

So you would maintain a file like this:

# include/reference_configs/Drosophila_melanogaster.yaml
fly:
  default:
    fasta: https://url.gz
    indexes:
      - bowtie2
...

and another one:

# include/reference_configs/Homo_sapiens.yaml
human:
  gencode-v25:
    fasta: https://url.gz
    indexes:
      - bowtie2
...

And then include them both like this:

# workflows/chipseq/config/config.yaml
include_references:
    - '../../include/reference_configs/Drosophila_melanogaster.yaml'
    - '../../include/reference_configs/Homo_sapiens.yaml'

This would be in addition to the existing mechanism: the contents of the included files would be merged into the references: dict, and an error would be raised if a key from an included file already exists in the main config file.
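As a rough sketch of what that merge step could look like (not existing lcdb-wf code): the function names `resolve_references` and `load_yaml`, the `base_dir` argument, and the use of PyYAML are all assumptions for illustration. Which directory the relative include paths are resolved against is left to the caller, and raising on a clash between two included files (not only against the main config) is also an assumption.

```python
import os


def load_yaml(path):
    # Hypothetical default loader. PyYAML is imported lazily so the merge
    # logic below can be exercised without it.
    import yaml
    with open(path) as fh:
        return yaml.safe_load(fh) or {}


def resolve_references(config, base_dir, load=load_yaml):
    """Return config['references'] updated with every included file.

    Raises ValueError if an included top-level key (e.g. 'fly', 'human')
    is already present, either from the main config or from an earlier
    include.
    """
    references = dict(config.get("references") or {})
    for rel_path in config.get("include_references") or []:
        path = os.path.normpath(os.path.join(base_dir, rel_path))
        for key, value in load(path).items():
            if key in references:
                raise ValueError(
                    "reference %r from %s is already defined" % (key, path)
                )
            references[key] = value
    return references
```

Commenting out a line under include_references simply drops that genome from the returned dict, which is the toggling behaviour described in this issue.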

The major advantage of this is easier maintenance of each species' references data and easier toggling (i.e., commenting out single lines) of what genomes you want included.

lcdb / lcdb-wf #122