sanger-tol / blobtoolkit

Nextflow pipeline for BlobToolKit, for the Sanger ToL production suite
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License

BLOBTOOLS subworkflow: GENERATE_CONFIG module #44

Closed · alxndrdiaz closed this 1 year ago

alxndrdiaz commented 1 year ago

PR checklist

alxndrdiaz commented 1 year ago

This PR adds the GENERATE_CONFIG module, which runs the first script in the BLOBTOOLS subworkflow. The output of this module is exported to results/blobtoolkit/GCA_922984935.2/config.yaml (small test dataset):

```yaml
assembly:
  accession: GCA_922984935.2
  alias: mMelMel3.2 paternal haplotype
  bioproject: PRJEB49353
  biosample: SAMEA7524400
  file: ./GCA_922984935.2/assembly/GCA_922984935.2.fasta.gz
  level: chromosome
  prefix: CAKLPM02
  scaffold-count: 538
  span: 2738694574
  url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/922/984/935/GCA_922984935.2_mMelMel3.2_paternal_haplotype/GCA_922984935.2_mMelMel3.2_paternal_haplotype_genomic.fna.gz
busco:
  basal_lineages:
  - eukaryota_odb10
  - bacteria_odb10
  - archaea_odb10
  download_dir: ./busco
  lineages:
  - carnivora_odb10
  - laurasiatheria_odb10
  - eutheria_odb10
  - mammalia_odb10
  - tetrapoda_odb10
  - vertebrata_odb10
  - metazoa_odb10
  - eukaryota_odb10
  - bacteria_odb10
  - archaea_odb10
fields:
  categories:
    file: ./GCA_922984935.2/assembly/GCA_922984935.2.categories.tsv
  synonyms:
    file: ./GCA_922984935.2/assembly/GCA_922984935.2.synonyms.tsv
    prefix: insdc
reads:
  paired: []
  single: []
revision: 0
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
  - 0.1
  - 0.01
  - 100000
  - 1000000
  taxdump: ./taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: ./nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: buscogenes
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: ./uniprot
    taxrule: blastp=buscogenes
  diamond_blastx:
    name: reference_proteomes
    path: ./uniprot
taxon:
  class: Mammalia
  family: Mustelidae
  genus: Meles
  kingdom: Metazoa
  name: Meles meles
  order: Carnivora
  phylum: Chordata
  superkingdom: Eukaryota
  taxid: '9662'
version: 1
```
alxndrdiaz commented 1 year ago

Updated the required documentation: docs/output.md and README.md. docs/usage.md and CHANGELOG.md were not modified, but I guess they should be updated before release.

muffato commented 1 year ago

I'm not comfortable with this PR. If you look at the script behind `blobtoolkit-pipeline generate-config`, it works this way:

1. It needs an accession number (`GCA_*`) as input:

   1. It means it only works with public assemblies, whereas this pipeline here _will_ be used routinely on draft assemblies for which there isn't such an accession yet.
   2. You're assuming the input fasta is named according to that accession number (that's why you had to remove `unmasked` and `subset`) which won't necessarily be the case.

2. The script looks up various databases (ENA, GoaT, etc.) using that accession number to find information about that species, from which it then derives taxonomy information, the list of Busco lineages, other accession numbers (BioSamples, etc.), and other things. This is redundant with what the pipeline already does by itself. There is a risk that those two implementations (the Nextflow modules and `blobtoolkit-pipeline generate-config`) would output something different (esp. because the pipeline queries NCBI, which is never 100% identical to ENA).

3. The script outputs values for parameters, such as e-values and window-stats parameters, that will be ignored, because Nextflow takes its config from `nextflow.config`. In fact, they could even be different, and that would be really confusing for anyone looking at the output of the pipeline.
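
For illustration, the clash described in point 3 could look like this (a sketch with hypothetical parameter names, not necessarily those used in this pipeline's `nextflow.config`):

```nextflow
// Hypothetical excerpt of nextflow.config: at runtime these values win,
// even if the generated config.yaml records different ones
// (e.g. the config.yaml above says evalue: 1.0e-10).
params {
    evalue      = 1.0e-25
    blast_chunk = 100000
}
```

Anyone comparing the published config.yaml against the actual run settings would then see two different e-values.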

I can see a few solutions:

1. Run `blobtoolkit-pipeline generate-config`, modify the yaml file according to the values from `nextflow.config`, and use it in the pipeline (i.e. replacing goat-search, etc).

2. Make the pipeline generate itself a yaml file from the Nextflow parameters and outputs (i.e. the list of lineages from goat-search). This would be done with a custom script, and would ideally output the smallest yaml file that's needed for the blobtools subworkflow.

3. Make the YAML file an input of the Nextflow pipeline, which replaces / overrides `nextflow.config`. Nextflow has a `-params-file` option that may work well for that?
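
For illustration, option 3 might look like the following (a sketch; the key names are hypothetical and would have to match the parameters declared in `nextflow.config`):

```yaml
# params.yaml (hypothetical) -- passed with:
#   nextflow run sanger-tol/blobtoolkit -params-file params.yaml
fasta: ./assembly.fasta.gz
taxon: Meles meles
blast_chunk: 100000
stats_chunk: 1000
```

Values given this way take precedence over the defaults in `nextflow.config`, so there would be a single source of truth for the run settings.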

alxndrdiaz commented 1 year ago

It makes sense to make this config file generation more general so it can be used with draft assemblies. And maybe it would be better to generate it with a custom script, because otherwise you need to provide a YAML file with a lot of parameters and paths that are already generated by the pipeline (such as Busco lineages from GoaT, etc.). I still need to write the script and to check which mandatory values it should contain.
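
As a starting point, such a minimal YAML might contain only something like this (a guess based on the full config above; the actual mandatory keys still have to be confirmed):

```yaml
assembly:
  accession: GCA_922984935.2   # or a draft-assembly identifier
  file: ./GCA_922984935.2/assembly/GCA_922984935.2.fasta.gz
busco:
  lineages:
  - carnivora_odb10
  - eukaryota_odb10
taxon:
  name: Meles meles
  taxid: '9662'
settings:
  taxdump: ./taxdump
```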

alxndrdiaz commented 1 year ago

Instead of modifying the `blobtoolkit-pipeline generate-config` script or the output file from the existing Nextflow module, the following approach will be used: (1) either a GCA accession (to run with the module) or a YAML file should be provided; (2) the blobtools subworkflow checks which of these inputs is present and throws an error if both are provided. If a YAML file is given, it should only contain the mandatory parameters and paths needed to run the pipeline.
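
A minimal sketch of that check, assuming the two inputs are exposed as `--accession` and `--yaml`:

```nextflow
// Hypothetical guard in the pipeline entry point: require exactly one input.
if (params.accession && params.yaml) {
    exit 1, 'Provide either --accession or --yaml, not both'
}
if (!params.accession && !params.yaml) {
    exit 1, 'One of --accession or --yaml is required'
}
```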

alxndrdiaz commented 1 year ago

Description of the parameters YAML file / accession (required by the blobtools subworkflow): https://github.com/sanger-tol/blobtoolkit/blob/generate_config/docs/parameters.md#inputoutput-options. Not sure if software_versions and stats_windows have to be included, since they can be obtained from versions.yml and the Nextflow config file containing module and subworkflow parameters.

priyanka-surana commented 1 year ago

@muffato Is the updated logic closer to the changes you requested? I wanted to get a partial review of the generate_config module before we add the two other modules that depend on it.

muffato commented 1 year ago

It looks fine to me. Thanks!