Closed — alxndrdiaz closed this pull request 1 year ago.
This PR adds the GENERATE_CONFIG module, which is the first script run in the BLOBTOOLS subworkflow. The output of this module is exported to `results/blobtoolkit/GCA_922984935.2/config.yaml` (small test dataset):

```yaml
assembly:
  accession: GCA_922984935.2
  alias: mMelMel3.2 paternal haplotype
  bioproject: PRJEB49353
  biosample: SAMEA7524400
  file: ./GCA_922984935.2/assembly/GCA_922984935.2.fasta.gz
  level: chromosome
  prefix: CAKLPM02
  scaffold-count: 538
  span: 2738694574
  url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/922/984/935/GCA_922984935.2_mMelMel3.2_paternal_haplotype/GCA_922984935.2_mMelMel3.2_paternal_haplotype_genomic.fna.gz
busco:
  basal_lineages:
    - eukaryota_odb10
    - bacteria_odb10
    - archaea_odb10
  download_dir: ./busco
  lineages:
    - carnivora_odb10
    - laurasiatheria_odb10
    - eutheria_odb10
    - mammalia_odb10
    - tetrapoda_odb10
    - vertebrata_odb10
    - metazoa_odb10
    - eukaryota_odb10
    - bacteria_odb10
    - archaea_odb10
fields:
  categories:
    file: ./GCA_922984935.2/assembly/GCA_922984935.2.categories.tsv
  synonyms:
    file: ./GCA_922984935.2/assembly/GCA_922984935.2.synonyms.tsv
    prefix: insdc
reads:
  paired: []
  single: []
revision: 0
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
    - 0.1
    - 0.01
    - 100000
    - 1000000
  taxdump: ./taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: ./nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: buscogenes
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: ./uniprot
    taxrule: blastp=buscogenes
  diamond_blastx:
    name: reference_proteomes
    path: ./uniprot
taxon:
  class: Mammalia
  family: Mustelidae
  genus: Meles
  kingdom: Metazoa
  name: Meles meles
  order: Carnivora
  phylum: Chordata
  superkingdom: Eukaryota
  taxid: '9662'
version: 1
```
Updated the required documentation: `docs/output.md` and `README.md`. `docs/usage.md` and `CHANGELOG.md` were not modified, but I guess they should be before release.
I'm not comfortable with this PR. If you look at the script behind `blobtoolkit-pipeline generate-config`, it works this way:

1. It needs an accession number (`GCA_*`) as input:
   1. That means it only works with public assemblies, whereas this pipeline here _will_ be used routinely on draft assemblies for which there isn't such an accession yet.
   2. You're assuming the input fasta is named according to that accession number (that's why you had to remove `unmasked` and `subset`), which won't necessarily be the case.
2. The script looks up various databases (ENA, GoaT, etc.) using that accession number to find out information about that species, from which it then derives taxonomy information, the list of Busco lineages, other accession numbers (BioSamples, etc.), and other things. This is redundant with what the pipeline already does by itself. There is the risk that those two implementations (the Nextflow modules and `blobtoolkit-pipeline generate-config`) would output something different (esp. because the pipeline queries NCBI, which is never 100% identical to ENA).
3. The script outputs values for parameters, like e-values, window-stats parameters, etc., that will be ignored, because Nextflow takes its config from `nextflow.config`. In fact, they could even be different, and that would be really confusing for anyone looking at the output of the pipeline.

I can see a few solutions:

1. Run `blobtoolkit-pipeline generate-config`, modify the yaml file according to the values from `nextflow.config`, and use it in the pipeline (i.e. replacing goat-search, etc.).
2. Make the pipeline generate a yaml file itself from the Nextflow parameters and outputs (i.e. the list of lineages from goat-search). This would be done with a custom script, and would ideally output the smallest yaml file that's needed for the blobtools subworkflow.
3. Make the yaml file an input of the Nextflow pipeline, which replaces / overrides `nextflow.config`. Nextflow has a `-params-file` option that may work well for that?
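Solution 2 could be sketched with a small helper like the one below. This is only an illustration: the function name, its parameters, and the choice of which keys count as "minimal" are assumptions, not the actual requirements of the blobtools subworkflow.

```python
# Sketch of solution 2: a custom script that emits the smallest YAML the
# blobtools subworkflow might need, built from values the pipeline already
# has (Nextflow params, goat-search output). The key set shown here is a
# guess, not the confirmed mandatory set.

def minimal_config(assembly_fasta, taxon_name, taxid, lineages):
    """Return a minimal YAML config as a string (stdlib only, no PyYAML)."""
    lines = [
        "assembly:",
        f"  file: {assembly_fasta}",
        "taxon:",
        f"  name: {taxon_name}",
        f"  taxid: '{taxid}'",
        "busco:",
        "  lineages:",
    ]
    # One list entry per BUSCO lineage, e.g. from the goat-search module.
    lines += [f"    - {lineage}" for lineage in lineages]
    return "\n".join(lines) + "\n"

print(minimal_config(
    "./assembly.fasta.gz", "Meles meles", 9662,
    ["carnivora_odb10", "eukaryota_odb10"],
))
```

Emitting the YAML by hand keeps the script dependency-free; a real implementation would more likely build a dict and serialize it with PyYAML.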
It makes sense to make this config file generation more general so it can be used with draft assemblies. And maybe it would be better to generate it with a custom script, because otherwise you need to provide a YAML file with a lot of parameters and paths that are already generated by the pipeline (such as BUSCO lineages from goat-search). I still need to write the script and to check which mandatory values it should contain.
Instead of modifying the `blobtoolkit-pipeline generate-config` script or the output file from the existing Nextflow module, the following approach will be used: (1) either a GCA accession (to run the module) or a YAML file should be provided; (2) the blobtools subworkflow checks which of these inputs exists and throws an error if both are present. If a YAML file is provided, it should only contain the mandatory parameters and paths to run the pipeline.
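The input check described above could look like the following sketch. The function and parameter names are hypothetical (in the real pipeline this logic would live in the Nextflow workflow, not a Python script); only the either/or rule comes from the description above.

```python
# Sketch of the proposed input check: exactly one of an accession or a
# YAML config file must be provided. Names are illustrative only.

def select_config_source(accession=None, yaml_file=None):
    """Decide how the blobtools subworkflow obtains its config."""
    if accession and yaml_file:
        # Both inputs present: ambiguous, so fail early.
        raise ValueError("Provide either a GCA accession or a YAML file, not both")
    if not accession and not yaml_file:
        raise ValueError("Either a GCA accession or a YAML file is required")
    # Accession -> run the GENERATE_CONFIG module; YAML -> use it directly.
    return "generate_config" if accession else "use_yaml"

print(select_config_source(accession="GCA_922984935.2"))  # → generate_config
```

The same exclusive-or check is straightforward to express in Nextflow with an `error()` call during parameter validation.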
Description of the parameters YAML file / accession (required by the blobtools subworkflow): https://github.com/sanger-tol/blobtoolkit/blob/generate_config/docs/parameters.md#inputoutput-options
Not sure if `software_versions` and `stats_windows` have to be included, since they can be obtained from `versions.yml` and the Nextflow config file containing the module and subworkflow parameters.
@muffato Is the updated logic closer to the changes you requested? I wanted to get a partial review of the `generate_config` module before we add the two other modules which depend on it.
It looks fine to me. Thanks !
PR checklist

- [ ] Code lints (`nf-core lint`).
- [ ] Pipeline runs with the test profile (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] `docs/usage.md` is updated.
- [ ] `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).