sanger-tol / genomenote

Nextflow DSL2 pipeline to generate a Genome Note, including assembly statistics, quality metrics, and Hi-C contact maps. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/genomenote
MIT License

Allow running of metadata subworkflow on multiple specimen IDs #114

Closed BethYates closed 4 months ago

BethYates commented 5 months ago

Description of feature

A genome note provides metadata related to the specimen used to produce the genome assembly, the specimen used to generate HiC data, and the specimen used to produce RNA-Seq data. These may all be different specimens. The genome note pipeline should be able to take in each of these IDs and run the metadata subworkflow on each, recording the relevant data for use in the publication.

BethYates commented 5 months ago

The genome_metadata subworkflow will be introduced in version 2.0 of the genome note pipeline and is currently only present on the public_dev branch of the repository. To work on this issue you will need to create a feature branch from the public_dev branch rather than the dev branch. Pushing development for the 2.0 release to the public_dev branch allows us to keep the dev branch clean in case we need to push some bug fixes from there to the main release branch.

BethYates commented 4 months ago

To close this issue:

  1. Rename the biosample parameter to biosample_wgs and add two additional parameters, biosample_hic and biosample_rna, to nextflow.config. The value of both new parameters should be set to null.
  2. Update test.config and test_full.config to contain values for the parameters that you have added or changed. For the test profile, use biosample_hic="SAMEA7520846" and biosample_rna="SAMEA7521081"; for the test_full profile, use biosample_hic="SAMEA7519968" and biosample_rna=null.
  3. Modify genome_metadata.nf so that all of the files in ch_file_list that contain a "BIOSAMPLE_ACCESSION" are added to the file_list channel once for each of the biosample parameters. In some cases (as in the test_full profile) biosample_rna will be null and should be ignored; the code needs to handle this.
  4. Modify the metadata in genome_metadata.nf to include a biosample_type. Its value should be "WGS", "HIC" or "RNA", or "" if the file is not related to a biosample.
  5. Modify run_wget.nf to include the biosample_type in the output file name where the biosample_type is not an empty string.
  6. Modify parse_metadata.nf to include the biosample_type in the output file name where the biosample_type is not an empty string.
  7. Modify parse_xml_ena_biosample.py to extract the biosample_type from the output file name passed to the script. For the HiC and RNA-Seq biosample accessions, use this biosample_type to prefix the parameter names written to the output file (e.g. for biosample_hic, IDENTIFIER would become HIC_IDENTIFIER; for biosample_rna, SPECIMEN_ID would become RNA_SPECIMEN_ID).
  8. Update docs/usage.md and nextflow_schema.json to include the new/renamed parameters.
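
As a rough illustration of steps 3 to 7, the naming and prefixing logic could be sketched in Python as below. This is only a sketch under assumptions, not the pipeline's actual code: the helper names (`active_biosamples`, `output_name`, `biosample_type_from_name`, `prefix_param`), the `ena_biosample` file prefix, the `.csv` extension and the `<prefix>_<TYPE>.<ext>` file name convention are all hypothetical. Only the type values ("WGS", "HIC", "RNA", "") and the prefixed parameter names come from the list above.

```python
import re
from typing import List, Optional, Tuple


def active_biosamples(
    wgs: Optional[str], hic: Optional[str], rna: Optional[str]
) -> List[Tuple[str, str]]:
    """Pair each biosample accession with its type, skipping null values (step 3)."""
    pairs = [("WGS", wgs), ("HIC", hic), ("RNA", rna)]
    return [(btype, acc) for btype, acc in pairs if acc is not None]


def output_name(prefix: str, biosample_type: str, ext: str = "csv") -> str:
    """Add the biosample type to the file name only when it is non-empty (steps 5 and 6).

    The "<prefix>_<TYPE>.<ext>" layout is an assumed convention.
    """
    parts = [prefix] + ([biosample_type] if biosample_type else [])
    return "_".join(parts) + "." + ext


def biosample_type_from_name(file_name: str) -> str:
    """Recover the biosample type from a file name, or "" if none is present (step 7)."""
    match = re.search(r"_(WGS|HIC|RNA)\.[^.]+$", file_name)
    return match.group(1) if match else ""


def prefix_param(name: str, biosample_type: str) -> str:
    """Prefix parameter names for HiC and RNA biosamples; leave all others unchanged."""
    if biosample_type in ("HIC", "RNA"):
        return f"{biosample_type}_{name}"
    return name


if __name__ == "__main__":
    # The WGS accession below is a placeholder; the HiC value is from the test_full profile.
    for btype, acc in active_biosamples("SAMEA_PLACEHOLDER", "SAMEA7519968", None):
        fname = output_name("ena_biosample", btype)
        recovered = biosample_type_from_name(fname)
        print(acc, fname, prefix_param("IDENTIFIER", recovered))
```

Keeping WGS parameter names unprefixed follows the wording of step 7, which only asks for prefixes on the HiC and RNA-Seq accessions.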