Nextflow DSL2 pipeline to generate a Genome Note, including assembly statistics, quality metrics, and Hi-C contact maps. This workflow is part of the Tree of Life production suite.
Currently, the pipeline orders the sequences in the contact map by decreasing size (FILTER_GENOME process). Whilst this is reasonable for many species, there are cases where it is not the best choice:
Some species have chromosome names that predate genome sequencing and don't follow the sizes in base pairs. This is the case for many primates, including human, chimpanzee, and gibbon.
Traditionally, sex chromosomes are placed after the autosomes.
Very large genomes like Meconema thalassinum have sequences too large to be in INSDC as one piece (the limit is 2^31-1 bp), so they are split into chromosome 1_1, chromosome 1_2, etc. On the map, these fragments have to be placed right next to each other, in the right order.
To overcome this, I propose reading the sequence report from NCBI datasets, which is already pulled by the pipeline and is ordered correctly, and copying its order into the contact map.
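To make the proposal concrete, here is a minimal sketch of the reordering step in Python. It is not the pipeline's actual code. It assumes the sequence report is available as JSON lines with a `genbankAccession` field per sequence, and that the sequence names in the assembly match those accessions; the function name `order_from_report` and the example accessions are hypothetical. Sequences missing from the report fall back to the current behaviour (decreasing size).

```python
import json


def order_from_report(report_lines, sizes):
    """Return sequence names in the order given by the sequence report.

    report_lines: iterable of JSON-lines strings, one record per sequence.
    sizes: dict mapping sequence name -> length in bp (e.g. from a .fai index).
    """
    ordered = []
    seen = set()
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field name; use whichever accession matches the assembly.
        name = record.get("genbankAccession")
        if name in sizes and name not in seen:
            ordered.append(name)
            seen.add(name)
    # Fallback: sequences absent from the report are sorted by decreasing size.
    leftovers = sorted((n for n in sizes if n not in seen), key=lambda n: -sizes[n])
    return ordered + leftovers


# Hypothetical example: chromosome 1 is split in two pieces, and X comes last.
report = [
    '{"chrName": "1_1", "genbankAccession": "SEQ_A"}',
    '{"chrName": "1_2", "genbankAccession": "SEQ_B"}',
    '{"chrName": "2", "genbankAccession": "SEQ_C"}',
    '{"chrName": "X", "genbankAccession": "SEQ_D"}',
]
sizes = {"SEQ_A": 900, "SEQ_B": 1200, "SEQ_C": 800, "SEQ_D": 1500, "SEQ_E": 50}
order = order_from_report(report, sizes)
```

With this input, sorting by size alone would put X first and separate the two pieces of chromosome 1, whereas the report order keeps 1_1 and 1_2 adjacent and X after the autosomes.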