Closed subwaystation closed 2 years ago
I think it sounds reasonable to add all these as test data for nf-core/modules, as long as we keep the file sizes as small as possible.
I think only wfmash is able to read zipped files. Else we need to leave these uncompressed.
Sounds good, I think we might want to save dealing with compressed files until later.
All test data for all modules is present.
Is your feature request related to a problem? Please describe
We don't have test data available in the https://github.com/nf-core/test-datasets repository for most of the modules (https://github.com/nf-core/pangenome/issues/70) that are required for this pipeline. I am also carrying over some information from @heuermh (https://github.com/nf-core/pangenome/issues/70#issuecomment-988995417). That said, we do have test data to run the pipeline as a whole: https://github.com/nf-core/test-datasets/tree/pangenome.
Quoting @heuermh
The GFA mentioned above is an assembly graph. It does not make sense to use this GFA as test data, because we need an alignment graph in a GFA that follows the variation graph model. In the assembly GFA the paths and sequences are not related to each other, while in a variation graph GFA they are. Otherwise, the downstream testing does not make sense.
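To illustrate what "paths and sequences are related" means in practice, here is a minimal sketch that spells out each path of a GFA by concatenating the segments it walks. It assumes GFA v1 with tab-separated S and P lines and a blunt-ended graph (no overlaps between consecutive segments); the function names are hypothetical and not part of odgi, vg, or any other tool mentioned here. In a variation graph GFA, the spelled sequences should correspond to the embedded genomes.

```python
# Complement table for steps that traverse a segment in reverse ('-').
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spell_paths(gfa_lines):
    """Map each path name to the sequence obtained by walking its segments.

    Assumes GFA v1 and zero overlap between consecutive segments.
    Raises KeyError if a path references a segment that is not defined.
    """
    segments = {}
    paths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "P" and len(fields) >= 3:
            paths[fields[1]] = fields[2].split(",")

    spelled = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            seg_id, orientation = step[:-1], step[-1]
            seq = segments[seg_id]
            parts.append(seq if orientation == "+" else reverse_complement(seq))
        spelled[name] = "".join(parts)
    return spelled


# Toy graph (made up for illustration): two genomes sharing segments 1 and 3.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tGA",
    "P\tgenomeA\t1+,2+,3+\t*",
    "P\tgenomeB\t1+,3+\t*",
]
print(spell_paths(toy_gfa))  # {'genomeA': 'ACGTGA', 'genomeB': 'ACGGA'}
```

A GFA with no P lines yields an empty result, and a path pointing at undefined segments fails loudly, so neither would pass as test data for the downstream modules under this check.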
Describe the solution you'd like
I think we need a new key pangenome in the list of possible data set types. I would suggest test-datasets/data/genomics/homo_sapiens/pangenome. As the files only need to allow a smoke-test execution of each module, I would suggest the very light human HLA genes data set https://github.com/pangenome/pggb/blob/master/data/HLA/V-352962.fa.gz as test data. A list of possible files could be:
- pangenome.fa: A FASTA file which contains several related genomes for the input of wfmash.
- pangenome.paf: A PAF file which contains the pairwise alignments of related genomes. Ideally, it was generated by wfmash. It is consumed by seqwish.
- pangenome.seqwish.gfa: A GFA file which contains the pangenome graph induced by seqwish, encoded in the variation graph model. It is consumed by smoothxg, odgi sort, odgi build, and odgi stats.
- pangenome.smoothxg.gfa: A GFA file which contains the smoothxg-smoothed pangenome graph. It is consumed by odgi sort and GFAffix.
- pangenome.og: A variation graph encoded in the binary ODGI format. It is consumed by odgi view.
- pangenome.gfaffix.gfa: A GFA file which was normalized with gfaffix. It is consumed by vg deconstruct, odgi layout, odgi draw, and odgi viz.
- pangenome.lay: A binary file which holds the 2D graph layout produced by odgi layout. Input for odgi draw.
I think only wfmash is able to read zipped files. Otherwise we need to leave these uncompressed.
Describe alternatives you've considered
Use existing data. I don't think this makes sense, because existing FASTAs usually only have 1 sequence incorporated, so we can't build a pangenome. Existing assembly GFA files do not encode a variation graph.
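A quick way to check whether an existing FASTA could serve as pangenome input is to count its records. The sketch below is only an illustration: the function names and the threshold of two sequences are my own assumptions, and it reads gzipped files via the standard library rather than relying on any tool's built-in support.

```python
import gzip


def count_fasta_records(path):
    """Count sequences (header lines starting with '>') in a plain or gzipped FASTA."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def usable_for_pangenome(path, min_sequences=2):
    """Return True if the FASTA holds enough sequences to align against each other.

    min_sequences=2 is an assumed lower bound, not an nf-core rule.
    """
    return count_fasta_records(path) >= min_sequences
```

A FASTA with a single record returns False here, which matches the concern above that such files cannot be used to build a pangenome.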
Additional context
The nf-core/pangenome pipeline brings some new challenges to nf-core.
meta parameter....
Therefore, I would like to put up for discussion whether we can add another key to the test data sets.
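To make the proposal easier to discuss, the suggested files and their consumers can be sketched as a small lookup table. The file names, tools, and directory are taken from this issue; the dictionary layout and the helper function are hypothetical and only meant for illustration, not an existing nf-core convention.

```python
# Directory suggested in this issue for the new "pangenome" key.
PANGENOME_DIR = "test-datasets/data/genomics/homo_sapiens/pangenome"

# Proposed test files mapped to the tools that consume them (as listed in this issue).
CONSUMERS = {
    "pangenome.fa": ["wfmash"],
    "pangenome.paf": ["seqwish"],
    "pangenome.seqwish.gfa": ["smoothxg", "odgi sort", "odgi build", "odgi stats"],
    "pangenome.smoothxg.gfa": ["odgi sort", "gfaffix"],
    "pangenome.og": ["odgi view"],
    "pangenome.gfaffix.gfa": ["vg deconstruct", "odgi layout", "odgi draw", "odgi viz"],
    "pangenome.lay": ["odgi draw"],
}


def inputs_for(tool):
    """Return the proposed test file paths that the given tool consumes."""
    return [
        f"{PANGENOME_DIR}/{name}"
        for name, tools in CONSUMERS.items()
        if tool in tools
    ]


# Example: odgi draw needs both a graph and its layout.
for path in inputs_for("odgi draw"):
    print(path)
```

Keeping the mapping in one place makes it easy to see which modules would still lack test data if a file were dropped to keep the repository small.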