Test data for the pipeline

subwaystation commented 2 years ago

Is your feature request related to a problem? Please describe

Test data for most of the modules required by this pipeline (https://github.com/nf-core/pangenome/issues/70) are not available in the https://github.com/nf-core/test-datasets repository. That said, we do have test data to run the pipeline as a whole: https://github.com/nf-core/test-datasets/tree/pangenome. I am also carrying over some information from @heuermh: https://github.com/nf-core/pangenome/issues/70#issuecomment-988995417.

Quoting @heuermh

In nf-core/modules the test data are only useful for smoke testing the modules (i.e. making sure they run with the correct inputs and outputs and don't explode). There are GFA files at https://github.com/nf-core/test-datasets/tree/modules/data/genomics/sarscov2/illumina/gfa to use. What else might we need?

The GFA mentioned above is an assembly graph. It does not make sense to use it as test data, because we need an alignment graph in GFA format that follows the variation graph model. In an assembly GFA the paths and sequences are not related to each other, whereas in a variation graph GFA they are. Without such a graph, downstream testing is not meaningful.
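To illustrate what "paths and sequences are related" means in practice, here is a minimal Python sketch that spells out each path of a GFA 1 file from its segments. The function name `spell_paths` and the toy graph are hypothetical and only for illustration; this is not a full GFA validator and not part of any of the tools mentioned here. In a variation graph GFA, each spelled path should reproduce one of the input sequences.

```python
# Translation table used to reverse-complement segments traversed in "-" orientation.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spell_paths(gfa_text):
    """Return {path name: sequence spelled by walking the path's segments}."""
    segments = {}
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "P" and len(fields) >= 3:
            # Steps look like "1+,2-,3+".
            paths[fields[1]] = [step for step in fields[2].split(",") if step]

    spelled = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            seg_id, orientation = step[:-1], step[-1]
            # A KeyError here means the path refers to an undefined segment.
            seq = segments[seg_id]
            if orientation == "-":
                seq = seq.translate(COMPLEMENT)[::-1]
            parts.append(seq)
        spelled[name] = "".join(parts)
    return spelled


# Hypothetical toy variation graph: two haplotypes differing by one SNP.
toy_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tA",
    "S\t3\tG",
    "S\t4\tTTCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\thap1\t1+,2+,4+\t*",
    "P\thap2\t1+,3+,4+\t*",
])

for path_name, sequence in spell_paths(toy_gfa).items():
    print(path_name, sequence)
```

A check like this is only a necessary condition for a usable test graph: it confirms that the paths can be spelled from the segments, not that the graph was built correctly.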

Describe the solution you'd like

I think we need a new key pangenome in the list of possible data set types. I would suggest test-datasets/data/genomics/homo_sapiens/pangenome. As the files should allow a smoke-test execution of each module, I would suggest the very light human HLA gene data set https://github.com/pangenome/pggb/blob/master/data/HLA/V-352962.fa.gz as test data. A list of possible files could be:

pangenome.fa: A FASTA file which contains several related genomes for the input of wfmash.
pangenome.paf: A PAF file which contains the pairwise alignments of related genomes. Ideally, it was generated by wfmash. It is consumed by seqwish.
pangenome.seqwish.gfa: A GFA file which contains the pangenome graph induced by seqwish encoded in the variation graph model. It is consumed by smoothxg, odgi sort, odgi build, and odgi stats.
pangenome.smoothxg.gfa: A GFA file which contains the smoothxg smoothed pangenome graph. It is consumed by odgi sort, and GFAffix.
pangenome.og: A variation graph encoded in the binary ODGI format. It is consumed by odgi view.
pangenome.gfaffix.gfa: A GFA file which was normalized with gfaffix. It is consumed by vg deconstruct, odgi layout, odgi draw, and odgi viz.
pangenome.lay: A binary file which holds the 2D graph layout produced by odgi layout. Input for odgi draw.
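The file list above can also be written down as a lookup table, which makes it easy to see which test files a given module would need. This is a minimal Python sketch; the names `CONSUMERS` and `files_for` are hypothetical, and the mapping only restates the list in this comment.

```python
# Proposed test files mapped to the tools that consume them (taken from the list above).
CONSUMERS = {
    "pangenome.fa": ["wfmash"],
    "pangenome.paf": ["seqwish"],
    "pangenome.seqwish.gfa": ["smoothxg", "odgi sort", "odgi build", "odgi stats"],
    "pangenome.smoothxg.gfa": ["odgi sort", "gfaffix"],
    "pangenome.og": ["odgi view"],
    "pangenome.gfaffix.gfa": ["vg deconstruct", "odgi layout", "odgi draw", "odgi viz"],
    "pangenome.lay": ["odgi draw"],
}


def files_for(tool):
    """Return the sorted list of proposed test files that the given tool consumes."""
    return sorted(name for name, tools in CONSUMERS.items() if tool in tools)


print(files_for("odgi draw"))
print(files_for("odgi sort"))
```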

I think only wfmash is able to read compressed files. The other files need to stay uncompressed.

Describe alternatives you've considered

Use existing data. I don't think this makes sense, because existing FASTAs usually contain only one sequence, so we can't build a pangenome from them. Existing assembly GFA files do not encode a variation graph.
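A quick way to check whether a candidate FASTA holds more than one sequence is to count its header lines. The following is a minimal Python sketch; the function name `count_fasta_records` and the demo file are hypothetical, and treating a `.gz` suffix as gzip-compressed is an assumption.

```python
import gzip
import os
import tempfile


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a throwaway gzipped FASTA with three made-up records.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.fa.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">seq1\nACGT\n>seq2\nACGA\n>seq3\nACGG\n")

n = count_fasta_records(demo_path)
print(n, "sequences;", "usable" if n >= 2 else "not usable", "for building a pangenome")
```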

Additional context

The nf-core/pangenome pipeline brings some new challenges to nf-core.

The input is a FASTA file, not FASTQ file(s). Therefore I doubt we will need a spreadsheet as input. I wonder how this will affect the meta parameter.
Lots of the file types produced and consumed by the tools are currently not present in any test data repository.
The data structures holding these novel data types require that certain data types follow a pangenome standard, e.g. a GFA can only be used sensibly if it encodes a variation graph.

Therefore, I would like to open a discussion on whether we can add another key to the test data sets.

heuermh commented 2 years ago

I think it sounds reasonable to add all these as test data for nf-core/modules, as long as we keep the file sizes as small as possible.

I think only wfmash is able to read compressed files. The other files need to stay uncompressed.

Sounds good, I think we might want to save dealing with compressed files until later.

subwaystation commented 2 years ago

All test data for all modules is present.
