nf-core / pangenome

Renders a collection of sequences into a pangenome graph. https://doi.org/10.1093/bioinformatics/btae609.
https://nf-co.re/pangenome
MIT License

Test data for the pipeline #74

Closed subwaystation closed 2 years ago

subwaystation commented 2 years ago

Is your feature request related to a problem? Please describe

We don't have test data in the https://github.com/nf-core/test-datasets repository for most of the modules (https://github.com/nf-core/pangenome/issues/70) that this pipeline requires. That said, we do have test data to run the pipeline as a whole: https://github.com/nf-core/test-datasets/tree/pangenome. I am also carrying over some information from @heuermh's comment at https://github.com/nf-core/pangenome/issues/70#issuecomment-988995417.

Quoting @heuermh

In nf-core/modules the test data are only useful for smoke testing the modules (i.e. making sure they run with the correct inputs and outputs and don't explode). There are GFA files at https://github.com/nf-core/test-datasets/tree/modules/data/genomics/sarscov2/illumina/gfa to use. What else might we need?

The GFA mentioned above is an assembly graph. It does not make sense to use it as test data, because we need an alignment graph in GFA format that follows the variation graph model. In an assembly GFA the paths and sequences are not related to each other, while in a variation graph GFA they are. Without that relationship, the downstream testing does not make sense.
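To illustrate what "paths and sequences are related" means in a variation graph GFA, here is a minimal Python sketch. It assumes GFA 1.0 with tab-separated `S` (segment) and `P` (path) lines; the toy graph, the sample names, and the helper `spell_paths` are made up for illustration and are not part of any nf-core module.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def spell_paths(gfa_text):
    """Reconstruct the sequence of every P line from the S lines it walks.

    In a variation graph each path spells out one input sequence, so the
    result should reproduce the original FASTA records.
    """
    segments = {}
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")

    spelled = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            seg_id, orient = step[:-1], step[-1]  # e.g. "12+" -> ("12", "+")
            seq = segments[seg_id]
            parts.append(seq if orient == "+" else reverse_complement(seq))
        spelled[name] = "".join(parts)
    return spelled


# Hypothetical toy graph: two samples that differ by one SNP (T vs C).
TOY_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tC",
    "S\t4\tGGA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tsampleA\t1+,2+,4+\t*",
    "P\tsampleB\t1+,3+,4+\t*",
])

print(spell_paths(TOY_GFA))
```

A smoke test for a downstream module could compare the spelled paths against the input FASTA; for a GFA that does not follow the variation graph model, that comparison would not be meaningful.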

Describe the solution you'd like

I think we need a new key pangenome in the list of possible data set types. I would suggest test-datasets/data/genomics/homo_sapiens/pangenome. As the files only need to allow smoke testing of the modules, I would suggest the very light human HLA gene data set https://github.com/pangenome/pggb/blob/master/data/HLA/V-352962.fa.gz as test data. A list of possible files could be

I think only wfmash is able to read zipped files. Otherwise we need to leave these uncompressed.
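If a compressed file is stored anyway, a module that cannot read gzip input would need an uncompressed copy first. A minimal sketch of that step using only the Python standard library follows; the helper name `gunzip_to` and the output file name are illustrative assumptions, and which tools actually accept gzip input is as stated above, not verified here.

```python
import gzip
import shutil


def gunzip_to(src_path, dst_path):
    """Stream-decompress a gzip file to dst_path without loading it into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Example (paths are placeholders):
# gunzip_to("V-352962.fa.gz", "V-352962.fa")
```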

Describe alternatives you've considered

Use existing data. I don't think this makes sense, because existing FASTAs usually contain only 1 sequence, so we can't build a pangenome from them. Existing assembly GFA files do not encode a variation graph.
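A quick way to check whether an existing FASTA is usable at all is to count its records. The sketch below assumes plain FASTA text where every record starts with a `>` header line; the function names and the two-sequence threshold are my own shorthand for "more than one sequence", not a rule taken from the pipeline.

```python
import io


def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def has_multiple_sequences(lines):
    """Return True if there are at least two sequences to relate to each other."""
    return count_fasta_records(lines) >= 2


# Hypothetical inputs for illustration.
single = io.StringIO(">ref\nACGTACGT\n")
multi = io.StringIO(">hapA\nACGTACGT\n>hapB\nACGCACGT\n")

print(has_multiple_sequences(single))
print(has_multiple_sequences(multi))
```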

Additional context

The nf-core/pangenome pipeline brings some new challenges to nf-core.

  1. The input is a FASTA file, not FASTQ file(s). Therefore I doubt we will need a spreadsheet as input. I wonder how this will affect the meta parameter...
  2. Lots of the file types produced and consumed by the tools are currently not present in any test data repository.
  3. The data structures holding these novel data types require that certain data types follow a pangenome standard, e.g. a GFA can only be used sensibly if it encodes a variation graph.

Therefore, I would like to open a discussion on whether we can add another key to the test data sets.

heuermh commented 2 years ago

I think it sounds reasonable to add all these as test data for nf-core/modules, as long as we keep the file sizes as small as possible.

I think only wfmash is able to read zipped files. Otherwise we need to leave these uncompressed.

Sounds good, I think we might want to save dealing with compressed files until later.

subwaystation commented 2 years ago

All test data for all modules is present.