seqeralabs / nf-sentieon

POC Nextflow pipeline to run Sentieon software
Mozilla Public License 2.0

Add germline variant calling test data #2

Open drpatelh opened 2 years ago

drpatelh commented 2 years ago

Description of feature

We are currently using a minimal SARS-CoV-2 test dataset, which is sufficient to test the pipeline, but we don't have dbSNP and known indel files for this reference.

It would be good to have an additional -profile test_germline that uses the test data created by Sentieon as part of their Quick start docs. This is a small dataset for germline variant calling, derived from part of NA12878/HG001.
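A minimal sketch of what such a profile could look like in nextflow.config. The parameter names (input, fasta, dbsnp, known_indels) and the placeholder paths are assumptions for illustration only; they are not taken from the pipeline and would need to match whatever options it actually defines.

```groovy
// Sketch only: parameter names and paths below are hypothetical placeholders
profiles {
    test_germline {
        params {
            // Samplesheet pointing at the small NA12878/HG001 FASTQ files
            input        = '<path-or-url>/samplesheet.csv'
            // Reference genome matching the germline test data
            fasta        = '<path-or-url>/reference.fasta'
            // Known-sites resources that the SARS-CoV-2 test data lacks
            dbsnp        = '<path-or-url>/dbsnp.vcf.gz'
            known_indels = '<path-or-url>/known_indels.vcf.gz'
        }
    }
}
```

With a profile like this in place, the test could be launched with `nextflow run seqeralabs/nf-sentieon -profile test_germline`, combined with whichever container or environment profile is in use.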

drpatelh commented 2 years ago

It would be nice to host the uncompressed data on S3 somewhere so we can link to individual reference and input files. One option would have been to upload them to GitHub, as we do for nf-core, but the FASTQ files are too large (~100 MB).

DonFreed commented 2 years ago

Agreed that the Sentieon Quickstart package is too large for this case.

Maybe we can leverage the test datasets used for the Sarek (https://github.com/nf-core/test-datasets/tree/sarek) pipeline? In particular, the Sarek dataset contains trimmed dbSNP, Mills, and known indel VCFs along with small FASTQ files.