nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

[FEATURE] Add Datasets for AWS Full-size test #339

Closed · FriederikeHanssen closed this issue 2 years ago

FriederikeHanssen commented 3 years ago

I haven't found a related issue, so far. In case I missed it, we can just add this there.

Is your feature request related to a problem? Please describe

As of this year, we are trying to add full-size tests to all pipelines, run them on AWS, and then display the results on the homepage.
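
For context, these full-size tests are typically driven by a `conf/test_full.config` profile in the pipeline. A minimal sketch of what that could look like here, with a placeholder samplesheet URL, genome key and tool list (the actual dataset is exactly what this issue is meant to decide):

```nextflow
// Sketch only: the samplesheet URL, genome key and tool list below are
// placeholders, not the final full-size dataset.
params {
    config_profile_name        = 'Full test profile'
    config_profile_description = 'Full-size test dataset to check pipeline function on AWS'

    // Placeholder samplesheet pointing at the chosen tumor/normal pair
    input  = 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/<full_size_samplesheet>.csv'

    genome = 'GATK.GRCh38'
    tools  = 'strelka,mutect2,vep'
}
```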

Describe the solution you'd like

One dataset suitable for this could be the one described in this paper:

Describe alternatives you've considered

Haven't searched for other suitable datasets, but we could use this thread to collect more before deciding on one for the test run.

FriederikeHanssen commented 3 years ago

According to the paper:

"To create a tumor-normal pair at a desired purity, we take three sequenced samples: a “pure tumor” sample and two sister samples that are distant to the tumor. The two sister samples are considered “normal” (relative to the “pure tumor”). We informatically mix one of the normal samples with the “pure tumor” to create the case tumor sample, and use the other normal as the case normal, for somatic variant discovery pipelines that are run with the matched normal. Two sister samples are needed to act as the normal sample because there is not enough coverage in one sample to use as both a mixed-in normal in addition to a matched normal."

From this I understand that we can use one sample as the tumor and pick a distantly related one as the normal, e.g. S56 as normal and S54 as tumor (image from the paper linked above; a rough sketch of the mixing step follows the figure):

[Screenshot of the sample figure from the paper, 2021-02-12]
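
For illustration, the mixing step could look roughly like the following Nextflow process around samtools. The process name, the 60% purity, and the subsampling seed are placeholder assumptions, not anything from the paper or from Sarek; the second sister sample would stay unmixed and be used as the matched normal, as in the quote above.

```nextflow
process MIX_TUMOR_NORMAL {
    input:
    tuple val(meta), path(pure_tumor_bam), path(sister_normal_bam)

    output:
    tuple val(meta), path("${meta.id}.mixed_tumor.bam")

    script:
    """
    # samtools -s takes SEED.FRACTION: keep ~60% of the pure tumor reads ...
    samtools view -b -s 42.60 ${pure_tumor_bam}    > tumor_part.bam
    # ... and ~40% of the distant sister sample, approximating a 60% purity tumor
    samtools view -b -s 42.40 ${sister_normal_bam} > normal_part.bam
    samtools merge ${meta.id}.mixed_tumor.bam tumor_part.bam normal_part.bam
    samtools index ${meta.id}.mixed_tumor.bam
    """
}
```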

FriederikeHanssen commented 3 years ago

To keep costs and storage size low, I would propose a single pair for now. If we have credits left, we can always add more later on.

maxulysse commented 3 years ago

@szilvajuhos What do you think about that?

szilvajuhos commented 3 years ago

It is not forgotten, will have a look after Tuesday.

rjpbonnal commented 3 years ago

I am interested in finding a dataset to use for benchmarking the computing platform, and therefore this pipeline.

Has there been any progress on this?

szilvajuhos commented 3 years ago

The question is what sort of benchmark we want to run. For sensitivity/precision benchmarks the coverage should be at least something like 60x/30x. To run a system test we need much less data, i.e. a WES should be fine. My problem with the dataset mentioned in the paper is that it is already aligned to HG19 (why, my dear, why HG19 in 2020?), and we do not have the raw data. I know we can regenerate raw FASTQs, but it is a lot of work.
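
(For reference, regenerating FASTQs from the HG19-aligned BAMs would boil down to something like the untested sketch below; the process name and file names are placeholders.)

```nextflow
process BAM_TO_FASTQ {
    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("${meta.id}_R1.fastq.gz"), path("${meta.id}_R2.fastq.gz")

    script:
    """
    # group mates together, then split into R1/R2; singletons and
    # secondary/supplementary alignments are dropped
    samtools collate -u -O ${bam} | \\
        samtools fastq -1 ${meta.id}_R1.fastq.gz -2 ${meta.id}_R2.fastq.gz \\
                       -0 /dev/null -s /dev/null -n -
    """
}
```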

What if, for initial run tests, we make an artificial WES from this dataset, so that we can at least test the pipeline on AWS? This would not be suitable for benchmarking, but real benchmarks are a pretty different business anyway.
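
A rough sketch of that subsetting step (restricting the aligned WGS reads to an exome capture BED); the process name and the capture BED are assumptions for illustration:

```nextflow
process MAKE_ARTIFICIAL_WES {
    input:
    tuple val(meta), path(wgs_bam), path(wgs_bai)
    path  capture_bed

    output:
    tuple val(meta), path("${meta.id}.wes_subset.bam")

    script:
    """
    # keep only reads overlapping the capture regions
    samtools view -b -L ${capture_bed} ${wgs_bam} > ${meta.id}.wes_subset.bam
    samtools index ${meta.id}.wes_subset.bam
    """
}
```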

maxulysse commented 3 years ago

For tests, we will soon have this, thanks to @FriederikeHanssen, cf. https://github.com/nf-core/test-datasets/pull/241/files

drpatelh commented 3 years ago

As posted on Slack: it would be really cool to leverage some "standard" benchmarking datasets for this sort of thing, to directly compare Sarek to existing benchmarks, e.g.:

Example of a paper using the Genome in a Bottle data for benchmarking.

FriederikeHanssen commented 2 years ago

Data for some actual somatic benchmarking: https://www.nature.com/articles/s41587-021-00993-6. It is available on SRA without restriction as far as I can see, and we could use it to actually see how accurate the calls are. (Possibly out of scope for the AWS test due to the data size, but I would still leave it here for the moment.)

FriederikeHanssen commented 2 years ago

see PR #580

drpatelh commented 2 years ago

I posted this in Slack a while back, but did you see this paper from the guys at Google responsible for DeepVariant? https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1

"To facilitate community use on standardized, processing-ready samples, we generated serial downsamples of the WGS at 50x, 40x, 30x, and 20x uniquely mapped coverage, and of the exomes at 100x, 75x, and 50x coverage of the kit capture regions. These are mapped to GRCh37 and GRCh38 with ALT contigs in an ALT-aware manner. At higher coverages, some samples are not present due to insufficient coverage. In total, this covers 246 WGS BAM files and 218 WES BAM files. These files are available for public download (see data availability) without access control or restriction. Download URLs for FASTQs (Supplementary File 1), BAMs (Supplementary File 2), and VCFs (Supplementary File 3) are included."

Looks like quite a comprehensive set of samples for both WES and WGS that you could use directly for benchmarking.
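
As a quick aside: once any of these BAMs are downloaded, the advertised coverages can be sanity-checked before wiring them into a test profile. A minimal sketch, with placeholder names:

```nextflow
process CHECK_COVERAGE {
    input:
    tuple val(meta), path(bam), path(bai)

    output:
    tuple val(meta), path("${meta.id}.coverage.txt")

    script:
    """
    # per-contig breadth and mean depth (meandepth is column 7 of the output)
    samtools coverage ${bam} > ${meta.id}.coverage.txt
    """
}
```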

FriederikeHanssen commented 2 years ago

✔️ for the germline full-size test, see #604

FriederikeHanssen commented 2 years ago

For the somatic tests, see #652