According to the paper:
"To create a tumor-normal pair at a desired purity, we take three sequenced samples: a “pure tumor” sample and two sister samples that are distant to the tumor. The two sister samples are considered “normal” (relative to the “pure tumor”). We informatically mix one of the normal samples with the “pure tumor” to create the case tumor sample, and use the other normal as the case normal, for somatic variant discovery pipelines that are run with the matched normal. Two sister samples are needed to act as the normal sample because there is not enough coverage in one sample to use as both a mixed-in normal in addition to a matched normal."
From this I understand we can use one sample as the tumor and then pick one distantly related sample as the normal, e.g. S56 as normal and S54 as tumor (image from the paper, linked above).
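For reference, a minimal sketch of what that informatic mixing could look like, assuming both input BAMs have roughly the same mean coverage and `samtools` is on PATH; the file names, purity value, and seed below are made up for illustration and are not taken from the paper:

```python
#!/usr/bin/env python3
"""Sketch: mix a "pure tumor" BAM with one sister "normal" BAM at a desired purity."""
import subprocess

TARGET_PURITY = 0.6          # desired tumor fraction in the mixed "case tumor" (assumption)
TUMOR_BAM = "tumor.bam"      # placeholder name for the "pure tumor" sample
NORMAL_BAM = "normal1.bam"   # placeholder name for the sister sample mixed into the tumor
SEED = 42                    # integer part of samtools view -s is the subsampling seed

def subsample(bam, fraction, out_bam):
    """Keep ~`fraction` of the read pairs in `bam` via `samtools view -s SEED.FRACTION`."""
    subprocess.run(
        ["samtools", "view", "-b", "-s", f"{SEED + fraction:.4f}", "-o", out_bam, bam],
        check=True,
    )

# Case tumor = TARGET_PURITY of the pure tumor plus (1 - TARGET_PURITY) of normal1.
# This only approximates the target purity if both BAMs have similar coverage.
subsample(TUMOR_BAM, TARGET_PURITY, "tumor_part.bam")
subsample(NORMAL_BAM, 1 - TARGET_PURITY, "normal_part.bam")
subprocess.run(
    ["samtools", "merge", "-f", "case_tumor.bam", "tumor_part.bam", "normal_part.bam"],
    check=True,
)
# The second sister sample (e.g. normal2.bam) would then be used untouched as the matched normal.
```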
To keep costs and storage size low, I would propose a single pair for now. If we have credits left, we can always add more later on
@szilvajuhos What do you think about that?
It is not forgotten, will have a look after Tuesday.
I am interested in finding a dataset to use for benchmarking the computing platform and so this pipeline.
Are there any updates on this?
Question is what sort of benchmark we want to run. For sensitivity/precision benchmarks the coverage should be something like 60x/30x at least. To run a system test we need much less, i.e. a WES should be fine. My problem with the dataset mentioned in the paper is that it is already aligned to HG19 (why, my dear, why HG19 in 2020?), and we do not have the raw data. I know we can make raw FASTQs, but it is a lot of work.
What if, for initial run tests, we make an artificial WES from this dataset, so we can at least test the pipeline on AWS? This would not be suitable for benchmarking, but real benchmarks are a pretty different business anyway.
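A rough sketch of how such an artificial WES could be derived from one of the aligned BAMs, just for exercising the pipeline (not for benchmarking); `wgs.bam` and `capture.bed` are placeholders, and `samtools` is assumed to be on PATH:

```python
#!/usr/bin/env python3
"""Sketch: cut a WGS BAM down to exome capture regions and export FASTQs."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Keep only reads overlapping the exome capture regions (placeholder BED file).
run(["samtools", "view", "-b", "-L", "capture.bed", "-o", "wes_like.bam", "wgs.bam"])
# 2. Group mates together so the paired FASTQ export below stays consistent.
run(["samtools", "collate", "-o", "wes_like.collated.bam", "wes_like.bam"])
# 3. Export FASTQs that could be fed back into the pipeline from the start.
run(["samtools", "fastq",
     "-1", "wes_R1.fastq.gz", "-2", "wes_R2.fastq.gz",
     "-0", "/dev/null", "-s", "/dev/null",
     "wes_like.collated.bam"])
```

Note that reads whose mates fall outside the capture regions end up unpaired, so the resulting "WES" is only an approximation of a real capture experiment.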
For tests, we will soon have this, thanks to @FriederikeHanssen cf https://github.com/nf-core/test-datasets/pull/241/files
As posted on Slack: it would be really cool to leverage some "standard" benchmarking datasets for this sort of thing, to directly compare Sarek to existing benchmarks, e.g.
An example of a paper using the Genome in a Bottle data for benchmarking.
Data for some actual somatic benchmarking: https://www.nature.com/articles/s41587-021-00993-6, available on SRA without restriction as far as I can see, and we can use it to actually check how accurate the calls are. (Possibly out of scope for the AWS test due to data size, but I would still leave it here for the moment.)
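As a very rough sketch of how one could quantify that accuracy against a published truth set (a dedicated benchmarking tool such as hap.py/som.py would normally be used instead); `truth.vcf.gz` and `calls.vcf.gz` are placeholders and are assumed to be bgzipped and tabix-indexed, with `bcftools` on PATH:

```python
#!/usr/bin/env python3
"""Sketch: crude precision/recall from a bcftools isec of calls vs. a truth VCF."""
import subprocess

# Intersect the truth set and our call set; -O v writes plain-text VCFs into isec_out/.
subprocess.run(["bcftools", "isec", "-O", "v", "-p", "isec_out",
                "truth.vcf.gz", "calls.vcf.gz"], check=True)

def n_records(vcf):
    """Count non-header records in one of the VCFs written by bcftools isec."""
    with open(vcf) as fh:
        return sum(1 for line in fh if not line.startswith("#"))

fn = n_records("isec_out/0000.vcf")   # records private to the truth set (missed calls)
fp = n_records("isec_out/0001.vcf")   # records private to our calls (false positives)
tp = n_records("isec_out/0002.vcf")   # shared records (true positives)

print(f"precision = {tp / (tp + fp):.3f}")
print(f"recall    = {tp / (tp + fn):.3f}")
```

This ignores genotype matching and variant normalization, so treat the numbers as a sanity check only.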
see PR #580
Posted this in Slack a while back, but did you see this paper from the guys at Google responsible for DeepVariant? https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1
"To facilitate community use on standardized, processing-ready samples, we generated serial downsamples of the WGS at 50x, 40x, 30x, and 20x uniquely mapped coverage, and of the exomes at 100x, 75x, and 50x coverage of the kit capture regions. These are mapped to GRCh37 and GRCh38 with ALT contigs in an ALT-aware manner. At higher coverages, some samples are not present due to insufficient coverage. In total, this covers 246 WGS BAM files and 218 WES BAM files. These files are available for public download (see data availability) without access control or restriction. Download URLs for FASTQs (Supplementary File 1), BAMs (Supplementary File 2), and VCFs (Supplementary File 3) are included."
Looks like quite a comprehensive set of samples for both WES and WGS that you could use directly for benchmarking.
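If a coverage level outside the published serial downsamples were ever needed, something along these lines could produce it; the input BAM name and target coverage are placeholders, and `samtools` >= 1.10 (for `samtools coverage`) is assumed:

```python
#!/usr/bin/env python3
"""Sketch: downsample a BAM to a target mean coverage using samtools."""
import subprocess

TARGET_COVERAGE = 25.0                 # hypothetical target not in the published set
BAM = "HG002.GRCh38.50x.bam"           # placeholder input from the 50x downsample set

# `samtools coverage` reports per-contig mean depth (column 7, "meandepth").
out = subprocess.run(["samtools", "coverage", BAM],
                     check=True, capture_output=True, text=True).stdout
rows = [line.split("\t") for line in out.splitlines() if not line.startswith("#")]

# Length-weighted mean depth over all contigs (column 3 = endpos, column 7 = meandepth).
total_bases = sum(int(r[2]) for r in rows)
mean_depth = sum(int(r[2]) * float(r[6]) for r in rows) / total_bases

fraction = TARGET_COVERAGE / mean_depth
if fraction >= 1.0:
    raise SystemExit(f"Mean depth {mean_depth:.1f}x is already at or below the target")

# samtools view -s SEED.FRACTION keeps ~FRACTION of the read pairs (seed 42 here).
subprocess.run(["samtools", "view", "-b", "-s", f"{42 + fraction:.4f}",
                "-o", f"downsampled_{int(TARGET_COVERAGE)}x.bam", BAM], check=True)
```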
✔️ for germline full size test, see #604
for somatic tests see #652
I haven't found a related issue so far. In case I missed it, we can just add this there.
Is your feature request related to a problem? Please describe
As of this year, we are trying to add full-size tests to all pipelines, run them on AWS, and then display the results on the homepage.
Describe the solution you'd like
One dataset suitable for this could be the one described in this paper:
Describe alternatives you've considered
Haven't searched for other suitable datasets, but we could use this thread to collect more before deciding on one for the test run.