iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Release 0.5 datasets #74

Closed iqbal-lab closed 6 years ago

iqbal-lab commented 6 years ago

1. P. falciparum good whole genome PRG.

Use the pf3k cortex calls + DBLMSPs PRG

2. TB good whole genome PRG

SNPs and small indels above frequency 0.01. Combine the VCFs of 500 M. tuberculosis samples, genotype those 500 samples at the deduplicated site list to get a single VCF, and exclude SNPs/indels below 1% frequency.
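The final filtering step (dropping sites below 1% frequency) could look roughly like the sketch below. It is only an illustration in plain Python, not the pipeline actually used here: the function names are made up for this example, it assumes an uncompressed multi-sample VCF with a GT field in FORMAT, and it computes allele frequency from the called genotypes rather than from any INFO annotation.

```python
def alt_allele_frequencies(vcf_line):
    """Return one frequency per ALT allele, computed from the GT fields of a VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    n_alts = len(fields[4].split(","))
    gt_index = fields[8].split(":").index("GT")
    counts = [0] * (n_alts + 1)  # index 0 is REF, 1..n are the ALT alleles
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Handles haploid ("1"), unphased ("0/1") and phased ("0|1") calls.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                counts[int(allele)] += 1
    called = sum(counts)
    if called == 0:
        return [0.0] * n_alts
    return [c / called for c in counts[1:]]


def filter_by_frequency(vcf_lines, min_freq=0.01):
    """Yield header lines unchanged, plus records with an ALT allele at or above min_freq."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
        elif max(alt_allele_frequencies(line)) >= min_freq:
            yield line


if __name__ == "__main__":
    # Hypothetical toy records, not real data.
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4\n"
    kept = "chr\t10\t.\tA\tG\t.\tPASS\t.\tGT\t1\t0\t0\t.\n"
    dropped = "chr\t20\t.\tC\tT\t.\tPASS\t.\tGT\t0\t0\t0\t0\n"
    for out in filter_by_frequency([header, kept, dropped]):
        print(out, end="")
```

A multi-allelic site is kept here if any one of its ALT alleles passes the threshold; whether that is the right rule for the PRG is a separate decision.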

What do we want for these datasets?

iqbal-lab commented 6 years ago

For the record, I think item 7 is the lowest priority of these. I tried running it on version 0.5.0, last git commit 97b3a4. The PRG was constructed and kmer indexing started, but after ~36 hours, having indexed 261 million out of 3.7 billion, it hit the LSF memory limit of 150 GB RAM.
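To put those numbers in perspective, here is a back-of-the-envelope extrapolation. It assumes the indexing rate stays constant, which is only an assumption; memory use is not extrapolated at all, and the variable names are just for this sketch.

```python
# Figures quoted in the comment above.
indexed = 261e6        # indexed when the job was killed
total = 3.7e9          # total to index
elapsed_hours = 36.0   # approximate wall-clock time before hitting the memory limit

fraction_done = indexed / total
# Linear extrapolation: assumes the rate observed so far holds for the whole run.
projected_hours = elapsed_hours / fraction_done

print(f"fraction indexed: {fraction_done:.1%}")
print(f"projected total: {projected_hours:.0f} hours ({projected_hours / 24:.1f} days)")
```

Under that assumption the run was about 7% complete and would have needed on the order of 500 hours (roughly three weeks) to finish, even without the memory limit.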

iqbal-lab commented 6 years ago

I've removed some excess datasets from this list. For historical purposes those datasets were:

P. falciparum graph on human background

5. S. cerevisiae (baker's yeast)

Take the 93 genomes here: http://genome.cshlp.org/content/25/5/762.full and run Cortex on them (Martin can help).

The following paper is good background and provides a set of 7 S. cerevisiae whole genomes: https://www.nature.com/ng/journal/v49/n6/full/ng.3847.html See Fig 3 in particular for a sense of how much variation we are ignoring with our model.

6. S. pombe (brewers yeast)

Get the 161 samples from http://www.nature.com/ng/journal/v47/n3/full/ng.3215.html , run Cortex on them and get a VCF from that.

7. Human with 1000g variants above freq 0.05

ffranr commented 6 years ago

@iqbal-lab I would like to close this issue. I've reorganized everything as a project. Is that ok with you?

The project references this issue. Nothing is lost.

iqbal-lab commented 6 years ago

Yep