sgkit-dev / sgkit-publication

Sgkit publication repository

Quantgen use-case #9

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

Following a similar pattern to the popgen use-case (#8), we'd like to situate and illustrate the quantitative genetics part of sgkit. Here's a rough structure to get started (one paragraph each):

  1. Who is the target audience (i.e., what is quantgen?), what do they do, and what tools do they use to do it? What are the limitations of these tools?
  2. Sketch of sgkit's quantitative genetics API
  3. Example of doing something useful with (a subset of) the pedigree methods (mixed ploidy?)
  4. Example of doing something useful with the genomics methods (e.g. GRM?)

For 3 and 4 we want to be quite concrete, using a specific dataset and giving precise numbers about file sizes and the time it takes to run specific methods (if applicable, we can also mention the time it takes to do the same thing in some other tool). Similarly to #8, it would be nice if the analysis could go into a notebook that we store in this repo and which we could make available as supplementary material (as PDF). I don't think we need to make these notebooks strictly "runnable" though, so I think it's OK if the datasets used aren't freely available.
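To make item 4 a bit more tangible, here is a minimal sketch of the kind of GRM computation such an example could revolve around. It is plain NumPy rather than sgkit's API, it assumes the widely used VanRaden estimator for diploid dosages, and the function name `vanraden_grm` and the simulated genotypes are hypothetical placeholders, not a real dataset.

```python
import numpy as np


def vanraden_grm(dosage):
    """Genomic relationship matrix from diploid alternate-allele dosages.

    dosage: (n_samples, n_variants) array with values in {0, 1, 2}.
    Assumes the VanRaden estimator: G = Z Z^T / (2 * sum(p * (1 - p))).
    """
    dosage = np.asarray(dosage, dtype=float)
    # Alternate allele frequency per variant, estimated from the sample.
    p = dosage.mean(axis=0) / 2.0
    # Centre each variant by its expected dosage 2p.
    Z = dosage - 2.0 * p
    denom = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / denom


# Simulated genotypes only (hypothetical, for illustration).
rng = np.random.default_rng(0)
dosage = rng.binomial(2, 0.3, size=(5, 1000))
G = vanraden_grm(dosage)
```

Because the allele frequencies are estimated from the same samples, every column of the centred matrix sums to zero, so the rows and columns of `G` also sum to zero. That makes a cheap sanity check in a notebook.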

I don't really know any open datasets that might be appropriate - I'm hoping @timothymillar might have some suggestions.

Does this outline seem sensible to you @timothymillar?

timothymillar commented 1 year ago

This looks good @jeromekelleher. Seeing it sketched out like this makes me think that a story around genomic prediction via (G)BLUP would be ideal. This would tie in the pedigree (inverse) relationship matrix with the GRM. However, we don't have a BLUP LMM implementation yet... I think we could get a simple implementation going without too much hassle using da.linalg.solve but it may not be the most performant. With or without a BLUP method, the focus should still be on kinship/relationship matrices.
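As a rough illustration of what a simple solve-based BLUP could look like, here is a minimal sketch. It uses `np.linalg.solve` so that it runs without dask; the assumption is that `da.linalg.solve` could be substituted on dask arrays. The function name `gblup`, the intercept-only model, and the known variance ratio `lam` (residual variance divided by genetic variance) are all assumptions for illustration, not an existing sgkit method.

```python
import numpy as np


def gblup(y, G, lam):
    """Intercept-only (G)BLUP with a known variance ratio (a sketch).

    Model: y = mu + g + e, with g ~ N(0, G * s2g) and e ~ N(0, I * s2e).
    lam = s2e / s2g is assumed known here; in practice it would be
    estimated, e.g. by REML. G may be a genomic or pedigree matrix.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    ones = np.ones(n)
    V = G + lam * np.eye(n)
    # Generalised least squares estimate of the mean.
    Vinv_y = np.linalg.solve(V, y)
    Vinv_1 = np.linalg.solve(V, ones)
    mu = (ones @ Vinv_y) / (ones @ Vinv_1)
    # BLUP of the genetic values: G V^{-1} (y - mu).
    g_hat = G @ np.linalg.solve(V, y - mu * ones)
    return mu, g_hat


# Toy check: unrelated individuals (G = I) and lam = 1 shrink each
# deviation from the mean by one half.
y = np.array([1.0, 2.0, 3.0, 6.0])
mu, g_hat = gblup(y, np.eye(4), lam=1.0)
```

Working with `V = G + lam * I` avoids inverting the relationship matrix itself, which matters because a GRM built from centred genotypes is singular.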

The largest publicly available (diploid) pedigrees I know of are published by Cleveland et al 2012 (6473 individuals) and Bérénos et al 2014 (6740 individuals) so not exactly huge. However, these datasets are particularly good for examples as they also include genotypes for (a subset of) samples in addition to trait and heritability estimates. There are also a few simulated pedigrees around, e.g. Habier et al 2013 who simulated pedigree and marker data to investigate BLUP methods.

I'm not aware of any large publicly available polyploid pedigrees. However, polyploidy should have little effect on method performance and we can easily mock it using a diploid pedigree. I'll keep hunting for more examples. I agree that the data don't need to be freely available, but it would be preferable.

jeromekelleher commented 1 year ago

Sounds great @timothymillar! I don't think we need to use huge datasets for every example - I think diversity across different types of data and organisms is more useful really.

timothymillar commented 1 year ago

@jeromekelleher After reading your distributed computing blurb, I've been re-considering how we can do chunked computation of pedigree kinship/relationship matrices. I have some rough code that works, although the performance (real time) isn't great (dicts slow down numba quite a bit). It does help with memory usage though and I can calculate individual chunks from kinship matrices that are too large to fit in RAM. Both memory usage and compute time depend on the pedigree structure.
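For readers unfamiliar with the idea, the following is a minimal sketch of computing a single block of a diploid pedigree kinship matrix without materialising the full matrix. It is not the rough code mentioned above: it is a plain-Python recursive formulation with a memoisation cache, and it assumes a pedigree encoded as an `(n, 2)` integer array of parent indices with `-1` for unknown parents and with parents always preceding their offspring. The name `kinship_block` is hypothetical.

```python
from functools import lru_cache

import numpy as np


def kinship_block(parents, rows, cols):
    """Kinship between individuals `rows` and `cols` (diploid sketch).

    parents: (n, 2) int array of parent indices, -1 for unknown.
    Assumes parents always have a lower index than their offspring.
    """
    parents = np.asarray(parents)

    @lru_cache(maxsize=None)
    def phi(i, j):
        if i < 0 or j < 0:
            return 0.0  # unknown parents are treated as unrelated
        if i == j:
            s, d = parents[i]
            return 0.5 * (1.0 + phi(int(s), int(d)))
        if i < j:
            i, j = j, i  # always recurse on the younger individual
        s, d = parents[i]
        return 0.5 * (phi(int(s), j) + phi(int(d), j))

    return np.array([[phi(int(i), int(j)) for j in cols] for i in rows])


# Two founders (0, 1), two full sibs (2, 3) and their inbred child (4).
parents = [[-1, -1], [-1, -1], [0, 1], [0, 1], [2, 3]]
block = kinship_block(parents, rows=[4], cols=[2, 3, 4])
```

Only the ancestors reachable from the requested block are visited, which is why memory use and run time depend on the pedigree structure. A deep pedigree would exceed Python's recursion limit with this formulation, so it is an illustration rather than a practical implementation.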

I personally think it would add value to the paper if we can demonstrate chunked/distributed computation of both genomic and pedigree relationship matrices. But the code is quite complex, so I'd like a second opinion before pursuing it much further.

jeromekelleher commented 1 year ago

I think we should be driven by applications @timothymillar - if we don't have any pedigrees that we are currently using that require going "distributed" then I think it's fine to assume single machine for now. Do we make any hard assumptions in our API that would preclude later generalisations to multi-node?

timothymillar commented 1 year ago

That's a reasonable approach. There are no blockers with the current API, although I'd probably require that a pedigree is already sorted to use chunking, as it simplifies things a lot.
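A sortedness requirement like this is cheap to verify up front. The sketch below assumes "sorted" means every parent appears before its offspring, with the pedigree stored as an `(n, 2)` integer array of parent indices and `-1` marking unknown parents. The name `is_sorted_pedigree` is hypothetical.

```python
import numpy as np


def is_sorted_pedigree(parents):
    """True if every known parent has a lower index than its offspring.

    parents: (n, 2) int array of parent indices, -1 for unknown.
    """
    parents = np.asarray(parents)
    own_index = np.arange(len(parents))[:, None]
    # Unknown parents (-1) always pass because every index is >= 0.
    return bool(np.all(parents < own_index))


sorted_ped = [[-1, -1], [-1, -1], [0, 1]]
unsorted_ped = [[1, 2], [-1, -1], [-1, -1]]
ok = is_sorted_pedigree(sorted_ped)
bad = is_sorted_pedigree(unsorted_ped)
```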

jeromekelleher commented 1 year ago

SGTM. We could add a sentence or two discussing this to the paper? Saying we assume stuff is in-core at the moment, but there's a clear path to scaling out if the need arises (e.g., if working with the BALSAC pedigree)?