sgkit-dev / sgkit-publication

Sgkit publication repository

Benchmark relationship matrices #42

Open timothymillar opened 10 months ago

timothymillar commented 10 months ago

Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.

Data: Start with Cleveland, Hickey and Forni (2012). A widely used pig dataset with 6473 pedigreed individuals of which 3534 have been genotyped. The genotype data includes 52843 biallelic SNVs with missing values already imputed.

Scope: Commonly used estimators applied to diploid data.

Comparisons:

Notes/Concerns:

timothymillar commented 9 months ago

I ran into https://github.com/dask/dask/issues/10645 when benchmarking which is probably slowing us down in a few locations. Hopefully there will be a quick fix.

jeromekelleher commented 9 months ago

I guess the factor of 2 here is basically the difference between us and R implementations, since the time is dominated by the matmul, and we'd expect linear algebra implementations to be broadly equivalent (if not identical, maybe actually calling in to the same fortran code, ultimately).

A good angle might be to explore parallelisation - can we parallelise this over multiple threads with Dask (no point in going the full distributed across a cluster here, I think)? I expect that's quite hard with R implementations.

timothymillar commented 9 months ago

can we parallelise this over multiple threads with Dask

Yes, but it's not worth it unless distributing across multiple nodes. The matmul in numpy is already multi-threaded. Likewise, the equivalent functions in R are multithreaded (crossprod/tcrossprod etc.).
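For context, the dominant operation looks roughly like the following sketch of a VanRaden-style genomic relationship matrix in NumPy (toy random dosages here, not the pig dataset; the estimator details are an assumption for illustration). The single crossproduct at the end is the matmul that BLAS already multi-threads:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: a (variants x samples) matrix of 0/1/2 dosage calls.
n_variants, n_samples = 1000, 50
dosage = rng.integers(0, 3, size=(n_variants, n_samples)).astype(float)

# VanRaden-style GRM: centre each variant by twice its allele frequency,
# then a single crossproduct dominates the runtime.
p = dosage.mean(axis=1) / 2
Z = dosage - 2 * p[:, None]
denom = 2 * np.sum(p * (1 - p))
G = (Z.T @ Z) / denom  # this matmul is already multi-threaded via BLAS
```

The `Z.T @ Z` here is the direct analogue of `crossprod(Z)` in R, which is why we expect the timings to be broadly equivalent.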

Based on a preliminary look, there is also quite a bit of difference in timings due to additional checks on the data, which can be quite inefficient. Things like checking that dosages are integers, performing mean imputation, filtering by MAF, etc. Our implementation leaves these tasks to the user (they may not be wanted or appropriate!), but this makes it harder to perform a like-for-like comparison.
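To make the comparison concrete, the kind of bundled validation/preprocessing I mean looks roughly like this (the function name, ordering, and MAF threshold are illustrative assumptions, not any specific package's API):

```python
import numpy as np

def preprocess(dosage, maf_threshold=0.01):
    """Illustrative bundle of checks some GRM tools perform up front.

    `dosage` is a (variants x samples) float array with NaN for missing calls.
    """
    dosage = np.asarray(dosage, dtype=float)
    observed = dosage[~np.isnan(dosage)]
    # 1. Check that observed calls are integer dosages (0, 1 or 2).
    if not np.isin(observed, [0.0, 1.0, 2.0]).all():
        raise ValueError("non-integer dosages")
    # 2. Mean-impute missing calls per variant.
    means = np.nanmean(dosage, axis=1, keepdims=True)
    dosage = np.where(np.isnan(dosage), means, dosage)
    # 3. Drop variants below a minor-allele-frequency threshold.
    p = dosage.mean(axis=1) / 2
    maf = np.minimum(p, 1 - p)
    return dosage[maf >= maf_threshold]
```

Each of these steps is a full pass over the genotype matrix, so tools that always run them pay a cost that sgkit leaves optional.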

timothymillar commented 9 months ago

I'm also looking for implementations of pedigree kinship/relatedness to compare against. I think we will compare quite favorably, but I'm currently being slowed down by the sheer number of implementations (mostly R packages).

jeromekelleher commented 9 months ago

I guess we can just say that the dominant operation is a matmul then, and quote some times based on that? We could time doing just the matmul in R, showing that we're basically calling into equivalently optimised code?

So, report out time end-to-end, and then quote just the time taken to do the matmul in R? Or would that be super fiddly?

timothymillar commented 9 months ago

I think that would work. Report end-to-end time for implementations and have R and Numpy timings for just the matmul. Maybe a sentence or two covering some of the preprocessing/validation that some of the other implementations do.
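The matmul-only timing on the Python side could be as simple as the sketch below (random stand-in matrix, much smaller than the ~52,843 x 3,534 pig dataset, so the numbers are not representative; `crossprod(Z)` with `system.time()` would be the R counterpart):

```python
import time
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the centred dosage matrix (variants x samples);
# real benchmark would use the full dataset dimensions.
Z = rng.standard_normal((5000, 500))

t0 = time.perf_counter()
G = Z.T @ Z  # the crossproduct that dominates the end-to-end GRM time
elapsed = time.perf_counter() - t0
print(f"matmul-only time: {elapsed:.3f} s, result shape {G.shape}")
```

Reporting this alongside the end-to-end times would make it clear how much of each tool's runtime is the shared linear algebra versus its own preprocessing.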

timothymillar commented 9 months ago

https://github.com/dask/dask/issues/10645 appears to be fixed in their latest release (2023.12.1).