Currently, the implementation of the `LDdecay` tool is a complete mess. The main problem is that I did not design it properly: it started as a proof of concept that was very useful for analyzing a dataset, but not from a maintenance/development point of view. That is why, when I ported the tool, I deprecated most of the classes used by `LDdecay`.
We should either reimplement the tool from scratch on top of the classes already implemented, or modify those classes heavily. Although it will reduce performance a bit, there are several things that should be done:
The statistics to compute should be command line plugins. This would allow the user to compute just r2 or r2' (the normalized version), or several statistics at the same time, saving the cost of computing frequencies between all pairs and filtering twice. This is already done. It will reduce performance, because every computation is performed independently without reusing intermediate data from r2, for instance, but it will make it possible to add more LD statistics, like D'. In addition, implementing cached values (#29) will reduce this performance loss.
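As a rough illustration of the plugin idea (all names here are hypothetical, not the tool's actual API), each statistic could be a small function registered under the name the user passes on the command line, all operating on the same pair-wise allele frequencies:

```python
# Hypothetical sketch: each LD statistic is an independently registered
# function of the allele frequencies pA, pB and the haplotype frequency pAB.
# LD_STATS, register() and compute() are illustrative names only.

LD_STATS = {}

def register(name):
    def wrap(fn):
        LD_STATS[name] = fn
        return fn
    return wrap

@register("r2")
def r2(p_a, p_b, p_ab):
    """Squared correlation: D^2 / (pA (1-pA) pB (1-pB))."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

@register("dprime")
def dprime(p_a, p_b, p_ab):
    """D' = D / Dmax, where Dmax depends on the sign of D."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max else 0.0

def compute(selected, p_a, p_b, p_ab):
    """Run only the statistics the user asked for on the command line."""
    return {name: LD_STATS[name](p_a, p_b, p_ab) for name in selected}
```

With a registry like this, adding D' (or any new statistic) is just one more decorated function, at the cost of each plugin recomputing D on its own unless cached values (#29) are in place.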
Using `AlleleVector` will allow us to cache both the construction of pair-wise frequencies (for a limited number of haplotypes, I expect that some constructions appear several times within a dataset) and the computation of statistics. Because we need the position, a simple `Tuple` will be easy to keep in the queue. In addition, an `AlleleVector` could easily be transformed to biallelic, and in that way we can work around the limitation on multi-allelic sites (keeping the two major alleles).
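A minimal sketch of both ideas, assuming allele vectors are hashable tuples (function names are hypothetical): the position stays outside the cache key, so two sites with identical allele patterns at different positions share one cached count, and multi-allelic sites are reduced to their two major alleles before queueing.

```python
from collections import Counter
from functools import lru_cache

def to_biallelic(alleles):
    """Hypothetical workaround for multi-allelic sites: keep only the two
    major alleles, masking every other allele as missing (None)."""
    major = {a for a, _ in
             Counter(a for a in alleles if a is not None).most_common(2)}
    return tuple(a if a in major else None for a in alleles)

@lru_cache(maxsize=None)
def pairwise_freqs(site_a, site_b):
    """Cached pair-wise haplotype counts: identical pairs of allele
    vectors hit the cache instead of being recounted (the idea in #29)."""
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    return Counter(pairs), len(pairs)
```

In the queue itself, each entry would then be a plain `(position, allele_vector)` tuple, as described above.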
Currently, `LDdecay` performs filtering at two levels: first, variants that do not fit the requirements (biallelic/singleton/missing) are not stored in the queue; second, at computation time, pairs of variants that do not fit the same requirements are removed. We should use the same class to filter in both places so they stay in sync, all the more so if we allow custom filters or increase the complexity by adding new ones on the command line (#46). I do not know yet what the best approach for this is.
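One possible shape for that shared class, sketched here with hypothetical names and thresholds, is a single filter object consulted both before queueing and again when forming pairs, so the two stages cannot drift apart:

```python
from collections import Counter

# Hypothetical sketch of one filter class shared by both stages
# (queue insertion and pair computation). Names are illustrative.

class VariantFilter:
    def __init__(self, max_missing=0.0, min_minor_count=2,
                 require_biallelic=True):
        self.max_missing = max_missing            # allowed fraction of missing calls
        self.min_minor_count = min_minor_count    # 2 excludes singletons
        self.require_biallelic = require_biallelic

    def accept(self, alleles):
        """Return True if this allele vector passes all requirements."""
        present = [a for a in alleles if a is not None]
        if len(alleles) - len(present) > self.max_missing * len(alleles):
            return False
        counts = sorted(Counter(present).values())
        if self.require_biallelic and len(counts) != 2:
            return False
        return counts[0] >= self.min_minor_count

# Stage 1: queue only accepted variants.
# Stage 2: before computing a statistic, re-check both members of the
# pair with the SAME filter object, so custom filters (#46) apply twice.
```

Custom command line filters would then just be extra instances (or subclasses) passed to both stages.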
Add a command line parameter to decide which quantiles should be used to compute the distribution, with default values (related to #40). This would be a performance advantage for people who do not need to compute distributions of r2.
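A quick sketch of what that parameter could look like (the flag name, defaults, and the "pass no values to disable" convention are all assumptions, not decided in #40):

```python
import argparse

# Hypothetical command line for the quantile parameter (#40).
parser = argparse.ArgumentParser(prog="lddecay")
parser.add_argument(
    "--quantiles", type=float, nargs="*",
    default=[0.05, 0.5, 0.95],
    help="Quantiles of the r2 distribution to report; "
         "pass the flag with no values to skip the distribution entirely.")

args = parser.parse_args(["--quantiles", "0.25", "0.75"])
# args.quantiles == [0.25, 0.75]; "--quantiles" alone yields [],
# letting the tool skip the distribution computation altogether.
```

An empty list is a convenient sentinel for "no distribution", which is where the performance win for users who only want point estimates comes from.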
Probably other things should be changed here, but these are the ones that come to mind after porting the tool.