tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Python interface for 2-locus statistics #2829

Open lkirk opened 11 months ago

lkirk commented 11 months ago

Implement a python interface to the 2-locus statistics so that we may begin more rigorous testing.

lkirk commented 10 months ago

After talking with @apragsdale, I think that the LD interface will differ from the other stats interfaces. My understanding is that tskit typically groups stats by name and allows parameters for a given mode to be passed through a single entrypoint ts.stat(params, mode="site/branch/node"). Because the parameters for each mode differ so drastically, we propose that we instead group the stats by mode in the API instead of by stat name. We've sketched out what this might look like:

LD Matrix Only acts on sites, no branch stats.

ts.ld_matrix(summary_func='r2', sites=[[0, 1, 2], [2, 3]], sample_sets=[...])
output_shape = [n_sites, m_sites, n_sample_sets]

LD scores Mode specifies whether or not we're comparing sites x branches or sites x sites

ts.ld_scores(summary_func='r2', sites=[0, 2, 3], max_dist=100.5, sample_sets=[...], mode="site/branch")
output_shape = [n_sites, n_sample_sets]

LD Decay Summarize LD decay for branches or sites. The distance bins will be applied within a specific window.

ts.ld_decay(summary_func='r2', windows=[...], distance_bins=[...], mode="site/branch")
output_shape = [n_windows, n_distance_bins, n_sample_sets]

@petrelharp @jeromekelleher what are your thoughts about organizing things in this way?

apragsdale commented 10 months ago

To expand on the motivation here a bit: I don't think that the LD matrix vs scores vs decay represent different modes, in the way that branch/site/node are different modes to compute statistics (like diversity).

But rather, matrix/scores/decay are different two-locus summaries of the data with return values that have different shapes and interpretations, but the same shape and interpretation within functions for different summary functions (like r^2, r, or D^2).

These different functions can (sometimes) be used with either site or branch mode (except for the LD matrix, for which only "site" makes sense). Within each of these larger LD functions, you may wish to compute the LD matrix, for example, using many different summary functions ('r2', 'D2', 'r', 'D', 'r*r' across populations, 'D*D', and many others), so it seemed to make a lot of sense to move the summary functions inside these broader matrix/scores/decay functions as an argument, to keep the API more organized.

A fourth object to compute is the full two-locus haplotype sampling distribution (or two-locus frequency spectrum), which fits conceptually (and algorithmically) within the LD decay function, but wouldn't apply a summary function and instead fill in the sampling distribution, which may be quite large depending on sample sets, and sparse when in site mode.

petrelharp commented 10 months ago

I like it! It seems just right.

except for the LD matrix, for which only "site" makes sense

I think that a "branch" version of this would be the expected LD matrix for a collection of positions along the genome? (i.e., the [i,j]th entry would be the expected value of the LD between two mutations randomly placed at the ith and jth position)

apragsdale commented 10 months ago

Oh that’s true! I hadn’t thought of that. It would require being able to pass positions instead of site indexes but makes sense conceptually. Not sure if it’s high priority on the list of LD functions, but I like it. Can open an issue about that to keep track of the idea.

On Sep 6, 2023, at 10:23 PM, Peter Ralph @.***> wrote:



I like it! It seems just right.

except for the LD matrix, for which only "site" makes sense

I think that a "branch" version of this would be the expected LD matrix for a collection of positions along the genome? (i.e., the [i,j]th entry would be the expected value of the LD between two mutations randomly placed at the ith and jth position)

— Reply to this email directly, view it on GitHubhttps://github.com/tskit-dev/tskit/issues/2829#issuecomment-1709419505, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AIBL46SYVDOPXZYRGET76X3XZE427ANCNFSM6AAAAAA4DZVYRI. You are receiving this because you were mentioned.Message ID: @.***>

jeromekelleher commented 10 months ago

I like this way of breaking things down too, very nice. How about

ts.ld_matrix(sites=[[0, 1, 2], [2, 3]], stat="r2")

So, you're computing the LD matrix a given set of sites using a given "stat" (default=r2). I think "summary_func" might not be that intuitive for this?

You could also imagine doing this:

ts.ld_matrix(positions=[[123, 12345, 12345], [20000, 30000]], stat="r2", mode="branch")

You could make the first argument be a "positional only" arg which specifies the rows and columns of the matrix, and have that be interpreted as either sites or physical position values, depending on the context. However, I think this might be more useful, because you could easily imagine doing

ts.ld_matrix(positions=[[123, 12345, 12345], [20000, 30000]], stat="r2", mode="site")

and requiring that the positions correspond to the genomic coordinates of sites.

So, I think I'm advocating for two, mutually exclusive arguments (at the Python level) which can either specify site IDs or physical positions.

lkirk commented 5 months ago

@jeromekelleher I didn't quite see the subtlety between the last two you proposed until just now, specifically, I didn't see the positions/sites mode. I had assumed that users would provide site ids. Are you suggesting that users specify sites by position?

jeromekelleher commented 5 months ago

Yes, just as a convenience though. No need in the first pass.