Add "purity" filter - Githubissues

gaow commented 6 years ago

Currently I compute "purity" of sets separately based on LD and apply that to susie sets (too bad I coded it in Python). We should implement it as a separate function here. What should it be like? eg,

susie_get_sets(susie_in_CS(res), LD_mat = NULL, threshold = 0.2)

Or,

susie_get_sets(susie_in_CS(res), X = NULL, threshold = 0.2)

and we compute LD mat?

When LD mat is NULL, we just get the position of the variables for each set and report them as an R list of sets. With LD_mat filter we additionally compute minimal pairwise LD and remove the sets that fails the threshold?

stephens999 commented 6 years ago

I suggest instead operating directly on res, and we can just extend susie_in_CS to compute purity (if X or LD is provided) and to filter if purity_threshold is provided.

so susie_in_CS(res, LD_mat, purity_threshold)

maybe susie_get_CS is better? verb (get) better than adjective (in)?

we could have a separate function purity(sets,LDmat) that does the work?

On Fri, Jun 8, 2018 at 12:18 PM, gaow notifications@github.com wrote:

Currently I compute "purity" of sets separately based on LD and apply that to susie sets (too bad I coded it in Python). We should implement it as a separate function here. What should it be like? eg,

susie_get_sets(susie_in_CS(res), LD_mat = NULL, threshold = 0.2)

Or,

susie_get_sets(susie_in_CS(res), X = NULL, threshold = 0.2)

and we compute LD mat?

When LD mat is NULL, we just get the position of the variables for each set and report them as an R list of sets. With LD_mat filter we additionally compute minimal pairwise LD and remove the sets that fails the threshold?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/19, or mute the thread https://github.com/notifications/unsubscribe-auth/ABt4xTYY3PW1u4qMLwEzM-8igxJ34j7Hks5t6rHbgaJpZM4UgqtC .

gaow commented 6 years ago

@stephens999 sorry I just cleaned up mostly my todo list on simulation so getting back to this one now. I think currently we use in_CS to refer to a binary vector of length p with 1 for a variable in CS. We did not provide clusters of the exact position of variables. That would be more like get_CS.

How about we then make in_CS as well as n_in_CS not exported function, and implement get_CS to return a list of actual CS? I guess in_CS is not that useful.

BTW n_in_CS can be trivially computed from get_CS or in_CS results so we still decide not to export it, right?

we could have a separate function purity(sets,LDmat) that does the work?

I agree it is cleaner to have a separate function for developers when we want to explore how "purity" works. But I think now that we have some idea about it through separate studies, then here from user's prospective having less functions is better. So I'm voting for:

susie_get_CS(res, X = NULL, R = NULL, threshold = 0.2)

Buy default we just give multiple CS by index of variables
If either X or R is provided we use LD filter. and maybe call it correlation or r2 -- better not call it LD_mat for context outside genetic fine-mapping. What's your pick? The default threshold 0.2 is chosen from our experience in genetic mapping studies.
If both X and R are provided, throw error.

Does it sound Okay?

stephens999 commented 6 years ago

I think correlation, X'X, is more natural to provide than squared correlation?

stephens999 commented 6 years ago

So maybe Xcor ?

gaow commented 6 years ago

Cool, Xcor it is!

stephenslab / susieR

Add "purity" filter #19