matloff / qeML

Feature request: KFOCI #10

Open lang-benjamin opened 11 months ago

lang-benjamin commented 11 months ago

There is an R package, KPC (https://cran.r-project.org/web/packages/KPC/), that implements a more general and improved version of FOCI, called KFOCI (Kernel FOCI), proposed by Huang et al. In certain settings the improvement over existing methods is quite remarkable, so I believe it would be a great addition to have functions similar to the ones for FOCI. What do you think? For categorical variables that have a natural order, it may even be possible to avoid creating dummy variables by treating them as integer-valued variables.

matloff commented 11 months ago

Thanks very much, excellent idea! I had not been aware of KPC.

Norm

lang-benjamin commented 11 months ago

Great, happy to hear that! Some more food for thought: if binary/categorical variables are included, there will be randomness when calling KFOCI (due to tie-breaking in the k-NN graph). So it could make sense to call KFOCI multiple times on the same data set and somehow condense or visualize the results. For the former, some sort of stability selection could be done, e.g. as proposed in Section 2.3 of Kormaksson et al. (https://onlinelibrary.wiley.com/doi/10.1002/sim.8955). That proposal is made in a slightly different context, but it sounds generic and could be applicable to KFOCI (and FOCI) as well. Unfortunately, I do not know how this "stable set" would behave; maybe it is not a good idea, because it could violate the nice property of Theorem 7 in Huang et al. Any thoughts?
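To make the source of the randomness concrete, here is a toy Python sketch. It is not KPC's implementation: the helper `knn_random_ties` is hypothetical and the data are one-dimensional. It only shows that when a variable is binary, many distances are exactly equal, so a random tie-break can return different nearest neighbors from one call to the next.

```python
import random

def knn_random_ties(points, i, k, rng):
    """Indices of the k nearest neighbours of points[i] (1-D data),
    with equal distances ordered at random. Hypothetical helper."""
    others = [j for j in range(len(points)) if j != i]
    # Sort by (distance, random key): the random key only matters for ties.
    keyed = [(abs(points[j] - points[i]), rng.random(), j) for j in others]
    keyed.sort()
    return [j for _, _, j in keyed[:k]]

# A binary variable: points 1 and 2 are both at distance 0 from point 0.
x = [0, 0, 0, 1, 1, 1]

# Collect the 1-NN of point 0 over several seeds; the answer can differ.
neighbour_sets = {
    tuple(knn_random_ties(x, 0, 1, random.Random(seed))) for seed in range(20)
}
print(neighbour_sets)
```

With continuous data and no ties, the same helper returns the same neighbors regardless of the seed, which is why the issue only shows up once binary or categorical variables enter.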

matloff commented 11 months ago

Unfortunately, I don't have time to go through the theory in Huang et al., and anyway, remember that I was not involved in the theory behind FOCI.

However, re Kormaksson et al., I can at least offer a comment. Note the function qeFOCImult. It runs FOCI on m cores, resulting in m sets of features. The user can specify whether to take the union (aggressive) or the intersection (conservative) of the m sets. It would seem that what Sec. 2.3 of Kormaksson et al. does is somewhat similar in spirit to taking the intersection in qeFOCImult.
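The union/intersection idea can be sketched in a few lines of Python. This is only an illustration of the set logic, not qeFOCImult's actual code or interface; the function name `combine_feature_sets` and the variable names in the example are made up.

```python
def combine_feature_sets(runs, how="intersection"):
    """Combine the feature sets selected in several runs.

    runs: list of iterables of feature names, one per run.
    how:  "union" (aggressive) or "intersection" (conservative).
    """
    sets = [set(r) for r in runs]
    if not sets:
        return set()
    if how == "union":
        # Keep every feature that was selected in at least one run.
        return set().union(*sets)
    if how == "intersection":
        # Keep only features that were selected in every run.
        return set.intersection(*sets)
    raise ValueError("how must be 'union' or 'intersection'")

# Three hypothetical runs on the same data.
runs = [["age", "wage", "educ"], ["age", "educ"], ["educ", "age", "occ"]]
print(sorted(combine_feature_sets(runs, "union")))
print(sorted(combine_feature_sets(runs, "intersection")))
```

The intersection can only shrink as more runs are added, which is what makes it the conservative choice.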

lang-benjamin commented 11 months ago

Thanks, point taken! I appreciate your comment; it goes exactly in the direction I was aiming for.

matloff commented 11 months ago

Good, please let me know what you find works well.

lang-benjamin commented 2 weeks ago

I think it is fair to say that KFOCI performs better than FOCI. Still, I found that it lacks 'power' for linear or monotone relationships (this is in line with other observations, e.g. in "A survey of some recent developments in measures of association"). Possible mitigation strategies might be to decrease the number of neighbors K in the k-NN graph (e.g. n/40 instead of n/20) or to combine the KFOCI selection with the variables selected by ncvreg::cv.ncvreg (although in that case the resulting set may be harder to interpret, as there is no longer a clear ordering among the combined variables).

I also found that the algorithm in Kormaksson et al. (with r = 0.5) works quite well when applied to multiple independent runs of KFOCI on the same data set. One disadvantage is that the ordering of the variables gets lost; this may, however, be resolved by saving the ranks from each run and then investigating the tuples of ranks via cdparcoord::discparcoord.
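As a rough Python sketch of that workflow: the code below assumes that r acts as a selection-frequency threshold (keep a variable if it is selected in at least a fraction r of the runs). That reading is an assumption and may not match the algorithm in Kormaksson et al. in detail. The function names and the example runs are hypothetical; each run is taken to be the ordered list of variables returned by one call.

```python
from collections import Counter

def stable_set(runs, r=0.5):
    """Variables selected in at least a fraction r of the runs.
    (Assumed reading of r; not taken from the paper.)"""
    m = len(runs)
    counts = Counter(v for run in runs for v in set(run))
    return {v for v, c in counts.items() if c / m >= r}

def rank_table(runs):
    """For each variable, its 1-based rank in every run (None if not selected).
    Keeps the ordering information that the stable set discards."""
    variables = sorted({v for run in runs for v in run})
    return {
        v: [run.index(v) + 1 if v in run else None for run in runs]
        for v in variables
    }

# Four hypothetical runs, each an ordered selection.
runs = [["x1", "x3", "x2"], ["x3", "x1"], ["x3", "x4"], ["x1", "x3"]]
print(sorted(stable_set(runs, r=0.5)))
for var, ranks in rank_table(runs).items():
    print(var, ranks)
```

The rank tuples produced by `rank_table` are the kind of per-run data that could then be passed to a parallel-coordinates plot to see how stable each variable's position is.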