octopize / octopize-linkage

Add targeted contribution pre-linkage metric #12

Open olivierabz opened 3 days ago

olivierabz commented 3 days ago

I suggest adding a metric that measures not how well a set of selected columns represents the whole dataset (as the contribution score does), but how well the selected columns represent a specific set of variables (or how strongly they are correlated with them).

This means specifying a target set of variables. IMO, this is relevant to practical use, where we may want to enrich a dataset A with data from a dataset B for a specific task (e.g. predicting variable B_3 from variables A_1, A_2, B_1, B_2). In that case, the linkage variables should be well correlated with the target variable B_3.
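One possible way to make "how much the selected columns represent a target set of variables" concrete is sketched below. This is only an illustration, not the existing contribution score or a proposed API: the function name `targeted_score`, the choice of a linear least-squares fit, and the use of mean R² as the aggregate are all assumptions, and the data is synthetic.

```python
import numpy as np


def targeted_score(linkage, targets):
    """Hypothetical targeted score: mean R^2 obtained when each target
    column is predicted from the linkage columns by linear least squares.

    linkage: array of shape (n_rows, n_linkage_vars), numeric
    targets: array of shape (n_rows, n_target_vars), numeric
    """
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(linkage)), linkage])
    # One least-squares fit per target column (solved jointly).
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    residuals = targets - design @ coef
    ss_res = (residuals ** 2).sum(axis=0)
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


# Synthetic illustration: one target driven by the linkage variables,
# one target independent of them.
rng = np.random.default_rng(0)
linkage_vars = rng.normal(size=(300, 2))
related_target = linkage_vars @ np.array([[2.0], [-1.0]]) + 0.1 * rng.normal(size=(300, 1))
unrelated_target = rng.normal(size=(300, 1))

score_related = targeted_score(linkage_vars, related_target)
score_unrelated = targeted_score(linkage_vars, unrelated_target)
```

With this construction, the score should be close to 1 for the related target and close to 0 for the unrelated one. A linear fit only captures linear relationships, and categorical linkage variables would need encoding first, so a real metric would likely need something more general.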

olivierabz commented 2 days ago

Aim: predict B3 from A1, AB2, B1, e.g. predict salary from loan, age, university.

salary: dataset B
loan: dataset A
age: datasets A and B
university: dataset A

Linkage variables: AB2, AB3, AB4

e.g. age, nationality, residence, height, ... (present in both A and B)

Run target metrics at source B:
corr(AB2, AB3, AB4) -> B3
feature importance(AB2, AB3, AB4) -> B3
error(AB2, AB3, AB4) -> B3
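The three target metrics listed above could be sketched roughly as follows, computed on source B only. Everything here is an assumption made for illustration: the function name `target_metrics`, the use of absolute Pearson correlation for `corr`, a linear model with hold-out RMSE for `error`, and permutation importance (increase in RMSE when one column is shuffled) for `feature importance`. The data is synthetic and the variables are assumed numeric.

```python
import numpy as np


def target_metrics(linkage, target, seed=0):
    """Hypothetical target metrics for numeric linkage variables.

    linkage: array of shape (n_rows, n_linkage_vars) taken from source B
    target:  array of shape (n_rows,), e.g. B3
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(target))
    half = len(target) // 2
    train, test = order[:half], order[half:]

    def design(matrix):
        # Add an intercept column.
        return np.column_stack([np.ones(len(matrix)), matrix])

    # corr: absolute Pearson correlation of each linkage variable with the target.
    corr = np.array([
        abs(np.corrcoef(linkage[:, j], target)[0, 1])
        for j in range(linkage.shape[1])
    ])

    # error: hold-out RMSE of a linear model fitted on the training half.
    coef, *_ = np.linalg.lstsq(design(linkage[train]), target[train], rcond=None)

    def rmse(matrix):
        return float(np.sqrt(np.mean((design(matrix) @ coef - target[test]) ** 2)))

    error = rmse(linkage[test])

    # feature importance: increase in hold-out RMSE when one column is shuffled.
    importance = []
    for j in range(linkage.shape[1]):
        shuffled = linkage[test].copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        importance.append(rmse(shuffled) - error)

    return {"corr": corr, "importance": np.array(importance), "error": error}


# Synthetic illustration: salary depends on age but not on height.
rng = np.random.default_rng(42)
n_rows = 400
age = rng.uniform(20, 60, n_rows)
height = rng.normal(170, 10, n_rows)
salary = 1000 * age + rng.normal(0, 2000, n_rows)

metrics = target_metrics(np.column_stack([age, height]), salary)
```

In this synthetic setup, `age` should come out with a high correlation and a large importance, while `height` should score near zero on both. Categorical variables such as nationality or residence would need an encoding step or a different association measure, which this sketch does not cover.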

Is there a correlation between {age, nationality, residence, height} and salary?

Extreme case: using a matricule (registration/ID number)?