karlrohe / longpca

A formula interface for model-first PCA; PCA for the people!
https://karlrohe.github.io/longpca

diagnostics / operations **before** calling pca? #1

Open karlrohe opened 9 months ago

karlrohe commented 9 months ago

What diagnostics/operations do we want to have access to before calling pca_?

Right now,

1) diagnose looks at degree stuff
2) pick_dim computes cross-validated eigenvalues
3) coming soon-ish: ability to chop off low degree nodes (maybe compute k-core), pre-pca.
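To make the degree and k-core ideas concrete, here is a minimal base-R sketch on a toy long-format table. It is not the longpca implementation: `edges`, `k_core`, and the column names are hypothetical, and it assumes one line per distinct (row, col) pair so that counting lines gives the degree.

```r
# Toy long-format interaction data (hypothetical names, not longpca's API).
# Assumes one line per distinct (row, col) pair.
edges <- data.frame(
  row = c("a", "a", "a", "b", "b", "c", "c", "d"),
  col = c("x", "y", "z", "x", "y", "x", "y", "z"),
  stringsAsFactors = FALSE
)

# Degree of each row node and each column node.
row_degree <- table(edges$row)
col_degree <- table(edges$col)

# Repeatedly drop interactions whose row or column has degree below k,
# recomputing degrees each pass, until nothing more is dropped.
k_core <- function(edges, k) {
  repeat {
    row_deg <- table(edges$row)
    col_deg <- table(edges$col)
    keep <- as.vector(row_deg[edges$row] >= k & col_deg[edges$col] >= k)
    if (all(keep)) return(edges)
    edges <- edges[keep, , drop = FALSE]
  }
}

core2 <- k_core(edges, k = 2)
```

Dropping a node can push its neighbors below the threshold, which is why the degrees are recomputed on every pass rather than filtered once.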

All of this suggests that building a new class ("interaction model"?) would add some speed and convenience (for some). You could still use pca_sum(outcome~row*col, data, k), but you could also precompute the im object (for diagnostics) and even edit the data:

im = as_im(outcome~row*col, data)
diagnose(im)
im_new = core(im, k=3)

Then, pca_sum also accepts these im objects:

pcs = im_new %>% pca_sum(k=4)
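One possible shape for such an object, sketched as a base-R S3 class. Everything here is an assumption for discussion: `as_im_sketch`, `diagnose_sketch`, and the field names are hypothetical, and the sketch only handles a formula with a named outcome column on the left-hand side.

```r
# Hypothetical constructor: parse outcome ~ row * col and precompute the
# pieces that diagnostics and later edits would need.
as_im_sketch <- function(formula, data) {
  vars <- all.vars(formula)  # expected order: outcome, row, col
  stopifnot(length(vars) == 3, all(vars %in% names(data)))
  row_ids <- sort(unique(data[[vars[2]]]))
  col_ids <- sort(unique(data[[vars[3]]]))
  structure(
    list(
      interactions = data.frame(
        row_num = match(data[[vars[2]]], row_ids),
        col_num = match(data[[vars[3]]], col_ids),
        outcome = data[[vars[1]]]
      ),
      row_universe = data.frame(id = row_ids, row_num = seq_along(row_ids)),
      column_universe = data.frame(id = col_ids, col_num = seq_along(col_ids)),
      settings = list(formula = formula)
    ),
    class = "im_sketch"
  )
}

# Hypothetical diagnostic that reads only the precomputed object.
diagnose_sketch <- function(im) {
  data.frame(
    n_rows = nrow(im$row_universe),
    n_cols = nrow(im$column_universe),
    n_interactions = nrow(im$interactions)
  )
}

dat <- data.frame(
  row = c("a", "a", "b"),
  col = c("x", "y", "x"),
  outcome = c(1, 2, 3),
  stringsAsFactors = FALSE
)
im <- as_im_sketch(outcome ~ row * col, dat)
summary_tbl <- diagnose_sketch(im)
```

Keeping the formula in `settings` means downstream functions can accept either the raw formula plus data or the precomputed object.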

What are other popular diagnostics that folks run before computing pca? Perhaps:

1) apply sqrt or log(x + 1) to the elements of the matrix?
2) drop nodes?
3) weight the vertices?
4) others... please suggest!
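For reference, the first and third options look like this on a small dense matrix in base R. The weighting shown is only one reading of "weight the vertices" (an inverse square-root degree weight with the mean degree added as a regularizer); the choice of weight and all variable names are assumptions.

```r
# Small dense stand-in for the interaction matrix.
A <- matrix(c(0, 1, 4,
              9, 0, 16), nrow = 2, byrow = TRUE)

# Elementwise transformations; both keep zeros at zero.
A_sqrt <- sqrt(A)
A_log  <- log1p(A)  # log(x + 1)

# One possible vertex weighting (an assumption, not longpca's default):
# scale row i by w_row[i] and column j by w_col[j].
row_deg <- rowSums(A)
col_deg <- colSums(A)
w_row <- 1 / sqrt(row_deg + mean(row_deg))
w_col <- 1 / sqrt(col_deg + mean(col_deg))
A_weighted <- A * outer(w_row, w_col)
```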

karlrohe commented 9 months ago

a vote for weighting vertices!?

karlrohe commented 9 months ago

im objects could also be passed to other matrix decompositions!

karlrohe commented 9 months ago

im looks too much like lm? perhaps io for "interaction object"? other names??

karlrohe commented 9 months ago

at some point, we will want to "append" variables to the row_universe, column_universe, or "values inside the matrix". These values could be useful for interpreting the pcs. Moreover, they might be used to fit some models. For example,

1) we might have a "treatment variable" on values inside the matrix or
2) perhaps if it is a hyperlinked text corpus, then we have text on the row_universe and we might do something like pairGraphText

so, this is an operation that we will want to be able to perform on the interaction_model object.
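A minimal sketch of what "append" could mean for the row_universe, in base R. The tables and column names are hypothetical. It uses `match()` rather than `merge()` so that the existing row order, and therefore the row numbering, is preserved.

```r
# Hypothetical row_universe as it might sit inside the object.
row_universe <- data.frame(
  row_id = c("p1", "p2", "p3"),
  row_num = 1:3,
  stringsAsFactors = FALSE
)

# Extra per-row information, in a different order and missing p2.
row_text <- data.frame(
  row_id = c("p3", "p1"),
  abstract = c("spectral clustering", "matrix completion"),
  stringsAsFactors = FALSE
)

# Left join that keeps row_universe's order; unmatched rows get NA.
idx <- match(row_universe$row_id, row_text$row_id)
row_universe$abstract <- row_text$abstract[idx]
```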

karlrohe commented 9 months ago

another operation we might want to perform... in the case of low-rank matrix completion, we might want to fit a "fixed effect model" (using row/column id's as the factors) to center the data before fitting.
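One way to read the fixed-effect centering, sketched with base R's `lm()` on the observed entries only. The data and names are made up for illustration; the residuals are what would then be passed on to the low-rank fit.

```r
# Observed entries of a partially observed matrix, in long format
# (hypothetical data).
ratings <- data.frame(
  row = c("a", "a", "b", "b", "c", "c", "d"),
  col = c("x", "y", "x", "y", "x", "z", "z"),
  outcome = c(5, 3, 4, 1, 2, 5, 4),
  stringsAsFactors = FALSE
)

# Additive row + column fixed effects, fit by least squares.
fit <- lm(outcome ~ factor(row) + factor(col), data = ratings)

# Centered values: what is left after removing row and column effects.
ratings$centered <- residuals(fit)
```

Because the design contains an indicator for every row and column level, the centered values sum to zero within each row and within each column.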

karlrohe commented 9 months ago

If our nodes are journal articles, sometimes it makes sense to "block model" by journal. We did this in the example journal graph used in cv_eigen + vsp + tsg.

empirically, i think this is a super powerful data operation. not that interesting theoretically or methodologically. So, the literature never talks about it.

is this something that we want to enable after make_interaction_model? Or, is this something that... if you want to do that blocking... you should do it in your tibble, then make a new formula?

if it is an operation performed on an interaction_model, then we would first need to run im = append(im, tibble_giving_paper_journal). Then... would it be an argument to pca? or would there be another function like im = block(im, journal), which might just change something like im$setting$rowxxxx = journal?
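For comparison, the "do it in your tibble" route is short in base R. This is a sketch with invented data and column names: attach the journal to each citing paper, then sum interactions within (journal, cited paper).

```r
# Paper-to-paper citations in long format (hypothetical data).
citations <- data.frame(
  from_paper = c("p1", "p1", "p2", "p3", "p3", "p4"),
  to_paper   = c("q1", "q2", "q1", "q2", "q3", "q3"),
  stringsAsFactors = FALSE
)

# Lookup table giving each citing paper's journal.
paper_journal <- data.frame(
  from_paper = c("p1", "p2", "p3", "p4"),
  journal    = c("JASA", "JASA", "AoS", "AoS"),
  stringsAsFactors = FALSE
)

# Attach the journal, then count citations per (journal, cited paper).
blocked <- merge(citations, paper_journal, by = "from_paper")
blocked$n <- 1
journal_counts <- aggregate(n ~ journal + to_paper, data = blocked, FUN = sum)

# The blocked table then takes a new formula in the same style as above,
# with journal as the row variable, e.g. n ~ journal * to_paper.
```

The row side shrinks from four papers to two journals, while the column side is left unchanged.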