graspologic-org / graspologic

Python package for graph statistics
https://graspologic-org.github.io/graspologic/
MIT License
811 stars 143 forks source link

Directed and Indefinite SeedlessProcrustes #537

Open alyakin314 opened 4 years ago

alyakin314 commented 4 years ago

Expected Behavior

When using SeedlessProcrustes to align two directed embeddings, rather than stacking the two embeddings together and then using seedless, one should first use seedless on them independently, and then stack them. the motivation here is that otherwise the information from one part of directedness ("from") will spill into the other part ("to") and we do not want that.

This will require some reordering and reorganization of the LDT code (and that's it).

Actual Behavior

This totally slipped out of my mind, and we are currently stacking the embeddings and subsequently calling the appropriate aligner.

upd: see two posts down for what should actually happen.

alyakin314 commented 4 years ago

Not sure if bug or enhancement.

alyakin314 commented 4 years ago

After further conversation with @jagterberg , it seems that what should happen is something different:

First, SeedlessProcrustes needs to feature another parameter, let's call it q for now. The optimal transport is performed in an unmodified way. However, the procrustes part of it should take the first q components and perform procrustes on them, then take the last d-q components and perform procrustes on them. Lastly, the two matrices are assembled into a single one using a direct sum. This results in a block-diagonal matrix. The rest of the algorithm is unmodified. This will be used both by directed and indefinite versions.

For the SeedlessProcrustes with directed graphs, q just needs to be set to d / 2. Note that d is not the dimension we embed in, but the dimension we end up with after concatenating, so it's 2x that of which we embed into.

For the Indefinite SeedlessProcrustes, what needs to happen is the following:

  1. Embed the two graphs into dimension (let's call it d)
  2. Determine the eigenvalues of the components used in the embeddings (note that eigenvalues have signs, unlike singular values). Let the number of positive eigenvalues (out of the first d) be q.
  3. Sort the components according to corresponding eigenvalues (note: NOT according to their magnitude).
  4. Run the procrustes on these sorted embeddings, setting q=q.

There are other ways of implementing this: for example, one can instead be providing a vector of labels, associated with whether the corresponding eigenvalue is positive of negative: (e.g. eigenvalues [-0.5. 0.2, -0.1, 0.01]. would yield this [0, 1, 0, 1]).

@jagterberg is what i said above correct?

@bdpedigo @j1c @dwaynepryce i can't think of a good API for the indefinite case. We might have to create something that wraps embedding and aligning together.

alyakin314 commented 4 years ago

First, SeedlessProcrustes needs to feature another parameter, let's call it q for now. The optimal transport is performed in an unmodified way. However, the procrustes part of it should take the first q components and perform procrustes on them, then take the last d-q components and perform procrustes on them. Lastly, the two matrices are assembled into a single one using a direct sum. This results in a block-diagonal matrix. The rest of the algorithm is unmodified. This will be used both by directed and indefinite versions.

I should've added that it basically should mirror the R implementation here: https://github.com/jagterberg/nonparGraphTesting/blob/68154f1d5891b1ac914241e99c626f274b17ac4f/R/optimal_transport.R#L56-L73