netsiphd / netrd

A library for network {reconstruction, distances, dynamics}
https://netrd.readthedocs.io/en/latest/
MIT License

Separating feature construction from calculation of distances #174

Open sdmccabe opened 5 years ago

sdmccabe commented 5 years ago

Many of the graph distances here are essentially two-step processes:

  1. calculate some feature vector for each graph
  2. calculate a distance between those feature vectors

For example, DegreeDivergence computes degree histograms and then the JSD between them, while NetSimile constructs a vector of "signatures" and then computes the Canberra distance between them. The choice of measure here is sometimes principled, sometimes arbitrary.

To make the package more flexible, it would be preferable to make these two steps more explicitly distinct. In practice this would look a lot like how the thresholding issue for graph reconstructions was handled (#38).

A conceptual illustration using DegreeDivergence:

class DegreeDivergence(BaseDistance):
    def dist(self, G1, G2, measure_type='jsd'):
        # get the degree sequences of G1 and G2
        # convert the degree sequences into degree histograms p1 and p2
        dist = measure(p1, p2, measure_type)

where measure is basically a wrapper around things like np.linalg.norm, sp.stats.wasserstein_distance, or our JSD function (#165).
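A sketch of what such a `measure` wrapper could look like. This is purely illustrative, not netrd's actual implementation; the hand-rolled JSD and the set of supported names are assumptions for the example:

```python
import numpy as np

def _jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2

    def kl(a, b):
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return (kl(p, m) + kl(q, m)) / 2

def measure(p1, p2, measure_type="jsd"):
    """Dispatch a comparison between two feature vectors by name."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    if measure_type == "jsd":
        return _jsd(p1, p2)
    if measure_type == "euclidean":
        return float(np.linalg.norm(p1 - p2))
    if measure_type == "canberra":
        denom = np.abs(p1) + np.abs(p2)
        mask = denom > 0  # skip 0/0 terms, as SciPy's canberra does
        return float(np.sum(np.abs(p1 - p2)[mask] / denom[mask]))
    raise ValueError(f"unknown measure_type {measure_type!r}")
```

In practice the branches could just as well delegate to `scipy` implementations; the point is only that the comparison step becomes a single, swappable function.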

leotrs commented 5 years ago

Something I hadn't thought of before is that this will allow us to write better tests for distances. Once we have made this change, we will be able to more easily write tests for step 1 (calculating features) and, separately, tests for the actual distance measures, i.e., tests for measure in your example.

This actually bumps up the priority of this in my mind.

leotrs commented 5 years ago

So what does this look like? I think the first step is to start categorizing the possible outputs of the first step. This will help elucidate what we need to provide for a standardized second step.

Step one: compute some statistic of the network. This can take the form of a vector (which may or may not be normalized and interpreted as a discrete probability distribution), a scalar, or a matrix. Step two: implement ways of comparing vectors, distributions, scalars, or matrices.

Is this really all we need? I guess there's also the experience from the LaplacianSpectral distance which adds an intermediate step of applying a smoothing kernel to the original output of a vector.

sdmccabe commented 5 years ago

An idea I've been playing around with, major breaking change so probably for a v1.0 milestone, is to add a few more methods to the base classes. This could more explicitly separate the steps, and avoid us abusing kwargs too much. This would be helpful for both distances and reconstructions. So something like

G1 = [some graph]
G2 = [some graph]

dist = netrd.distance.NetSimile()
dist.initialize(G1, G2, params)
D = dist.distance('canberra')

TS = [some time series]
recon = netrd.reconstruction.CorrelationMatrix()
weights = recon.construct(TS, params)
adj = recon.threshold(thresholder_params)
G = recon.graph(create_graph_params)

"Initialize" is definitely a placeholder verb here.

leotrs commented 5 years ago

In this example, what would initialize do and what would distance do?

sdmccabe commented 5 years ago

In the case of NetSimile, it would construct the feature matrices and signature vectors. Then you could cycle through distance() method calls to generate, e.g., Canberra, JSD, whatever, without having to recompute things over and over. Basically I'm calling for more OOP, more uses of self, etc.

leotrs commented 5 years ago

In this case we should also rethink the use of the results dictionary, as well as maybe using 3.7's data classes.

sdmccabe commented 5 years ago

Definitely open to reconsidering results. Less sure about data classes, since they're 3.7-only.

leotrs commented 5 years ago

Is there a way to use dataclasses if on 3.7, and not otherwise?

sdmccabe commented 5 years ago

I think dataclasses are basically a rip-off of attrs, so we could look into that.

leotrs commented 5 years ago

Fair

sdmccabe commented 5 years ago

On the difference between attrs and dataclasses

I think this could be a good fit, but need to read up a little more on the project.

sdmccabe commented 5 years ago

Revisiting this issue, I think one possibility is to have BaseDistance do more of the heavy lifting. For many (though not all) of the distances, what we're fundamentally interested in is constructing the features that are fed into whatever measure we use. So we would modify BaseDistance to something like the following:

class BaseDistance:
    __init__(arguments)
    initialize()    # feature construction
    dist()          # measures constructed from self.features
    self.features   # a 2-tuple of the input features created by initialize()
    self.parameters # store the parameters for the distance for reference

And then when we write a distance, we inherit from BaseDistance and override the relevant methods, usually just __init__() and initialize(). For example:

class NetSimile(BaseDistance):
    __init__(arguments) # set self.parameters
    initialize()        # create the feature matrices -> signature vectors
    self.features
    self.parameters

    # and then the NetSimile-specific things that are currently in self.results:
    self.signature_vectors # identical to self.features
    self.feature_matrices

And then don't touch dist(); so usage is something like:

NetSimile(G1, G2).initialize().dist('canberra')

I think this helps us separate features and distances and get past the clunky self.results but requires us to write this omnibus BaseDistance.dist() method. Fortunately, we've split some of the functions (e.g., JSD) off already and I think @jkbren has some code for this already.

Where do I think this won't work?

  1. we probably need some way of distinguishing what is and is not invariant to graph labelling
  2. some distances are just distances, with the nodes as the features (Jaccard, for example), and fit a little oddly into this framework

This could be simplified a little by doing everything in __init__, but we might have some reason for separating the initialization of the class from the feature construction process (I don't know if we gain anything from doing so, though.)
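A minimal runnable sketch of this shape. All names here are stand-ins for illustration (DegreeHistogramDistance is a toy, not a netrd class), and dist() supports only one measure to keep the example short:

```python
import numpy as np
import networkx as nx

class BaseDistance:
    """Sketch: build features once in initialize(), compare them in dist()."""

    def __init__(self, G1, G2, **parameters):
        self.parameters = parameters
        self._graphs = (G1, G2)
        self.features = None  # a 2-tuple, filled in by initialize()

    def initialize(self):
        raise NotImplementedError  # subclasses construct self.features here

    def dist(self, method="euclidean"):
        f1, f2 = self.features
        if method == "euclidean":
            return float(np.linalg.norm(f1 - f2))
        raise ValueError(f"unknown method {method!r}")

class DegreeHistogramDistance(BaseDistance):
    """Toy feature-based distance: compare normalized degree histograms."""

    def initialize(self):
        # shared histogram length so the two feature vectors are comparable
        n = 1 + max(
            max((d for _, d in G.degree()), default=0) for G in self._graphs
        )
        self.features = tuple(
            np.bincount([d for _, d in G.degree()], minlength=n) / G.number_of_nodes()
            for G in self._graphs
        )
        return self  # returning self allows chaining: obj.initialize().dist()
```

With this, `DegreeHistogramDistance(G1, G2).initialize().dist('euclidean')` mirrors the chained usage above, and a new measure only touches the base class's dist().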

leotrs commented 5 years ago

I rather like this. In your example, what would go in self.features?

How is the omnibus method going to handle NotImplemented? For example, in NonBacktrackingSpectral, the features that are compared are vectors, so I would expect NonBacktrackingSpectral(G1, G2).initialize().dist('Frobenius') to error out as 'Frobenius' is defined for matrices, not vectors. Ideally, this would error out as soon as possible and hopefully with an in-house exception class (as opposed to letting numpy error out downstream).

To your comments:

  1. Why?
  2. I think that's fine, we would just have to get clever with what gets stored in self.features.

Lastly, I don't like initialize, I vote for featurize, but I'm getting ahead of myself.

Lastly lastly, initialize should take care to only compute things once.

sdmccabe commented 5 years ago

In my example, NetSimile.initialize() would construct the feature vectors, and assign them as self.signature_vectors and self.features. NetSimile.dist() would use self.features to calculate the distance. So some of the logic in BaseDistance.dist() would include:

if method == 'canberra':
    return canberra_distance(self.features[0], self.features[1])

Basically the dist function is just one giant select statement. We can do error checking there too, and I think ValueErrors are fine for this:

if method == 'frobenius':
    if self.features[0].ndim < 2:
        raise ValueError("Frobenius norm requires 2-D feature arrays")
    return frobenius_distance(self.features[0], self.features[1])

I think it might make sense for the class to store a default distance, so e.g., if you did NetSimile().initialize(G1, G2).dist() it would automatically use Canberra. This makes things simpler for users.
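The default-distance idea might look something like this toy sketch (the class, the stand-in initialize(), and the single supported measure are all hypothetical):

```python
class DistanceWithDefault:
    """Sketch of a per-class default measure."""

    default_method = "canberra"  # e.g., NetSimile would default to Canberra

    def __init__(self):
        self.features = None

    def initialize(self, f1, f2):
        # stand-in for real feature construction from two graphs
        self.features = (list(f1), list(f2))
        return self

    def dist(self, method=None):
        method = method or self.default_method  # fall back to the default
        f1, f2 = self.features
        if method == "canberra":
            return sum(
                abs(a - b) / (abs(a) + abs(b))
                for a, b in zip(f1, f2)
                if abs(a) + abs(b) > 0  # skip 0/0 terms
            )
        raise ValueError(f"unknown method {method!r}")
```

So `DistanceWithDefault().initialize(f1, f2).dist()` just works, while `dist('canberra')` stays available for explicit choices.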

I also don't like initialize. featurize might work; based on some of the language from the graphend draft I was also thinking describe.

Maybe we don't need to take special care with labelling; we'll see what happens when we start writing, I guess?

@jkbren was concerned that adding two verbs might be confusing for users and proposed that we fold all the work of initialize into __init__, so that, with a default distance, the usage would only change from NetSimile().dist(G1, G2, <params>) to NetSimile(G1, G2, <params>).dist().

leotrs commented 5 years ago

I agree on the two verbs being confusing, and the default distance. I don't think that __init__ should do everything, I think we should delay the heavy lifting as much as possible.

I vote for NetSimile(G1, G2, params).dist(). This is closer to sklearn than the others. We can have dist use any self.features if they exist, or call self.featurize() if they don't. So the user never calls it explicitly.

EDIT: I'm assuming that self.featurize() only needs to be called once, because I'm also assuming that G1 and G2 never change. This may not be the case. I think a good(?) workflow would be:

obj = NetSimile(G1, G2, params)
dist1 = obj.dist()    # dist() is calling featurize() under the hood
... change G1 ...
obj.featurize()       # gotta call it manually now because otherwise
dist2 = obj.dist()    # this call would use the features computed before G1 changed

sdmccabe commented 5 years ago

Your edit is pretty close to what I had in mind. I envisioned G1 and G2 being passed to featurize to make this updating process more explicit, but something along these lines should be good.

leotrs commented 5 years ago

I guess it could support both. calling featurize() would recompute the features of the stored G1, G2. Calling featurize(G3, G4) would first forget about G1, G2, and compute the features of G3, G4. This latter call preserves the parameters passed during construction.

This makes sense to me. Does it make sense to everyone? Is it too complicated?
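This featurize()/dist() workflow, including replacing the stored graphs, could be sketched roughly as follows (all names hypothetical; the degree-sequence feature is just a toy):

```python
import networkx as nx

class LazyDistanceSketch:
    def __init__(self, G1, G2, **parameters):
        self.parameters = parameters  # preserved across featurize() calls
        self._graphs = (G1, G2)
        self.features = None

    def featurize(self, G1=None, G2=None):
        # featurize(G3, G4) forgets the stored graphs and uses the new pair
        if G1 is not None and G2 is not None:
            self._graphs = (G1, G2)
        self.features = tuple(
            sorted(d for _, d in G.degree()) for G in self._graphs
        )
        return self

    def dist(self):
        if self.features is None:  # dist() calls featurize() under the hood
            self.featurize()
        f1, f2 = self.features
        # toy comparison; assumes equal-sized graphs
        return sum(abs(a - b) for a, b in zip(f1, f2))
```

The user never has to call featurize() explicitly, but can call it to recompute after mutating a graph or to swap in a new pair.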

sdmccabe commented 5 years ago

How would graph edit distances (#227) fit into this framework?

leotrs commented 5 years ago

Not sure if it should. For example, the metrics here are also a head-scratcher. Most of what we have implemented right now definitely follows the 2-step model of featurization and comparison. The ones mentioned don't.

One option is to have one BaseDistance with two children FeatureComparisonDistance and DirectComparisonDistance. Most of what we have right now, and this entire issue would go into FeatureComparisonDistance. Edit distances and others would inherit from DirectComparisonDistance instead. I think some version of this is the "correct design" answer.

Another option, the "hacky design" option, is to just have everything inherit from one BaseDistance, which implements the 2-step method, and shoehorn everything into it. Both edit distances and Stratis' distances use the adjacency matrix, so we can put that into self.features. They would just need a custom dist method.
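The "correct design" option might be sketched like this. The class names are the proposals from this thread, not existing netrd classes, and MeanDegreeDistance is just a toy to show the split:

```python
import networkx as nx

class BaseDistance:
    def dist(self, G1, G2):
        raise NotImplementedError

class FeatureComparisonDistance(BaseDistance):
    """Two-step distances: featurize each graph, then compare the features."""

    def featurize(self, G):
        raise NotImplementedError

    def compare(self, f1, f2):
        raise NotImplementedError

    def dist(self, G1, G2):
        # the base class owns the 2-step pipeline
        return self.compare(self.featurize(G1), self.featurize(G2))

class DirectComparisonDistance(BaseDistance):
    """Distances defined directly on the pair of graphs, e.g. edit distances."""

class MeanDegreeDistance(FeatureComparisonDistance):
    """Toy example: the feature is a scalar, the comparison is |f1 - f2|."""

    def featurize(self, G):
        degrees = [d for _, d in G.degree()]
        return sum(degrees) / len(degrees)

    def compare(self, f1, f2):
        return abs(f1 - f2)
```

A concrete distance then only overrides featurize() and compare(), and tests can target each independently.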

sdmccabe commented 5 years ago

I like the "correct design" answer; I think, done well, it shouldn't add complexity for users, only for future contributors, and I'm fine with that. A lot of our initial design decisions were made for easy collaboration (during the collabathon), but now that we've got a good number of measures added, we can slow down development (and have) and focus more on getting a good design.

On the "hacky design", this is close to how I had been thinking about Jaccard and Hamming: featurize() just makes self.features the edge sets, and then Jaccard/Hamming themselves are called in dist().

leotrs commented 5 years ago

I say we leave BaseDistance (almost) as it is, create a FeatureComparisonDistance and start putting everything under it, and leave a placeholder for DirectComparison (I don't like that name).

This is of course a huge refactoring. I think the first step is to have a list of which methods can be naturally put into the 2-step scenario.

sdmccabe commented 5 years ago

A first pass, a couple are probably wrong:

Feature-based:

Not feature-based, or awkwardly feature-based:

leotrs commented 5 years ago

The case of H-I-M is interesting. The comparison method is not really a metric, but a linear combination of many. How does the omnibus method handle this?

Also, what if there are conditional features? Also in the case of H-I-M, the computation of one of the features is optional (because it takes a long time to compute). So on the one hand we need featurize() to accept arguments, while on the other hand, some custom dist() calls will do something different depending on what was stored in self.features by the last call to featurize().

I guess all that is already trivially handled by our hypothetical design? I'm just thinking out loud here.

sdmccabe commented 5 years ago

I think at this point we should just make a draft PR and figure out what breaks.

I'm interested in taking a crack at it. I'm swamped with APSA and making teaching materials for the rest of August, but working on netrd is a way to procrastinate from that so I could probably make a decent dent in it. I don't know what your schedule is like but if you also want to start, feel free.

Either way, whoever does should consider allowing edits from maintainers on the draft PR so that we could go back and forth on it.

leotrs commented 5 years ago

I won't have time to start on this until September, so please feel free to go for it if you want.