Misleading operator names

stevenganz commented 4 years ago

Wikipedia provides the following on definitions of graph union:

graph union: G1 ∪ G2. There are two definitions. In the most common one, the disjoint union of graphs, the union is assumed to be disjoint. Less commonly (though more consistent with the general definition of union in mathematics) the union of two graphs is defined as the graph (V1 ∪ V2, E1 ∪ E2).

IMO, being consistent with the rest of mathematics is far more important than being consistent with some literature on the topic. I also think that the operation of taking a union of two graphs (derived from set unions of the nodes and edges) is very useful. And indeed, we provide two operators, union and disjoint_union, which would lead one to believe that the first is that operation, and that the second is some form of disjoint union. Unfortunately, they both perform disjoint unions. 'union' does a set union of nodes and edges, but throws an error if the sets are not disjoint, or provides an ability to rename common nodes to avoid conflict. 'disjoint_union' throws out the node names altogether and renumbers the combined set of nodes.

We provide the functionality of graph union (as derived from set union of the nodes and edges) in an operator that is unfortunately called 'compose'. There are various definitions of compose, including at least lexicographic and relational, of which I find relational to be more useful -- I don't believe we implement either.

This current state is a problem because people are likely to use a method called 'union' (especially when provided as an alternative to 'disjoint_union'), expecting it to operate in a manner different than it actually does. They will likely be disconcerted to find that they have to go to to another method with an unrelated name to get the functionality they want.

It is also a problem because those wanting graph composition are likely to end up with graph union.

For another instance of confusion this causes, see Issue #3422.

For those tempted to simply update the (formal) documentation, I feel strongly that variable names are by far the most important form of documentation.

My suggestion would be to remove the exception raise from the union operator, and to remove the compose operator entirely (so that it can later be replaced with something more appropriate). I'm not aware of anything that would be lost be going to the version of union that doesn't raise the exception in place of using the current compose operator (are you?). To highlight potential compatibility issues, at least removing 'compose', and perhaps both changes, could be done as part of a major release.

I can put together a PR for this if you agree.

rossbar commented 4 years ago

In addition to making docs more descriptive, it seems like adding explicit wording about compose in a more descriptive error message (see #3422) would improve the situation.

dschult commented 4 years ago

The union name will likely be confusing either way. If we take out the error, then people expecting the Graph Theory definition will silently get some nodes combined. The current setup means people expecting it to work when node sets are not disjoint at least get a noisy error instead of a silent confusing result.

Perhaps it is best to remove union in favor of the G.update method which simply adds the nodes and edges of one graph to another. That avoids the use of this problematic term. The name update comes from Python dict so is hopefully clear. We could also point to G.update in both disjoint_union as well as the docs for operators.

stevenganz commented 4 years ago

Well, the other side of the argument is that people expecting union to work when node sets are not disjoint will get an error every time, but people expecting it to give an error in that situation would be unlikely to notice it working (unless they are catching the exception). Of course, the latter would lose that backstop when their code is already broken. I still think having a 'disjoint_union' operator is enough of a hint that 'union' is not necessarily disjoint.

As for dropping union for update, compose currently does even more work than one constructor call and two calls to update, although I'm not sure why -- the two updates and the rest of the processing seem redundant.

Thinking about the issue some more, data is inseparable from all this. Disjoint union is easier than nondisjoint union because you're guaranteed not to have any data clash. compose isn't a very good union, because it isn't commutative (data is always taken from the right argument). intersection and symmetric_difference are made commutative by dropping all data. So simply removing the error from union would be incorrect without also changing how data is handled (scope-creep!).

More data-friendly operators would keep data, at a minimum propagating down to the dict and requiring equality for each common attribute, for nodes and edges of the same name (compose does the former). Alternatively, the operators could propagate down to each attribute. That would require allowing the user to specify attribute operator functions, perhaps as arguments to the graph operators. Operators could take a data parameter to distinguish among supported behaviors.

dschult commented 4 years ago

I think you mixed up R.graph.update(G.graph) which updates the R.graph "graph attributes" with R.update(G) which adds/updates the attributes, nodes and edges of R. The compose processing isn't redundant.

It is true that the compose operation handles node and edge attribute collisions in favor of the right (second) argument. But this is well documented. We do NOT treat nodes as equal only if their attributes are equal... To make that behavior work, you should construct your nodes to include all that info... that's one advantage of arbitrary hashable nodes. e.g. instead of node "a" having attributes {"color": "blue", "height": 5} you can make the node ("a", "blue", 5) if you want treating nodes to be equal only if their attributes are equal.

In fact, it gets even messier than our discussion has touched... There are in fact two types of disjoint union (labeled and unlabeled nodes). That's why we have it set up this way. The disjoint_union function treats the two graphs as unlabeled graphs (nodes have no identity). It doesn't matter if the nodes are equal or not. The resulting output is a graph that contains the two input graphs and no edges between them. The nodes are relabeled to make the structure work. The union function treats the two graphs as labeled graphs (nodes have identity). The resulting graph contains the two input graphs and no edges between them AND nodes have the same identity as they do in the original graph.

I suppose we should put much of this discussion in the docs in some coherent way...

But the main point of this issue is that people expect union() to be what we call compose(). The first step should be to add text to the error message whenever people input 2 graphs that share nodes. The error message should probably refer to the compose function as well as the disjoint_union function as well as the G.update method. That may be toomuch for one error message. The doc_strings of these should also probably be updated to refer to each other and explain the difference.

Changing the names of these functions is the next possible action to take, but I'm -1 without a more complete plan for how to proceed.

stevenganz commented 4 years ago

On the processing in compose, indeed the calls to R.graph.update would not be sufficient, but do seem to be superfluous given the later processing. If you have an example where you get different results without them, please let me know. Why not just substitute two calls to R.update for the whole thing?

Here's my current proposal: Rename union as disjoint_union, and fold the existing behavior of disjoint_union in where rename=None (without changing the default). Then, let union be a not-necessarily-disjoint union. union should treat data attributes consistently with other routines. As those other routines now stand, that means dropping data, but other behavior is possible. We could then deprecate compose, which would eventually make that name available for a more appropriate operator. Per my comments above, I don't believe that compose offers much beyond graph update, where biased merging of attributes is desired. Users should be warned to replace union with disjoint_union in their code, and to provide rename=None in calls to disjoint_union.

Justification:

The intention of my last post was that in addition to users often expecting union to be non-disjoint, they are likely to expect union, intersection, and symmetric_difference to be commutative, e.g. (A U B ) == (B U A), with the resulting graphs being equal iff they have the same node and edge names and the same attribute values, at least in the default behavior (trivially true if data is being tossed, and true for disjoint unions). I updated that post to be more precise.

I can't think of an application where tossing the data is the right thing to do. The advantage of tossing data, however, is that the user can merge the data as they like. The disadvantage is that the user doesn't get the benefit of the most likely desired data merge functionality.

If the decision is to toss data where a conflict is possible and let the user replace it, we should do that consistently, including for not-necessarily-disjoint union.

Side comments on merging data:

That's an interesting point about the node and edge names being any hashable values, but it's unlikely to lead to desirable behavior. Using the approach you suggest, taking a union of two graphs with equal node names would create a new node in the result graph for each combination of node name, attribute set, and attribute values, rather than keeping one node per node name and having union operator (or the user) merge the data in some way.

It would be genuinely useful to have the graph operators act on data and not just toss it. But when one gets into merging data in the operators, things get complicated fast, which is a rationale for not dealing with it. In case, now or in the future, there's stomach for taking that on, I'll share some of the possibilities for a general rule. Basically, likely options are to either require equality, always merge, or propagate the operator, for either of the attribute dict or the attribute values of common attributes. Some of the combinations make more sense than others, but I can't say I'm confident at this point of the best way to go. For union, it seems likely that desired behavior would be to slurp up all attributes and attribute value data from either graph. By merging, I mean performing a union on set or dictionary data (recurring on the values). For other cases (such as scalar and list values) we would require equality (although it would make sense to merge sorted lists). Alternatively and more generally, as I mentioned earlier, there's a design space around having the user specify a merge function for each attribute. Because of the lack of clarity, I suggest sticking with the proposal above for now, although I'd be open to considering arguments for any particular refinement along these lines.

dschult commented 4 years ago

I'm not convinced there is anything we need to do here other than add better documentation to union and probably other operators.

The current situation makes people who try union when they want compose run into an exception... which is a good thing to have happen. Especially if we had documentation, or a better exception message.

The word union is clearly defined in Graph Theory and we follow standard definitions wherever reasonable. The word compose sometimes refers to what we call lexicographic_product, but other than Harary's text, I haven't seen the use of compose in that sense. Certainly users will notice from the docs and the output of compose that they don't have a lexicographic_product if they use compose.

The idea that union and disjoint_union are the same ignores the distinction between unlabeled graphs where node identifiers are not important and labeled graphs where the node labels are important. It is a subtle distinction and would only be hidden if we combined these two functions.

So, I think we just need to add some documentation; especially updating it to account for the G.update method we now have available.

I'm -1 on removing union or combining it with disjoint_union.

stevenganz commented 4 years ago

Dan, your mind is clearly made up on this, but for the record, I need to restate that I believe the appropriate behavior for compose is a relational composition, not lexicographic. Also, I did not propose "removing union".

Are you in favor, then, of simplifying the existing definition of compose?

dschult commented 4 years ago

I actually don't think any of the code needs to be changed. The compose code is very readable and clear and the G.update code has more cruft -- harder to read and now that I look at it, might not retain edge keys for multigraphs. It seems like some rooting around would be needed to make sure it's all good... could be that new tests would show what is needed. All these changes are good things really... but I'd rather spend time on new code than replace working code.

How about we get a relational_composition() function instead of changing compose?

stevenganz commented 4 years ago

That would be useful. Probably relational_compose for consistency.

stevenganz commented 4 years ago

I can take a crack at coding it up, if you like.

dschult commented 4 years ago

It would be great if you can put together a function for relational composition. I'm not sure it fits with the other operators... so if you can't find a good match put it in its own module -- and that could be outside of the operators folder if that works better. Thanks!

stevenganz commented 3 years ago

I finally got a chance to work on this. Just submitted a pull request. I ended up just putting it in with the operators -- that makes the most sense to me. I had to make some assumptions (documented) in working with data attributes and multigraphs. See what you think.

rossbar commented 2 years ago

There's a lot of info in the above discussion so I went ahead and extracted a few of the suggestions that are actionable into separate issues to try to break things up and make it easier to get eyes on them. See the issue links above.

networkx / networkx

Misleading operator names #4208