igraph / igraph

Library for the analysis of networks
https://igraph.org
GNU General Public License v2.0

Serialization/deserialization handlers for IGRAPH_ATTRIBUTE_OBJECT? #1317

Open dwintergruen opened 5 years ago

dwintergruen commented 5 years ago

I have an edge attribute where some of the values were integers and some were floats.

This attribute is not saved in the GraphML file.

In principle, I think this is ok, but I would appreciate a warning.

ntamas commented 5 years ago

Thanks for the heads up! I won't have time for this in the foreseeable future, but I have added labels to denote that I am happy to accept PRs for fixing this.

ntamas commented 4 years ago

Cannot reproduce; this is what I've tried:

from igraph import Graph
g = Graph(3)
g.vs["value"] = [4, 1.7, None]
g.write_graphml("test.graphml")

and this is what I've got:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<!-- Created by igraph -->
  <key id="v_value" for="node" attr.name="value" attr.type="double"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0">
      <data key="v_value">4</data>
    </node>
    <node id="n1">
      <data key="v_value">1.7</data>
    </node>
    <node id="n2">
    </node>
  </graph>
</graphml>

The vertex attribute is clearly there. It even works if I use a NumPy float in one of the slots. The only way I can reproduce the bug is if I mix strings and numbers within the same vertex or edge attribute. When that happens, the Python interface reports to the C core that the attribute is of type IGRAPH_ATTRIBUTE_PY_OBJECT, which of course the C interface cannot save into a GraphML file as it knows nothing about Python objects.

We can add a check in the C layer - whenever the C layer tries to serialize a list of attributes with type IGRAPH_ATTRIBUTE_PY_OBJECT or IGRAPH_ATTRIBUTE_R_OBJECT, it could print a warning via the standard warning handler. However, this needs modifications on the C side so I'm transferring the issue there.
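A minimal sketch of the proposed check, written in Python for readability rather than in C. The names `infer_attribute_type` and `serializable_attributes` are hypothetical and are not part of igraph, and the inference rule (numbers plus None is numeric, all strings is string, anything else is an object) is an assumption modelled on the behaviour described above.

```python
import warnings
from numbers import Number


def infer_attribute_type(values):
    """Classify a list of attribute values (hypothetical helper, not igraph API)."""
    present = [v for v in values if v is not None]
    if not present:
        # Assumption: an attribute with no usable values is treated as an object.
        return "object"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    # bool is a subclass of int, so exclude it explicitly from the numeric case.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, str) for v in present):
        return "string"
    return "object"


def serializable_attributes(attributes):
    """Return the attributes a writer could store and warn about the rest."""
    kept = {}
    for name, values in attributes.items():
        if infer_attribute_type(values) == "object":
            warnings.warn(f"attribute {name!r} has an object type and was skipped")
            continue
        kept[name] = values
    return kept
```

With this rule, `[4, 1.7, None]` is kept as numeric, while `[1, "a", 2.5]` triggers a warning instead of disappearing silently.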


On a somewhat unrelated note, I think it was a mistake in the past to single out Python and R objects in the C core in the attribute type enum. We should have simply called these IGRAPH_ATTRIBUTE_OBJECT, and we could have allowed a higher-level interface to register serialization and unserialization handlers for objects of the host language so these could still be saved in a GraphML file as strings. I leave this for consideration in igraph 0.9.0.
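To make the handler idea concrete, here is a rough sketch of what a registry of serialization and deserialization handlers might look like on the high-level side. Everything here (`register_handlers`, `to_string`, `from_string`, the `tag:payload` layout) is hypothetical and only illustrates the shape of such an interface; it is not an existing igraph API.

```python
from fractions import Fraction

# Hypothetical registry: tag -> (type, serializer, deserializer).
_HANDLERS = {}


def register_handlers(tag, cls, serialize, deserialize):
    """Register string conversion functions for one host-language type."""
    _HANDLERS[tag] = (cls, serialize, deserialize)


def to_string(value):
    """Return 'tag:payload' using the first registered type that matches."""
    for tag, (cls, serialize, _) in _HANDLERS.items():
        if isinstance(value, cls):
            return f"{tag}:{serialize(value)}"
    raise TypeError(f"no serializer registered for {type(value).__name__}")


def from_string(text):
    """Rebuild the original object from a 'tag:payload' string."""
    tag, _, payload = text.partition(":")
    _, _, deserialize = _HANDLERS[tag]
    return deserialize(payload)


# Example registration: Fraction round-trips through its string form.
register_handlers("fraction", Fraction, str, Fraction)

print(to_string(Fraction(3, 4)))    # fraction:3/4
print(from_string("fraction:3/4"))  # 3/4
```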

ntamas commented 4 years ago

We could fix the original issue in 0.8.1, though (i.e. throw a warning for attribute types that the C core cannot serialize).

iosonofabio commented 4 years ago

I have a little time, what do you wanna do with this? If this is critical, let's add the IGRAPH_ATTRIBUTE_PY_OBJECT and IGRAPH_ATTRIBUTE_R_OBJECT checks in the C serialization and then throw a warning.

If this is not critical, let's remove the 0.8.3 tag and slap a 0.9 tag on it, and work towards the IGRAPH_ATTRIBUTE_OBJECT solution which seems more viable.

ntamas commented 4 years ago

Agree, let's do this in 0.9.

szhorvat commented 2 years ago

Renamed this to reflect the actual discussion.

Personally I'm not sure that a plain string serialization / deserialization interface is sufficient. If we wanted to do that, can't the attribute handler just pretend that custom attributes are of a type IGRAPH_ATTRIBUTE_STRING? Doesn't the R interface do something like this already? Then the current foreign format readers/writers would just work, with no changes needed in the C core.

Several formats, such as GraphML, GML, JSON are capable of representing hierarchical data. If we want proper support for them, the serialization / deserialization interface would need to work with some sort of hierarchical representation instead of just strings.

Or perhaps the interface would generate strings that would be inlined into the files directly, and serializers/deserializers would be customized on the high-level language side for each format? But that would be far too much trouble for parsers; it's not realistic.

In the end, I think there is no point in making it possible to write in a GraphML/GML format-variant that can only be read back by igraph itself. There are already solutions for storing data in such a system-specific manner (pickles in Python, and the standard export format of R, whatever that's called). So if we do choose to do this, it should be done in a way that allows interchange with other systems (not just reading back to igraph). That requires support for some sort of hierarchical data representation.

It looks to me like implementing this feature will be API-breaking, but we won't have the capacity to do it well before 1.0, and I'd rather not do it badly. I expect this for 2.0 at the earliest.

ntamas commented 2 years ago

> If we wanted to do that, can't the attribute handler just pretend that custom attributes are of a type IGRAPH_ATTRIBUTE_STRING?

No, it can't, at least not without surprising the user big time. Consider the following:

from igraph import Graph
g = Graph([(0, 1), (1, 2), (2, 3)])
g.es["weight"] = [4, 5, None]  # one value per edge; the last weight is missing

In Python, the type of the weight attribute is currently reported to the C core as IGRAPH_ATTRIBUTE_OBJECT; it can't be numeric because we have the None instance there. Reporting it as IGRAPH_ATTRIBUTE_STRING would mean that numeric weights are silently converted to strings when saving the graph as GraphML. Right now at least we print a warning that the attribute type cannot be serialized.
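The silent conversion can be illustrated without igraph at all. This is only a sketch of the effect; writing None as an empty string is an assumption made for the example.

```python
weights = [4, 5, None]

# What a writer would emit if the attribute were reported as a string type.
# Assumption: None has no string form, so it is written as an empty string.
written = ["" if w is None else str(w) for w in weights]

# A reader sees a string attribute and has no reason to convert anything back.
read_back = list(written)

print(read_back)             # ['4', '5', '']
print(read_back == weights)  # False
```

The numbers come back as strings and the missing value is indistinguishable from an empty string, which is the surprise described above.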

> Several formats, such as GraphML, GML, JSON are capable of representing hierarchical data. If we want proper support for them, the serialization / deserialization interface would need to work with some sort of hierarchical representation instead of just strings.

I don't think that's possible to implement without the serializer having to know about what sort of format it is serializing into. The way it would probably work if we started implementing this is that higher-level interfaces would be given a function that they can call to register serializers / deserializers for arbitrary object types; the C core would then call these functions when it encounters an attribute of object type, and request the serializer / deserializer to provide a representation for a given value. That's fine, but that representation would probably look different in GraphML, GML, JSON and so on, so in the end the C core would need to tell the higher-level interface "hey, here's this object, I want to put it in a GraphML file, what shall I write there?". At that point it's going to be a mess, especially if the object being written has some kind of standard representation in GraphML but it needs a namespace declaration at the top of the file, which has already been written.

The best thing I can come up with to keep serializers format-independent is if the serializer can only produce a string representation that somehow contains all the information needed to reconstruct the original object later, but in a language-specific way. Pickles in Python are a good example of this. Yes, it's true, at that point the written representation will not be independent of the host language, but there might still be value in it; you could still read the graph with other tools as long as you don't need the values of these attributes, or you could ask igraph to produce the GraphML file and then you could post-process it to replace the serialized representation with whatever XML fragments you want to put there, using another script written in the same high-level language.
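A small sketch of such a language-specific string representation, using pickle plus Base64 so that the result is plain ASCII text that fits in any text-based file format. The function names `serialize` and `deserialize` are made up for the example. Keep in mind that unpickling data from an untrusted file can execute arbitrary code.

```python
import base64
import pickle


def serialize(value):
    """Turn any picklable Python object into an ASCII string."""
    return base64.b64encode(pickle.dumps(value)).decode("ascii")


def deserialize(text):
    """Rebuild the object from a string produced by serialize()."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


original = {"weights": [4, 5, None], "label": ("a", 1.5)}
text = serialize(original)

print(text.isascii())                 # True
print(deserialize(text) == original)  # True
```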

Another alternative that I could think of is that it would be the user's responsibility to provide serializers for custom data types, and this would become part of the signature of the GraphML / GML writer function (i.e. there would be an argument where the user can supply a serializer function). This is what Python's json module does; it does not attempt to serialize arbitrary objects that cannot be represented in JSON but simply relies on user-provided callbacks that must return other objects that are representable in JSON.
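For reference, this is how the callback approach looks with Python's `json` module: `json.dumps` accepts a `default` function for objects it cannot encode, and `json.loads` accepts an `object_hook` for the reverse direction. The `{"__date__": ...}` wrapper is just a convention chosen for this example.

```python
import json
from datetime import date


def encode_custom(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, date):
        return {"__date__": obj.isoformat()}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_custom(dct):
    """Called by json.loads for every decoded JSON object."""
    if "__date__" in dct:
        return date.fromisoformat(dct["__date__"])
    return dct


attrs = {"weight": 4, "created": date(2020, 1, 31)}
text = json.dumps(attrs, default=encode_custom)

print(text)  # {"weight": 4, "created": {"__date__": "2020-01-31"}}
print(json.loads(text, object_hook=decode_custom) == attrs)  # True
```

A GraphML or GML writer taking a similar user-supplied callback would leave the decision about how to represent custom objects entirely to the caller.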