Opaque storage of igraph objects

krlmlr commented 1 year ago

Before #768.

Currently, an igraph object is a list of length 9 or so in R land. For the future, it would be better to store an external pointer to the C object.

We can use external pointers, but these are not serializable. Proposing the following solution:

A serialized representation with contiguous vectors of vectices and edges in igraph's native format is stored as a raw() vector
In addition, this vector stores an external pointer with the C object

The external pointer will be gone in serialization, but this doesn't matter -- the graph can always be reconstructed from the data in the raw() vector.

In addition, to avoid doubling the memory consumption, we can make the raw() vector "lazy" with ALTREP. This works only for R >= 4.0, but because it is only an optimization, we retain compatibility with R versions that don't have it yet.

szhorvat commented 1 year ago

I don't fully understand the details, so forgive me if some of the following comments miss the mark.

A serialized representation with contiguous vectors of vectices and edges in igraph's native format is stored as a raw() vector

Do you mean storing a bitwise representation of igraph's native graph object?

One problem with that is that it ties the versioning of R/igraph objects to C/igraph, which IMO creates too strong a dependency. Why not store just the minimum about of information that is required to reconstruct the graph? That would be a simple edge list, as returned by igraph_get_edgelist(), plus the number of vertices. It's even fine to store these as R's native doubles, to maintain the stability of the serialization format.

Regarding just using doubles:

If we go with 32-bit mode, doubles can store any integer. If we go with 64-bit mode, they can't. But then R can't handle the full 64-bit range of indices anyway, due to its deep reliance on doubles, so in that case it might be cleaner to restrict graph sizes to the representable range (53 bits IIRC)? We can investigate if this is possible by setting IGRAPH_INTEGER_MAX / IGRAPH_VCOUNT_MAX / IGRAPH_ECOUNT_MAX differently.

Do I understand correctly that it's possible to make it so that the in-memory igraph_t and the serialization rarely need to be synchronized? If so, the above should work.

If you did not mean a bitwise concatenated copy of the igraph_t, but instead a bitwise copy of what igraph_get_edgelist() returns, one problem is:

The R/igraph format is versioned so that graph data saved with old versions can be used in new version (by automatically translating the format). Raw bits will not be easy to translate to new versions. It's better to use a number format that R understands natively.

There's the issue of the cache. It would ideally be maintained even when the igraph object is copied internally. How are you planning to handle this?

Maintaining the cache is IMO not necessary when saving the graph to a file.

Note that the cache is designed so that it will be possible to extend it in the future without breaking ABI. Thus it probably should not be part of the serialization that is written to files, so changes in the C core wouldn't force a change of the serialization format in R.

There's also the issue of attributes, which do need to be stored as native R objects.

This works only for R >= 4.0, but because it is only an optimization, we retain compatibility with R versions that don't have it yet.

From past experience, I would suggest not creating additional constraints for ourselves by trying to maintain compatibility with outdated versions. People who who use old R versions can use old igraph versions as well.

igraph has accumulated several non-ideal design decisions over the years. We are currently in a process of modernization, and trying to create a new foundation on top of which we can build in the next several years. What we need is as few constraints as possible so we can keep moving quickly, and reach a new stable design as soon as possible. We need to take a long view here, and think about what needs to be done now so users can be well supported within the entire timespan of the next 10 years, not just today. In a few years, R 3.x will be completely irrelevant.

krlmlr commented 1 year ago

Thanks for the detailed response.

Indeed, it looks like we can go with storing the edgelist, directedness, the number of nodes, and the list of attributes. If we keep storing the edgelist as doubles, this means that we can drop the following components of the internal structure: igraph_t_idx_oi, igraph_t_idx_ii, igraph_t_idx_os, igraph_t_idx_is . Agree with not storing the cache in R land.

Like today, this structure will be created from the in-memory igraph_t every time a new graph is returned from the C core. The difference is that the edgelist doesn't ever have to be materialized in R land, only in the (rather rare) case of serialization of the R object. The materialization can happen via ALTREP on R >= 4.0. The in-memory igraph_t (and all its caches and auxiliary information) is held by the R data structure and only released when it's garbage-collected, and treated as completely opaque by the R code. It's also not available after deserializing the R object, but then it will be recreated automatically from the edgelist, and stored in a mutable part of the data structure (perhaps the igraph_t_idx_env component).

The only downside I see with this approach is that we need four ALTREP vectors instead of just one, but we can handle that.

Let's experiment with IGRAPH_INTEGER_MAX / IGRAPH_VCOUNT_MAX / IGRAPH_ECOUNT_MAX .

Currently, imposing a minimum R version makes it more difficult than necessary for the users, but the case of R 3.6 is debatable indeed. We can start with ALTREP only, requiring R >= 4.0, and add support for R < 4.0 if requested. This would be a fairly localized fallback that's easy to remove later.

szhorvat commented 1 year ago

You asked me in chat to continue commenting here, so here it goes.

I think thinking about the tradeoffs of using 32 or 64-bit integers on 64-bit systems (on 32-bit systems of course we use 32-bit).

There is one (major) advantage to 64-bit: it is possible to work with larger graphs. The disadvantage is that there will be more copying, and that memory use will increase. Let's examine what we'd need to copy that we don't have to with 0.9.

Going from C to R, everything is copied in the current version. Question: @krlmlr, is it possible, in principle, to implement a copy-on-write approach with ALTREP? I.e., is it possible to expose an igraph_vector_t as an R vector, and delay copying its data until it actually gets modified?

Going from R to C, we mostly deal with the following type of quantities:

Real vectors, such as weights. These do not need to be copied, and nothing will change for 0.10.
igraph_t. If this is handled as you suggested, it won't matter if we're using 32-bit or 64-bit mode.
Vectors of indices. These need to be copied, regardless of whether they are stored as reals or integers, because we need to convert to zero-based indexing. It doesn't matter if we're using 32-bit or 64-bit mode.

So unless I'm missing something, once the issue with copying igraph_t is solved, there are no more big obstacles to using 64-bit. This is what you suggested as well @krlmlr, but I wanted to have all this written down.

krlmlr commented 1 year ago

Thanks. I don't understand the cow question, let's discuss on Friday.

szhorvat commented 1 year ago

Not sure I can make it on Friday. I am sick and completely lost my voice.

I read up very briefly on ALTREP, and the example they give in the documentation is that it is possible to have multiple implementations for vectors. Ranges like 1:n can be represented compactly by storing their endpoint, but as soon as the vector is modified (e.g. by removing an element or just doing some arithmetic), its internal representation is changed to store all elements explicitly.

Is it possible for us to define a new representation for R vectors, and return results from igraph using this reprsentation? This representation would wrap a read-only igraph_vector_t, avoiding the need to copy the data stored within. As soon as the vector is modified, it would be converted to a standard representation.

Is this possible at all?

krlmlr commented 1 year ago

What use case do you have in mind?

szhorvat commented 1 year ago

The purpose is to delay copying results produced by igraph until absolutely necessary. I am referring to results which come in the form of a vector of doubles. Some of these can be huge, think e.g. of a distance matrix.

Maybe I'm misunderstanding something, and copying would be necessary anyway.

@ntamas is experimenting with something similar for the new Python interface, based on NumPy arrays. Avoiding copying when passing data to igraph is generally easy. But can we avoid copying when taking result from igraph? It seems that NumPy arrays have a special feature where they can wrap an existing read-only in-memory array, and delay copying that data until the array is modified. Can we do the same with ALTREP, can we implement a vector representation that doesn't copy until needed?

krlmlr commented 1 year ago

We might be able to implement deferred/lazy materialization for matrices. I'm not sure it's worth doing for O(V) or O(E) sized vectors.

krlmlr commented 1 year ago

We discussed most issues today, the new issue has a more up-to-date description and plan and points here. Let's keep ALTREP for deferred computation on the back burner.

github-actions[bot] commented 6 months ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

igraph / rigraph

Opaque storage of igraph objects #783