Closed ckaran closed 7 years ago
In the examples you gave, what would you expect sim_ref
and a_ref
to look like in JSON when serialized?
@dtolnay Anchors would look like whatever YAML 1.2 looks like. Quoting the second paragraph under "Status of this Document":
The primary objective of this revision is to bring YAML into compliance with JSON as an official subset. YAML 1.2 is compatible with 1.1 for most practical applications - this is a minor revision. An expected source of incompatibility with prior versions of YAML, especially the syck implementation, is the change in implicit typing rules. We have removed unique implicit typing rules and have updated these rules to align them with JSON's productions. In this version of YAML, boolean values may be serialized as “true” or “false”; the empty scalar as “null”. Unquoted numeric values are a superset of JSON's numeric production. Other changes in the specification were the removal of the Unicode line breaks and production bug fixes. We also define 3 built-in implicit typing rule sets: untyped, strict JSON, and a more flexible YAML rule set that extends JSON typing.
Basically, any JSON parser will be able to parse YAML 1.2, the only difference is that a YAML 1.2 parser will know how to create links, whereas the JSON parser won't.
That said, there will likely need to be additional information within a document to identify that you're working with a RefCell
instead of a Weak
, etc.
Basically, any JSON parser will be able to parse YAML 1.2
This does not match my understanding of JSON and YAML. Could you show what you would expect this to print if we track cyclic object graphs in Serde?
println!("{}", serde_json::to_string(&sim_ref).unwrap());
println!("{}", serde_json::to_string(&a_ref).unwrap());
@dtolnay OK, I've spent the past couple of hours trying to convert my YAML to straight JSON. Then I went back and re-read the spec for YAML 1.2; I think I completely misunderstood what they meant by "The primary objective of this revision is to bring YAML into compliance with JSON as an official subset." As near as I can tell, there is no way to convert the following YAML 1.2 into any form of JSON. My understanding is that they had an official schema to map YAML 1.2 onto JSON, but I can't find any reference to that. So, I apologize, unless serde
develops its own schema, the following can't be converted.
sim_ref: !!Rc
arg: !!RefCell
arg: &id001 !!Simulator
clock: !!Clock
date: !!Duration {nanos: 0, secs: 0}
robots:
- !!Robot
name: A
simulation: !!Rc
arg: !!RefCell
arg: *id001
- !!Robot
name: B
simulation: !!Rc
arg: !!RefCell
arg: *id001
- !!Robot
name: C
simulation: !!Rc
arg: !!RefCell
arg: *id001
a_ref
would be:
a_ref: !!Rc
arg: !!RefCell
arg: &id001 !!Cycle
name: name
self_pointer: !!Option
arg: !!Rc
arg: !!RefCell
arg: *id001
Note that I created the above by using the following python script (tested under python 3.6.3):
import yaml
class Duration(object):
"""docstring for Duration"""
def __init__(self):
self.secs = 0
self.nanos = 0
class Clock(object):
"""docstring for Clock"""
def __init__(self):
self.date = Duration()
class Rc(object):
"""docstring for Rc"""
def __init__(self, arg):
self.arg = arg
class RefCell(object):
"""docstring for RefCell"""
def __init__(self, arg):
self.arg = arg
class Robot(object):
"""docstring for Robot"""
def __init__(self, sim, name):
self.simulation = Rc(RefCell(sim))
self.name = name
class Simulator(object):
"""docstring for Simulator"""
def __init__(self):
self.clock = Clock()
self.robots = list()
def add_robot(self, name):
robot = Robot(self, name)
self.robots.append(robot)
class Option(object):
"""docstring for Option"""
def __init__(self, arg):
self.arg = arg
class Cycle(object):
"""docstring for Cycle"""
def __init__(self, name):
self.name = name
self.self_pointer = Option(Rc(RefCell(self)))
if __name__ == '__main__':
sim = Simulator()
sim.add_robot("A")
sim.add_robot("B")
sim.add_robot("C")
sim_ref = Rc(RefCell(sim))
toplevel = {"sim_ref": sim_ref}
print(yaml.dump(toplevel))
cycle = Cycle("name")
a_ref = Rc(RefCell(cycle))
toplevel = {"a_ref": a_ref}
print(yaml.dump(toplevel))
The import yaml
clause refers to PyYAML 3.12, which you can get with pip install PyYAML
.
Thanks for the clarification. I am closing the issue because I don't think Serde is the right place to experiment with designing a schema to represent cyclic data structures in a format-agnostic way, and because I don't think the overhead of the HashSet suggestion is justifiable for typical usage of Serde. I would be interested to see this prototyped in a YAML-specific library for serializing cyclic data structures to YAML and I would be happy to help review the implementation if anyone decides to try it.
@dtolnay I'm willing to give it a shot but with the following caveats:
Sounds good. To start, I would recommend defining a trait that can convert Self into an enum like:
enum Yaml {
// all the variants copied from serde_yaml::Value, as well as:
Anchor(usize, Box<Yaml>), // &id001
Alias(usize), // *id001
}
Then handwrite some implementations of the trait for various interesting types. I can provide feedback on all of that. After that you could start on the derive
logic and the printer to render one of these enums as YAML text.
@ckaran FWIW I have serialization of acyclic graphs implemented here (serialize and deserialize). This part is fairly readable I think so it might help, note that it does use https://github.com/Marwes/serde_state but the same thing could (mostly) be done with thread-locals and normal serde.
I actually also have one place where I handle cyclic graphs but it is very specialized for my use case so it might not be very useful https://github.com/Marwes/gluon/blob/3dd9f49b57a81d4bcfa7ca81ad205bef5fa3bbad/vm/src/serialization.rs#L458-L582
@dtolnay @Marwes this is @ckaran, but on my home account. I've started my own marshaling library at https://github.com/cfkaran2/marshal. There is absolutely nothing there at the moment, but I plan on filling it in with what I'm planning on doing shortly.
I'd also like to add both of you as collaborators so you have push access to the repository. I expect to basically fall off the face of the earth for a while once my wife goes into labor, but I'd like to make sure that if you want to push your changes, you can. Are you interested in becoming collaborators?
Yes, if you are planning to tackle marshalling of cyclic data structures to YAML then I would like to be involved.
The problem
As I understand it,
serde
currently converts all object graphs into trees; that is, if you have aRc
(orArc
, orWeak
, etc.) in astruct
, then the serialization process will try to follow all of the pointers and make copies of them to the output (see #194). This is a serious problem that was 'solved' by refusing to serialize anything that has interior mutability (see #179); the issue is that there are always going to be cases where you can't break a cycle in an object graph, no matter how much restructuring you do, and in some cases, you really, really need interior mutability or your code won't work. A toy example is below:Another toy example, this time involving cycles:
In the first example, when the global clock is advanced all of the robots will be able to learn about it lazily; without interior mutability, the only way to distribute the change is to walk through the vector of robots and update their individual clocks. That is clearly suboptimal. However, if serde doesn't handle object graphs correctly, then it makes it impossible to take snapshots of running simulations; the clock will not be deserialized correctly.
The second example shows that while it is difficult to create self cycles in rust, it is possible. I don't claim that this is useful, only that it is possible, and it would ideal if
serde
could transparently handle it.A possible solution
The simplest solution that I can think of involves depth first search,
std::collections::HashSet
, and the following function:The algorithm is simple; whenever a new object is being serialized, find its address via
address_of()
and put that into theHashSet
. Continue serializing the rest of the object. If you encounter some kind of pointer, check to see if the address of the object it points to is in theHashSet
; if it is, then use the address as link. If not, recurse into the pointer. Since this is depth-first search, you are guaranteed that it will take linear time in the number of objects to terminate, and you will never visit the same object more than once.Caveats and possible issues
serde
has provisions for this. Alternatively, theHashSet
could be a global with astd::sync::Mutex
protecting it. The only issue with this is thatserde
then becomes a bottleneck for all serialization, which would be a shame. (You need the mutex to ensure that theHashSet
is empty before and after a serialization run is complete, or you'll end up missing objects in the graph).serde
to make copies. I can't think of too many cases where this would happen though.std::rc::Rc
and similar pointer types; despite a fair amount of trying, I can't figure out ifaddress_of()
can be made to return the address of the object thatstd::rc::Rc
is pointing to; in my tests, I haven't been able to make that happen. However, I've only been programming in rust for about 5 weeks now, so it's likely that someone that really knows what they're doing could provide a better solution that really works.In C,
&stuff == &inner
. If you start serializing atinner
, then when it's time to check theHashSet
to see if you've serializedstuff
, you will find the address in there and fail to serializestuff
. On the other hand, this could be viewed as a programmer error, and could just be ignored.Any thoughts on this would be appreciated; I know that rust's claim to fame is 'fearless concurrency', but that isn't what attracted me to it;
serde
is. Having a trivial way of taking snapshots of my simulations would save me significant effort over having to deal with it in C or other languages.