ilmoeuro / snakeyaml

Automatically exported from code.google.com/p/snakeyaml

Stream-like processing of YAML documents #32

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, reading and writing a document in a stream-like fashion only works on the event layer. Nodes and objects are processed one document at a time. This is a problem for me, as I have huge documents which I do not want to hold in memory all at once.

I could use the events, but then I have to do all the remaining work by myself.

YAML has the nice property that a partially read document has a valid
structure. Maybe this can be exploited.

Original issue reported on code.google.com by smurn....@gmail.com on 13 Nov 2009 at 5:05

GoogleCodeExporter commented 9 years ago
Due to anchors and aliases, the whole document must be constructed at once.

Who shall decide how to split the document? SnakeYAML or the user?

Can you please provide an example?

Original comment by aso...@gmail.com on 13 Nov 2009 at 5:41

GoogleCodeExporter commented 9 years ago
> Due to anchors and aliases, the whole document must be constructed at once.

No. The spec guarantees that anchors come before the aliases that refer to them. The implementation would need to keep track of the anchors it has seen so far and keep a reference to the corresponding object.
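
For example, an alias can only appear after the anchor it refers to:

records:
  - &template {retries: 3, timeout: 10}
  - *template   # legal: the anchor was already seen earlier in the stream

A minimal sketch of the bookkeeping this would require (hypothetical helper, not part of SnakeYAML's API):

import java.util.HashMap;
import java.util.Map;

class AnchorTable {
    // anchor name -> object already constructed for the anchored node;
    // only anchored nodes need to stay in memory, everything else can
    // be discarded once processed
    private final Map<String, Object> seen = new HashMap<String, Object>();

    void register(String anchor, Object constructed) {
        seen.put(anchor, constructed);
    }

    Object resolveAlias(String anchor) {
        // safe lookup: the spec guarantees the anchor appeared earlier
        return seen.get(anchor);
    }
}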

> Who shall decide how to split the document? SnakeYAML or the user?

What do you mean by split?

> Can you please provide an example?

I don't have a clear idea of what this should look like. I just don't like the idea of loading large documents into memory just to iterate over them.

In my specific situation the root node is a huge sequence (~10^7 elements). The elements of the sequence are small.

There's one idea that would work for this: if it were possible to use events and the composer/constructor together, my code could then look something like this:

parser.getEvent(); // StreamStartEvent
parser.getEvent(); // DocumentStartEvent
parser.getEvent(); // SequenceStartEvent
while (!parser.checkEvent(SequenceEndEvent)) {
    // construct only the next element of the huge sequence
    Object obj = constructor.getData();
    // do something with obj
}
parser.getEvent(); // SequenceEndEvent
parser.getEvent(); // DocumentEndEvent
parser.getEvent(); // StreamEndEvent

Here constructor.getData() would call the composer, which in turn would read all the events from the parser that belong to the next node.

This would work nicely for me, but it's quite specific to my problem. I was hoping for a solution that would be of use to a bigger audience.
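
For comparison, the document-level version of this loop already seems possible with the existing classes (an untested sketch using SnakeYAML 1.x names; the per-element getData() above is the part that doesn't exist, and the file name is hypothetical):

import java.io.FileReader;

import org.yaml.snakeyaml.composer.Composer;
import org.yaml.snakeyaml.constructor.Constructor;
import org.yaml.snakeyaml.parser.ParserImpl;
import org.yaml.snakeyaml.reader.StreamReader;
import org.yaml.snakeyaml.resolver.Resolver;

public class DocStream {
    public static void main(String[] args) throws Exception {
        Constructor constructor = new Constructor();
        constructor.setComposer(new Composer(
                new ParserImpl(new StreamReader(new FileReader("data.yaml"))),
                new Resolver()));
        // today the granularity is one whole document, not one sequence element
        while (constructor.checkData()) {
            Object doc = constructor.getData();
            System.out.println(doc);
        }
    }
}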

Original comment by smurn....@gmail.com on 13 Nov 2009 at 8:05

GoogleCodeExporter commented 9 years ago
> The implementation would need to keep track of the anchors it has seen so far and keep a reference to the corresponding object.

If you create and keep the objects, then you consume the same resources as with the complete construction.

> What do you mean by split?

I expected you wished to cut the YAML document into pieces and create them one by one.

> In my specific situation the root node is a huge sequence (~10^7 elements). The elements of the sequence are small.

Then simply create nodes. It is much simpler than working with events.
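
For what it's worth, a minimal sketch of node-level processing with the existing classes (SnakeYAML 1.x names; untested, file name hypothetical):

import java.io.FileReader;

import org.yaml.snakeyaml.composer.Composer;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.parser.ParserImpl;
import org.yaml.snakeyaml.reader.StreamReader;
import org.yaml.snakeyaml.resolver.Resolver;

public class NodeStream {
    public static void main(String[] args) throws Exception {
        Composer composer = new Composer(
                new ParserImpl(new StreamReader(new FileReader("data.yaml"))),
                new Resolver());
        // the composer yields one root node per YAML document in the stream
        while (composer.checkNode()) {
            Node root = composer.getNode();
            // walk the node tree of this document
            System.out.println(root.getNodeId());
        }
    }
}

Note that this still composes a whole document at a time; within one huge document the root node is built in full.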

Original comment by py4fun@gmail.com on 16 Nov 2009 at 9:55

GoogleCodeExporter commented 9 years ago
> If you create and keep the objects, then you consume the same resources as with the complete construction.

No, most files have very few anchors defined.

> Then simply create nodes. 

I'm not sure if I understand you correctly; do you propose to use multiple documents in the same file? That's what I'm doing currently. The trouble is that I have to reference objects across documents, and anchors only work within a document.
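
(For reference, the multi-document approach I'm using looks roughly like this; loadAll() hands out one constructed document at a time. A sketch, with a hypothetical file name:)

import java.io.FileReader;

import org.yaml.snakeyaml.Yaml;

public class MultiDocStream {
    public static void main(String[] args) throws Exception {
        Yaml yaml = new Yaml();
        // loadAll() yields one constructed object per "---" document
        for (Object doc : yaml.loadAll(new FileReader("records.yaml"))) {
            // process doc; anchors and aliases resolve only inside
            // the document they appear in
            System.out.println(doc);
        }
    }
}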

There's another problem I hadn't noticed so far. If there's an option to read documents in a stream-like way, there should also be a way to write them like a stream. But then every node needs an anchor, because we can never know which node we will see again. And if we give every node an anchor, stream-like reading makes no sense anymore, because we need to keep a reference in memory for everything. This makes processing of huge documents practically impossible anyway.

Also, I had a look at the code. I don't think that what I need could be implemented in SnakeYAML without major refactoring. IMHO it's not worth the trouble.

I suggest closing the issue.

Original comment by smurn....@gmail.com on 22 Nov 2009 at 4:13

GoogleCodeExporter commented 9 years ago

Original comment by py4fun@gmail.com on 23 Nov 2009 at 8:47