support partial representation when document contains unrecognized tags

GoogleCodeExporter commented 9 years ago

Please support loading documents with unrecognized tags, either by default or 
as an option that 
can be turned on when constructing a new Yaml instance. This is similar to 
[Issue 31][1].

Steps to Reproduce
1. generate a YAML document containing application specific tags
2. construct a YAML instance with default constructor only
3. attempt to load the document
4. load will fail with an error

Expected Output:
I would prefer that SnakeYAML load the document as a [partial representation][2] per the 
spec, using each node's kind to resolve it as seq, map, or str as appropriate.
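
As a rough illustration of the fallback requested here, an unrecognized tag could be swapped for the core tag matching the node's kind. This is only a sketch: `FallbackResolver`, `NodeKind`, and `resolve` are hypothetical names and not part of the SnakeYAML API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not SnakeYAML API: map an unrecognized tag to the
// core YAML tag that matches the node's kind.
class FallbackResolver {
    enum NodeKind { SCALAR, SEQUENCE, MAPPING }

    static final String STR = "tag:yaml.org,2002:str";
    static final String SEQ = "tag:yaml.org,2002:seq";
    static final String MAP = "tag:yaml.org,2002:map";

    private final Set<String> knownTags;

    FallbackResolver(String... knownTags) {
        this.knownTags = new HashSet<>(Arrays.asList(knownTags));
    }

    // Known tags pass through untouched; anything else falls back by kind.
    String resolve(String tag, NodeKind kind) {
        if (knownTags.contains(tag)) {
            return tag;
        }
        switch (kind) {
            case SEQUENCE: return SEQ;
            case MAPPING:  return MAP;
            default:       return STR;
        }
    }
}
```

With this shape, a tag the application has registered keeps its meaning, while an application-specific tag on a mapping is treated as a plain map.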

Actual Output:

An error:
    The document is not valid YAML:
    could not determine a constructor for the tag '...'
    ...

What version of the product are you using? On what operating system?
SnakeYAML-1.2, OS X 10.6.1 and CentOS Linux

[1] http://code.google.com/p/snakeyaml/issues/detail?id=31
[2] http://yaml.org/spec/1.1/#partial%20representation/

Original issue reported on code.google.com by toolbea...@gmail.com on 8 Dec 2009 at 1:22

GoogleCodeExporter commented 9 years ago

Additional information on my motivation for requesting this:

I don't always control the source of the YAML. There are cases where I receive 
the YAML document from 
another producer, often in another language or using a different library. This 
document may contain tags 
specific to the producer application that my application doesn't understand or 
cannot (in the case of some 
output from other languages). However, the partial representation is still 
useful to my application.

Currently the only way to consume this document with SnakeYAML is to manually code 
a case for each of these unknown tags. This assumes that I have some control over 
the producing application, e.g. that I know when a new tag is introduced. Where the 
producing application is out of my control, the introduction of a new tag breaks my 
consuming application until I add code to handle the new tag.

Not supporting partial representation severely hampers the utility of SnakeYAML 
for doing interop with other 
systems and languages.

Original comment by toolbea...@gmail.com on 8 Dec 2009 at 1:34


GoogleCodeExporter commented 9 years ago

Excuse me, since I do not see the difference with issue 31, I will just repeat 
the
arguments.

The spec says:
A complete representation is required in order to construct native data 
structures

It clearly indicates that Java objects (native data structures) cannot be 
created for
partial representation.
Please note that partial representation is supported in SnakeYAML - use low 
level API
to produce nodes and then create Java objects on your own.

The logic "unknown tag" -> "str" is application specific. If you wish to 
implement it
you can have a look at BaseConstructor.getConstructor(Node) to find a solution 
which
works for you.

If you do not have control over the incoming documents, what happens when there 
is a global tag with a class name and you do not have this class in your classpath?

Original comment by aso...@gmail.com on 8 Dec 2009 at 12:26


GoogleCodeExporter commented 9 years ago

@asomov, where I work, it is considered impolite to reopen closed issues, hence 
I opened this issue. If it is 
more appropriate to ask that issue 31 be reopened, please let me know and I 
will move my discussion there.

"It clearly indicates that Java objects (native data structures) cannot be 
created for
partial representation."

By this do you mean Java objects other than List, Map, and String?

"The logic "unknown tag" -> "str" is application specific."

Are you interpreting the spec as saying all unknown tags must be assumed to be 
"str"? If so, I believe that is 
incorrect. The spec says, "In such a case, the YAML processor may compose a 
partial representation, based 
partial representation, based 
on each node’s kind and allowing for non-specific tags." I interpret that to 
mean that SnakeYAML can 
compose a representation consisting of List, Map, and String objects. Though 
not as useful as a complete 
representation, such a representation still has utility.

"Please note that partial representation is supported in SnakeYAML - use low 
level API to produce nodes and 
then create Java objects on your own...If you do not have control over the 
incoming documents, what happens when there is a global tag with a class name 
and you do not have this class in your classpath?"

I may have misunderstood the goals of SnakeYAML. Is it primarily for the 
serialization of native Java object 
graphs from and back into Java?

I assumed one goal was to be a general purpose YAML processor. For such a 
processor, it should be 
convenient to consume YAML produced by other systems and programming languages. 
I made this 
enhancement request after having some difficulty consuming YAML produced by 
Perl's YAML::Syck which 
emits YAML containing tags for native perl objects, as in this example:

    perl -MYAML::Syck -e 'print YAML::Syck::Dump(bless { a => 1 }, "My::Perl::Class")'
    --- !!perl/hash:My::Perl::Class 
    a: 1

In my case, consuming the YAML produced by Perl as a simple data structure of 
sequences, maps, and scalars 
is of sufficient utility that I have no need to implement equivalent model 
objects in Java as exist in my Perl 
application.

I stand by this enhancement request. Having to resort to the low level API is a 
shortcoming for a general 
purpose YAML processor.

That said, for my current work I will look to the low level API as I should be 
able to work around the issue I'm 
currently facing. Thank you for that advice.

Original comment by toolbea...@gmail.com on 8 Dec 2009 at 2:32


GoogleCodeExporter commented 9 years ago

I do not mind opening a new issue each time.

"By this do you mean Java objects other than List, Map, and String?"
I mean anything which extends java.lang.Object

"Are you interpreting the spec as saying all unknown tags must be assumed to be 
"str"?"
No, "unknown tag" -> "str" was given as an example.

"it should be convenient to consume YAML produced by other systems and 
programming 
languages"
I completely agree with this statement.

Look, when you give this YAML document:
--- !!perl/hash:My::Perl::Class 
a: 1
...
to Python, Ruby or Visual Basic, do you expect it to work or to fail?
(I am sure it will fail!)
If Python would fail, why do you expect that Java should work?

I propose the following solution:
1) the PERL parser shall not emit language-specific tags. Instead it should emit
--- !MyClass 
a: 1
...
Then it is easier to consume the document by other parsers

2) take a look here, it might help:
http://code.google.com/p/snakeyaml/source/browse/src/test/java/org/yaml/snakeyaml/ruby/RubyTest.java

3) since this is a second request I have added an example to ignore tags:
http://code.google.com/p/snakeyaml/source/browse/src/test/java/examples/IgnoreTagsExampleTest.java
Please take a look and let me know whether it is close to what you want to 
achieve.

Original comment by py4fun@gmail.com on 8 Dec 2009 at 5:21


GoogleCodeExporter commented 9 years ago

Original comment by py4fun@gmail.com on 10 Dec 2009 at 9:47

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

So here's one workaround I came up with (Groovy JUnit test attached). It 
amounts to this custom Constructor:

    class TagIgnoringConstructor extends Constructor {
        protected Object callConstructor(Node node) {
            switch (node.getNodeId()) {
                case "scalar":
                    node.tag = "tag:yaml.org,2002:str"
                    break;
                case "sequence":
                    node.tag = "tag:yaml.org,2002:seq"
                    break;
                case "mapping":
                    node.tag = "tag:yaml.org,2002:map"
                    break;
            }
            return super.callConstructor(node)
        }
    }
    ...

    Yaml yaml = new Yaml(new Loader(new TagIgnoringConstructor()))
    yaml.load(...)

Is there a more polymorphic way to do this? I'd like to apply the "replace 
switch with polymorphism" 
refactoring, but didn't find anything in the JavaDoc to facilitate that.
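
For illustration only, this is roughly what the "replace switch with polymorphism" refactoring could look like if the node classes themselves supplied their fallback tag. The classes below (`SketchNode` and its subclasses) are hypothetical stand-ins written for this sketch; they are not SnakeYAML's node classes, and a library user cannot add methods to those.

```java
// Hypothetical node model (not SnakeYAML's classes): each kind knows its own
// fallback tag, so the caller needs no switch.
abstract class SketchNode {
    private String tag;

    SketchNode(String tag) {
        this.tag = tag;
    }

    String getTag() {
        return tag;
    }

    // Each subclass supplies the core tag for its kind.
    abstract String defaultTag();

    // Replace whatever tag the document carried with the kind's core tag.
    void resetTagToDefault() {
        tag = defaultTag();
    }
}

class SketchScalar extends SketchNode {
    SketchScalar(String tag) { super(tag); }
    @Override String defaultTag() { return "tag:yaml.org,2002:str"; }
}

class SketchSequence extends SketchNode {
    SketchSequence(String tag) { super(tag); }
    @Override String defaultTag() { return "tag:yaml.org,2002:seq"; }
}

class SketchMapping extends SketchNode {
    SketchMapping(String tag) { super(tag); }
    @Override String defaultTag() { return "tag:yaml.org,2002:map"; }
}
```

A tag-ignoring constructor would then only call `node.resetTagToDefault()` before delegating, with no branching on the node kind.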

Original comment by tim.tay...@eprize.com on 15 Dec 2009 at 11:38


Attachments:

YamlHelperTests.groovy

GoogleCodeExporter commented 9 years ago

"So here's one workaround I came up with (Groovy JUnit test attached)..."

Which I now see is similar to the example referenced in comment #4. In both 
cases, the switch statement is a 
code smell. Is it possible to replace that with polymorphism?

Original comment by tim.tay...@eprize.com on 15 Dec 2009 at 11:44


GoogleCodeExporter commented 9 years ago

I consider the example from comment 4 and my similar one in comment 6 to be 
workarounds for a limitation of 
SnakeYAML. It should be easier to use SnakeYAML as a general purpose YAML 
processor. I shouldn't have to 
subclass a new Constructor to coerce it to load a YAML document as a simple 
data structure of sequences, 
maps, and scalars.

YAML tags are similar to XML's xsi:type attribute. Any XML parser can parse the 
following even if it doesn't 
understand what type "foo" is:

<a xsi:type="foo" value="1"/>

Higher level tools, such as XStream, build on top of general purpose XML 
parsers to provide Java object 
serialization.

Based on the current implementation, SnakeYAML is more like the YAML equivalent 
to XStream. It cannot be 
easily used as a general purpose YAML processor. Here is what I would consider 
easy:

Yaml yaml = new Yaml();
yaml.setIgnoreTags(true);
yaml.load(...);

One line of additional code, instead of several, would allow me to use 
SnakeYAML as a general purpose YAML 
processor the same way I can easily use any XML library in a general purpose 
way.

"I propose the following solution:
1) the PERL parser shall not emit language-specific tags. Instead it should emit
--- !MyClass 
a: 1
...
Then it is easier to consume the document by other parsers"

You assume that I control the code producing the YAML. In my case, I do. But 
it's legitimate to want to use 
SnakeYAML to parse YAML that you have no control over.

I believe this is a valid enhancement request. Using SnakeYAML for general 
purpose YAML processing 
shouldn't take a back seat to using it for native Java object 
serialization/deserialization. General purpose 
processing should at least have equal weight.

Original comment by tim.tay...@eprize.com on 16 Dec 2009 at 12:25


GoogleCodeExporter commented 9 years ago

I think I should blog on this topic if it causes so much misunderstanding.

1)
Can you please give a definition of a "general purpose YAML processor" ?
What I see now is that "general purpose YAML processor" is SnakeYAML with
'yaml.setIgnoreTags(true)' implemented.

2)
>I shouldn't have to subclass a new Constructor to coerce it to load a YAML 
document
>as a simple data structure of sequences, maps, and scalars.

You do not have to !!! You only need to do it when the input is inconsistent or 
you
do not know how to make it consistent.

3)
YAML tags are similar to XML's xsi:type attribute. Any XML parser can parse the
following even if it doesn't understand what type "foo" is <a xsi:type="foo" 
value="1"/>

SnakeYAML can also parse a valid YAML document. Please check the low level 
'parse()'
method.

4)
>You assume that I control the code producing the YAML

I do not. I simply state that YAML producers should be aware that the content 
they
generate can be consumed by different parties

5)
>I believe this is a valid enhancement request

I completely agree. Provided that we understand the request, its implementation 
and
its consequences

6)
>Using SnakeYAML for general purpose YAML processing shouldn't take a back 
seat...

If "general purpose YAML processing" for you is like XML parser please use low 
level
parsing. Like XML parsing it simply provides naked Strings or Lists.

7)
>Here is what I would consider easy:

>Yaml yaml = new Yaml();
>yaml.setIgnoreTags(true);
>yaml.load(...);

I consider this completely unclear.
Which tags do you propose to ignore ? If a tag is perfectly valid should it be
ignored ? Should we also ignore implicit types (123 -> int)?
What happens when we got this (should it be a String or Integer ?):
---
!!int 123
...
Should we raise an error in this case:
---
!!map [1, 2, 3]
...

What are the criteria for a tag to be ignored if this method is introduced?

8)
Let us see an example with XStream.

<person>
  <firstname>Joe</firstname>
  <nonsenseString>aaa</nonsenseString>
  <nonsenseInt>123</nonsenseInt>
</person>

The 'Person' JavaBean does not have nonsenseString and nonsenseInt properties. 
Please note that I do not want to simply parse the XML (which is no problem of
course) but I wish to create a stateful 'Person' instance. Can XStream create 
such an instance?
Do you consider XStream as a "general purpose XML processor" ?

I am afraid you need to explain the request taking into account all the 
consequences.
Now I see the following: I wish to drop any trash to SnakeYAML and it must be 
able to
create a valid Java instance anyway.

Original comment by aso...@gmail.com on 16 Dec 2009 at 9:44


GoogleCodeExporter commented 9 years ago

Are you talking about local tags ? If only local tags are ignored does it solve 
the 
problem ? (global tags and implicit types work as usual)

Original comment by py4fun@gmail.com on 17 Dec 2009 at 8:22


GoogleCodeExporter commented 9 years ago

I'm not explaining myself well or clearly. I will try to summarize my position 
a different way. Then I'll respond 
to comment 9 and comment 10 above.

Summary
=======

I consider these three use cases for YAML of near equal value:

a) Dump native data to YAML.  Load YAML back to equivalent native data 
structure.

b) Dump native data to YAML. Load YAML to structured representation with as 
many native types as possible, but not necessarily all native types.

c) Dump native data to YAML. Load YAML to structured representation made up of 
List, Map, and String

SnakeYAML succeeds at making all three use cases possible. However, only (a) 
can be done with the high level 
API. I consider that a shortcoming.

According to the spec, only (a) is "complete success". Use cases (b) and (c) 
are "failure modes". However, these failure modes still result in a 
representation that can be useful to many applications. I contend that use 
case (b) is as prevalent as (a), if not more prevalent. I can be convinced that 
use case (c) is less prevalent. I see less utility for use case (c) when use 
case (b) is available via the high-level API.

Example: a complete, but non-native representation
--------------------------------------------------

Alice's application has the "Autos" and "Currency" libraries. Her application 
dumps the following: 

    !!autos.Car
    plate: 12-XP-F4
    value: !!Money {amount: 8113.00, currency: USD }

Bob receives the above YAML. His application only has the "Currency" library. 
According to the spec, the YAML 
processor can create a complete representation, but not a native one because 
it's lacking some native types 
(autos.Car). SnakeYAML should have a straightforward way through the high level 
API to load this. An easy 
implementation would be use case (c) and ignore all tag information and do no 
implicit typing. A somewhat 
more useful implementation would be use case (b), to construct those native 
types that are available (value → 
Money) and to also do implicit typing (value.amount → float).
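
To make use case (b) concrete, here is a minimal sketch of how Bob's side might behave. All names (`BestEffortLoader`, `TaggedMap`, `register`, and the stand-in `Money` class) are assumptions made up for this illustration, not SnakeYAML API, and the tags are kept as the literal shorthand strings from the example rather than expanded.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of use case (b), not SnakeYAML API: build a native
// object when a factory is registered for a tag, otherwise keep a plain Map.
class BestEffortLoader {

    // A mapping node reduced to its tag plus its key/value pairs.
    static final class TaggedMap {
        final String tag;
        final Map<String, Object> entries;

        TaggedMap(String tag, Map<String, Object> entries) {
            this.tag = tag;
            this.entries = entries;
        }
    }

    private final Map<String, Function<Map<String, Object>, Object>> factories =
            new HashMap<>();

    // Declare that a native type is available for the given tag.
    void register(String tag, Function<Map<String, Object>, Object> factory) {
        factories.put(tag, factory);
    }

    // Construct children first, then the node itself.
    Object construct(TaggedMap node) {
        Map<String, Object> plain = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : node.entries.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof TaggedMap) {
                value = construct((TaggedMap) value);
            }
            plain.put(entry.getKey(), value);
        }
        Function<Map<String, Object>, Object> factory = factories.get(node.tag);
        // Unavailable tag: fall back to the plain Map instead of failing.
        return factory != null ? factory.apply(plain) : plain;
    }
}

// Stand-in for a type from Bob's "Currency" library.
class Money {
    final double amount;
    final String currency;

    Money(double amount, String currency) {
        this.amount = amount;
        this.currency = currency;
    }
}
```

If Bob registers only a factory for `!!Money`, constructing the `!!autos.Car` node yields a plain Map whose `value` entry is a `Money` instance.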

This is essentially what I understand from this part of the spec:

    In a given processing environment, there need not be an available native type
    corresponding to a given tag. If a node’s tag is unavailable, a YAML processor
    will not be able to construct a native data structure for it. In this case, a
    complete representation may still be composed, and an application may wish to
    use this representation directly.

Again, I understand the low-level API makes this possible (comment 4). It 
should be possible with the high-
level API.

Response to comment 9
=====================

1) "General purpose YAML processor": the high-level API of SnakeYAML is suited 
only to use case (a). I believe 
this is the narrowest possible interpretation of a YAML processor from the 
spec. A general purpose processor 
would support (b) and (c) as conveniently as it supports (a).

2)
> > I shouldn't have to subclass a new Constructor to coerce it to load a YAML 
document
> > as a simple data structure of sequences, maps, and scalars.
> 
> You do not have to !!! You only need to do it when the input is inconsistent 
or you
> do not know how to make it consistent.

Having to subclass a new Constructor to achieve use cases (b) and (c) through 
the low-level API is a 
shortcoming.

3)

> SnakeYAML can also parse a valid YAML document. Please check the low level 
'parse()'
> method.

Having to use low-level parse() for use cases (b) and (c) instead of high-level 
load() is a shortcoming.

4)
> > You assume that I control the code producing the YAML
> 
> I do not. I simply state that YAML producers should be aware that the content 
they
> generate can be consumed by different parties

I agree with you that they *should*. But the reality is that (some, many, 
most?) won't. The downstream effect 
is that my application must pay the price in terms of complexity to handle this 
YAML when I use SnakeYAML.

Of course, I do the smart thing and create a wrapper around SnakeYAML so the 
complexity occurs once and is 
hidden from the rest of my application (or applications). But that still means 
every individual or organization 
has to write this same wrapper. Based on my contention that use case (b) is 
prevalent, that's a lot of repeat 
effort. It should be easier to do.

7)
> > Here is what I would consider easy:
> > 
> > Yaml yaml = new Yaml();
> > yaml.setIgnoreTags(true);
> > yaml.load(...);
> 
> I consider this completely unclear.

Agreed. That was a bad proposal on my part. What I intended was more something 
like 
`yaml.setIgnoreUnavailableTags(true)` or `setFullyNative(false)`.

But my point wasn't to propose a specific method name, or that it should even 
be a method on Yaml. Instead I 
was trying to contrast several lines of code subclassing Constructor with a 
one-liner.

8)
> Let us see an example with XStream.
> 
> <person>
>   <firstname>Joe</firstname>
>   <nonsenseString>aaa</nonsenseString>
>   <nonsenseInt>123</nonsenseInt>
> </person>
> 
> The 'Person' JavaBean does not have nonsenseString and nonsenseInt properties. 
> Please note that I do not want to simply parse the XML (which is no problem of
> course) but I wish to create a stateful 'Person' instance. Can XStream create 
> such an instance?

I agree that should fail and the YAML/SnakeYAML equivalent should also fail. 
Extending my example up top:

    !!autos.Car
    plate: 12-XP-F4
    value: !!Money {amount: 8113.00, currency: USD, garbage: "boom" }

Bob's application, which has the "Currency" library, would fail to construct a 
Money instance for `value` when 
it attempted to set the property `garbage`.

However, if Bob removed all dependencies on the Currency library from his 
application, and then removed the 
currency library, then something similar to `new 
Yaml().setIgnoreUnavailableTags(true).load(...)` would work. 
Instead of a Money instance, `value` would just be a Map.

> Do you consider XStream as a "general purpose XML processor" ?

No. I consider it a narrow tool that does use case (a) (except for XML, not 
YAML). But now I think we're getting 
to the crux of our disagreement.

I don't consider XStream's singular focus on use case (a) to be a shortcoming. 
Why then do I judge SnakeYAML differently? Because unlike with XML parsers, 
there isn't an abundance of competing, complete, quality implementations of 
YAML processors; there's SnakeYAML and then there's...SnakeYAML. Yours is the 
only one (for Java) that meets those criteria and is actively maintained.

I believe I understand your position now. High-level SnakeYAML is equivalent to 
XStream. Low-level 
SnakeYAML API is equivalent to an XML parser.

> I am afraid you need to explain the request taking into account all the 
consequences.

I think I misunderstand you. Otherwise, that's an unfair burden to place on 
someone contributing feedback.

I *have* spent a good amount of time reading (and re-reading) the YAML spec to 
make sure my position is 
reasonable and valid. You're saying I must anticipate all of the consequences 
before sharing my idea?

> Now I see the following: I wish to drop any trash to SnakeYAML and it must be 
able to
> create a valid Java instance anyway.

Per above, that's not what I'm asking for.

Response to comment 10
======================

> Are you talking about local tags ? If only local tags are ignored does it 
solve the 
> problem ? (global tags and implicit types work as usual)

For use case (b), implicit tags as well as recognized and available native 
types would work as usual. My 
`setIgnoreTags(true)` in comment 8 was a badly named proposal. Per above, I 
meant something akin to 
`setIgnoreUnavailableTags(true)`.

But no, I don't think a distinction between local and global tags would do it. 
It's possible to have a global tag 
reference an unavailable native type, correct?

Original comment by toolbea...@gmail.com on 13 Jan 2010 at 7:54


GoogleCodeExporter commented 9 years ago

I believe I've made my position clear. If not, I lack the stamina to try and 
explain myself again. What remains should be differences of opinion on which 
use cases the 
high-level API should support. If you disagree with my position, then go ahead 
and close/cancel this enhancement request.

Original comment by toolbea...@gmail.com on 13 Jan 2010 at 6:37


GoogleCodeExporter commented 9 years ago

First of all - thank you very much for your time. I think this issue gives a 
good 
overview for anyone who wants to manage "strange" tags coming from another 
parser.

In general I agree with your proposal to be more flexible. I just do not get 
how to 
resolve minor issues (which become big when you try to implement them).

I see your position better now and I do not want to reply to every statement (I 
will save some disk space for Google :).

Since I do not see a real business case for myself I cannot really implement 
your 
requirement.
I think we can proceed as follows. (And you get my full support for it.)
- create a Mercurial clone (http://code.google.com/p/snakeyaml/source/clones)
- write a test case. It is no problem if it fails; at least we can see what we 
want to achieve in the end
- try to implement the feature. You are free to change _anything_. Just keep in 
mind 
that the existing tests must succeed.
- once we see the new code we can discuss the required changes and the 
consequences

P.S. 
I was trying to introduce an interface which is called when the tag is unknown, 
similar to what is done for the error handler in SAX when parsing XML. 
Unfortunately it became more complicated than I expected and I dropped it.
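
For readers wondering what such a callback might look like, here is a minimal sketch. It is not the interface that was attempted and dropped; `UnknownTagHandler` and `HandlerAwareConstructor` are hypothetical names, and the node content is assumed to be already built as a String, List, or Map before the handler is consulted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical callback, loosely modelled on the idea of SAX's ErrorHandler:
// it is consulted when no factory is registered for a tag.
interface UnknownTagHandler {
    // Return the value to use for the node, or throw to abort loading.
    Object onUnknownTag(String tag, Object plainValue);
}

class HandlerAwareConstructor {
    private final Map<String, Function<Object, Object>> factories = new HashMap<>();
    private final UnknownTagHandler handler;

    HandlerAwareConstructor(UnknownTagHandler handler) {
        this.handler = handler;
    }

    void register(String tag, Function<Object, Object> factory) {
        factories.put(tag, factory);
    }

    // plainValue is the node's content already built as String, List or Map.
    Object construct(String tag, Object plainValue) {
        Function<Object, Object> factory = factories.get(tag);
        if (factory != null) {
            return factory.apply(plainValue);
        }
        return handler.onUnknownTag(tag, plainValue);
    }
}
```

A lenient application could pass `(tag, value) -> value` to keep the plain data, while a strict one could throw from the handler to keep the current fail-fast behaviour.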

Original comment by py4fun@gmail.com on 14 Jan 2010 at 10:21


GoogleCodeExporter commented 9 years ago

"Each comment triggers notification emails. So, please do not post "+1 Me too!".
Instead, click the star icon."

Notification emails are exactly what I want, though I starred it, too.

I also struggled with it for a while and had to change methods to get anywhere. 
A generic mode would be an excellent addition IMO.

Original comment by fred.co...@gmail.com on 30 Sep 2012 at 4:54


ilmoeuro / snakeyaml

support partial representation when document contains unrecognized tags #39