Closed opoudjis closed 1 month ago
In addition, if you will still process such strings as MappingHash, you must assume a mixed document model for the string, and you must allow tags to be repeated—meaning, you want a list of key-value pairs, and not a hash as you have now:
this <em>is</em> a <em>text</em> you see
does not go to
{ "text" => "this", "em" => "is", "text" => "a", "em" => "text", "text" => "you see" }
which will merely collapse into
{ "em" => "text", "text" => "you see" }
but
[ {"text" => "this"}, {"em" => "is"}, {"text" => "a"}, {"em" => "text"}, {"text" => "you see" } ]
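The collapse is easy to reproduce in plain Ruby. This is a minimal sketch, assuming the mixed content has already been tokenised into ordered name/value pairs; the pairs variable is illustrative data and not part of lutaml-model:

```ruby
# Mixed content of: this <em>is</em> a <em>text</em> you see
# tokenised into ordered [name, value] pairs (illustrative data only).
pairs = [
  ["text", "this"],
  ["em",   "is"],
  ["text", "a"],
  ["em",   "text"],
  ["text", "you see"],
]

# Converting to a Hash keeps only the last value seen for each repeated key.
collapsed = pairs.to_h

# A list of single-entry hashes preserves both order and repetition.
ordered = pairs.map { |name, value| { name => value } }

puts collapsed.inspect
puts ordered.inspect
```

The Hash ends up with two entries, while the list keeps all five in document order.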
Marked up text is not a Hash or an Object. Object-oriented serialisers are not designed for marked up text. And lutaml-model the way it is being pushed very much needs to deal with marked up text.
Just to add that this is a special case of treating XML content as plain text. Normally, angle brackets encoded in XML will appear in escaped form between XML tags.
What is being asked here now is a way to obtain plain text representation of XML content, which is like obtaining the raw YAML syntax key and value formats of a YAML object.
So I think it would be more appropriate to call this feature raw, i.e. obtain the raw/unparsed serialization of a key/tag content.
E.g. the outcome indicated by @opoudjis could be declared as: attribute xxx, :string, raw: true
@ronaldtse I was looking into adding the raw tag to attributes. The issue is that we are using a hash as the intermediary data structure, and it is not possible to convert the hash back to XML or any other format without the mapping rules, which are not present when converting the element to a hash.
We can either save the metadata in the mapping_hash, or I think we can skip the hash generation for XML and use the internal xml_document directly to generate the instance.
What do you suggest?
I vote for mapping_hash, for simplicity of your code. The problem you have is differentiating directives from actual attributes of XML (an XML tag with an actual attribute raw). Goessner notation and JSON-LD deal with that by injecting illegal prefixes into JSON names, to differentiate directives: "#text", "@id".
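To picture the prefix idea, here is a hedged sketch in which directive keys carry a reserved prefix that can never start an XML name. The DIRECTIVE_PREFIX constant, the "#raw" key and the split_directives helper are all hypothetical and are not lutaml-model API:

```ruby
# Hypothetical reserved prefix: "#" can never start an XML name,
# so "#raw" cannot clash with a real XML attribute called "raw".
DIRECTIVE_PREFIX = "#"

# Split a parsed node into [directives, ordinary entries].
def split_directives(node)
  directives, entries = node.partition do |key, _value|
    key.start_with?(DIRECTIVE_PREFIX)
  end
  [directives.to_h, entries.to_h]
end

# "#raw" is a processing directive; "raw" is an actual attribute.
node = { "#raw" => true, "raw" => "an actual XML attribute", "id" => "x1" }
directives, entries = split_directives(node)

puts directives.inspect
puts entries.inspect
```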
Yes we can use prefixes to differentiate attributes. I'm not too concerned with that.
My main concern is that we don't want to process the raw XML like this:
XML => mapping_hash with an object structure => to_xml => string
We should have this:
XML => string
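As a rough illustration of the direct XML => string route, the sketch below uses REXML (a gem bundled with Ruby) instead of lutaml-model's own xml_document, whose API is not shown in this thread. The raw_inner_xml helper is hypothetical:

```ruby
require "rexml/document"

# Return the serialised children of an element as one string,
# without going through any intermediate hash.
# Hypothetical helper, for illustration only.
def raw_inner_xml(element)
  element.children.map(&:to_s).join
end

doc = REXML::Document.new("<address><street>A <a>N</a> B</street></address>")
street = doc.root.elements["street"]
raw = raw_inner_xml(street)

puts raw
```

The nested a element stays in the string as markup, which is the behaviour a raw attribute would need.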
See the following code:
The YAML deserialisation correctly leaves the street and city strings as it found them, with the markup: street remains A <a>N</a> B.
There is a bug in the XML deserialisation. It is too naive about what it will find.
In lutaml-model-0.3.10/lib/lutaml/model/serialize.rb, apply_xml_mapping(doc, instance, options = {}) contains the following code:
Now, <street><a>N</a></street> is mapped to MappingHash{"a"=>{"text"=>"N"}}. That is an internal representation, which this code is meant to convert into the right format.
The condition in that code is true: ::Lutaml::Model::MappingHash {"a"=>{"text"=>"N"}} is a Hash, and attr.type is Lutaml::Model::Attribute.
But value = value["text"] is wrong, especially because value["text"] is nil. As a result, <street><a>N</a></street> is parsed as if it were <street/>: its content is completely ignored.
In case there is space (which there will be in pretty-printed XML), we get an even worse outcome:
is parsed as {"text" => "\n\n", "a"=>{"text"=>"N"}}.
And this value = value["text"] maps value to "\n\n".
value = value["text"] MUST NOT be run if there are any keys in value other than "text" (including if "text" is not a key at all). You must not mangle the contents of parsed text because of an assumption that you have an exhaustively specified information model for any XML you ever encounter. Strings need to be left alone, and coding needs to be defensive.
If this is not possible, then you must allow a :raw processing directive on attributes, which does not attempt to parse the content of such attributes into MappingHash on initial parsing. Although, given that MappingHash is a blanket, generic initial processing of XML structure into a hash, I don't think you will be able to do that at all.
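The rule for value = value["text"] can be sketched as a small guard. This is only an illustration under assumptions: plain Ruby Hashes stand in for MappingHash, and naive_unwrap and safe_unwrap are hypothetical names, not the actual code in serialize.rb:

```ruby
# Naive unwrapping: assumes every Hash is just {"text" => "..."}.
def naive_unwrap(value)
  value.is_a?(Hash) ? value["text"] : value
end

# Defensive unwrapping: collapse only when "text" is the sole key,
# otherwise hand the value back untouched.
def safe_unwrap(value)
  return value unless value.is_a?(Hash)

  value.keys == ["text"] ? value["text"] : value
end

plain  = { "text" => "N" }                             # <street>N</street>
nested = { "a" => { "text" => "N" } }                  # <street><a>N</a></street>
spaced = { "text" => "\n\n", "a" => { "text" => "N" } } # pretty-printed variant

puts naive_unwrap(nested).inspect
puts naive_unwrap(spaced).inspect
puts safe_unwrap(plain).inspect
puts safe_unwrap(nested).inspect
puts safe_unwrap(spaced).inspect
```

The naive version loses the nested content in both marked-up cases, while the guarded version only collapses the plain-text case and leaves the rest for later handling.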