lutaml / lutaml-model

LutaML Model is the Ruby data modeler part of the LutaML data modeling suite. It supports creating serialization object models (XML, YAML, JSON, TOML) and mappings to and from them.
Other
2 stars 2 forks source link

Naive normalisation of XML if it contains unknown markup #90

Closed opoudjis closed 1 month ago

opoudjis commented 2 months ago

See the following code:

equire "lutaml/model"
require 'lutaml/model/xml_adapter/nokogiri_adapter'

Lutaml::Model::Config.configure do |config|
  config.xml_adapter = Lutaml::Model::XmlAdapter::NokogiriAdapter
  config.yaml_adapter = Lutaml::Model::YamlAdapter::StandardYamlAdapter
  config.json_adapter = Lutaml::Model::JsonAdapter::StandardJsonAdapter
end

module SimpleModel
  class Address < Lutaml::Model::Serializable
    attribute :street, ::Lutaml::Model::Type::String
    attribute :city, :string
    attribute :address, Address

    xml do
      root "address"
      map_element "street", to: :street 
      map_element "city", to: :city 
    end

  xml = <<~XML
  <address>
  <street><a>N</a></street>
  <city><a>M</a></city>
  </address>
  XML

  yaml = <<~YAML
street: A <a>N</a> B
city: B <a>N</a> B
  YAML

  sample = SimpleModel::Address.from_xml(xml)
  sample = SimpleModel::Address.from_yaml(yaml)
end

The YAML deserialisation correctly leaves the street and city strings as it found them, with the markup: street remains A <a>N</a> B.

There is a bug in the XML deserialisation. It is too naive about what it will find.

lutaml-model-0.3.10/lib/lutaml/model/serialize.rb

apply_xml_mapping(doc, instance, options = {}) contains the following code:

    mappings.each do |rule|
            attr = attributes[rule.to]
            raise "Attribute '#{rule.to}' not found in #{self}" unless attr

            is_content_mapping = rule.name.nil?

            value = if is_content_mapping
                      doc["text"]
                    else
                      doc[rule.name.to_s] || doc[rule.name.to_sym]
                    end 

            value = [value].compact if attr.collection? && !value.is_a?(Array)

            if value.is_a?(Array)
              value = value.map do |v|
                v.is_a?(Hash) && !(attr.type <= Serialize) ? v["text"] : v
              end
            elsif !(attr.type <= Serialize) && value.is_a?(Hash) && attr.type != Lutaml::Model::Type::Hash
              value = value["text"]
            end

Now, <street><a>N</a></street> is mapped to MappingHash {"a"=>{"text"=>"N"}}. That is an internal representation, which this code is meant to convert into the right format.

!(attr.type <= Serialize) && value.is_a?(Hash) && attr.type != Lutaml::Model::Type::Hash

is true: ::Lutaml::Model::MappingHash {"a"=>{"text"=>"N"}} is a Hash, and attr.type is Lutaml::Model::Attribute

But value = value["text"] is wrong. Especially because value["text"] is nil. As a result of this, <street><a>N</a></street> is parsed as if it is <street/>: its content is completely ignored.

In case there is space (which there will be in pretty-printed XML), we get an even worse outcome:

<street>

<a>N</a>
</street>

is parsed as {"text" => "\n\n", "a"=>{"text"=>"N"}}. And this value = value["text"] maps value to "\n\n".

value = value["text"] MUST NOT be run if there are any keys in value other than "text" (including if "text" is not a key at all.) You must not mangle the contents of parsed text, because of an assumption that you have an exhaustively specified information model for any XML you ever encounter. Strings need to be left alone, and coding needs to be defensive.

If this is not possible, then you must allow a :raw processing directive on attributes, which does not attempt to parse the content of such attributes into MappingHash on initial parsing. Although given that MappingHash is a blanket, generic initial processing of XML structure into a hash, I don't think you will be able to do that at all.

opoudjis commented 2 months ago

In addition, if you will still process such strings as MappingHash, you must assume a mixed document model for the string, and you must allow tags to be repeated—meaning, you want a list of key-value pairs, and not a hash as you have now:

this <em>is</em> a <em>text</em> you see

goes not to

{ "text" => "this", "em" => "is", "text" => "a", "em" => "text", "text" => "you see" }

which will merely collapse into

{ "em" => "text", "text" => "you see" }

but

[ {"text" => "this"}, {"em" => "is"}, {"text" => "a"}, {"em" => "text"}, {"text" => "you see" } ]

Marked up text is not a Hash or an Object. Object-oriented serialisers are not designed for marked up text. And lutaml-model the way it is being pushed very much needs to deal with marked up text.

ronaldtse commented 2 months ago

Just to add that this is a special case of treating XML content as plain text. Normally, angular brackets when encoded in XML will appear in escaped form between XML tags.

What is being asked here now is a way to obtain plain text representation of XML content, which is like obtaining the raw YAML syntax key and value formats of a YAML object.

So I think it would be more appropriate to call this feature raw, ie. obtain the raw/unparsed serialization of a key/tag content.

Eg. This outcome indicated by @opoudjis could be attribute xxx, :string, raw: true.

HassanAkbar commented 2 months ago

@ronaldtse I was looking into adding the raw tag to attributes and the issue is that we are using hash as the intermediary data structure and it is not possible to convert the hash back to xml or any other format without the mapping rules, which are not present when converting the element to hash.

We can either save the metadata in the mapping_hash or I think we can skip the hash generation for xml and use the internal xml_document directly to generate the instance.

What do you suggest?

opoudjis commented 2 months ago

I vote for mapping_hash, for simplicity of your code. The problem you have is differentiating directives from actual attributes of XML (an XML tag with an actual attribute raw). Goessner notation and JSON-LD deal with that by injecting illegal prefixes into JSON names, to differentiate directives: "#text", "@id".

ronaldtse commented 2 months ago

Yes we can use prefixes to differentiate attributes. I'm not too concerned with that.

My main concern is that we don't want to process the raw XML like this:

XML => mapping_hash with an object structure => to_xml => string

We should have this:

XML => string