lutaml / lutaml-model

LutaML Model is the Ruby data modeler part of the LutaML data modeling suite. It supports creating serialization object models (XML, YAML, JSON, TOML) and mappings to and from them.
Other
2 stars 2 forks source link

(URGENT) Mixed content at root not working #57

Closed andrew2net closed 1 week ago

andrew2net commented 2 months ago

I'm trying to use the mixed content option to parse the XML:

<docidentifier type="BIPM" scope="part" primary="true">CIPM 43<sup>e</sup> réunion (1950)</docidentifier>

The class:

module Relaton
  module Model
    class Docidentifier < Lutaml::Model::Serializable

      attribute :type, Lutaml::Model::Type::String
      attribute :scope, Lutaml::Model::Type::String
      attribute :primary, Lutaml::Model::Type::Boolean
      attribute :sup, Lutaml::Model::Type::String

      xml do
        root "docidentifier", mixed: true
        map_element "sup", to: :sup
        map_attribute "type", to: :type
        map_attribute "scope", to: :scope
        map_attribute "primary", to: :primary
      end
    end
  end
end

The result:

> docid = Relaton::Model::Docidentifier.from_xml xml
#class = Relaton::Model::Docidentifier
@element_order = ["text", :sup, "text"]
@ordered = true
@primary = true
@scope = "part"
@sup = "e"
@type = "BIPM"

> docid.to_xml
"<docidentifier type=\"BIPM\" scope=\"part\" primary=\"true\">\n" + "  <sup>e</sup>\n" + "</docidentifier>"

The text content disappears. What is wrong here?

ronaldtse commented 2 months ago

Not sure what the error is but in the mixed content the elements like “sup” might be supposed to be “collection”.

andrew2net commented 2 months ago

@ronaldtse Adding the collection option doesn't change anything. The example doesn't have the option too.

ronaldtse commented 2 months ago

@andrew2net this will be handled by @HassanAkbar in #58 . Once #58 passes, this should work.

HassanAkbar commented 2 months ago

@andrew2net the content is not mapped to anything that is why it is not showing in the output, for mapping content in xml we need to use the map_content to: <attribute name>.

So the above class will be changed to

module Relaton
  module Model
    class Docidentifier < Lutaml::Model::Serializable

      attribute :type, Lutaml::Model::Type::String
      attribute :content, Lutaml::Model::Type::String
      attribute :scope, Lutaml::Model::Type::String
      attribute :primary, Lutaml::Model::Type::Boolean
      attribute :sup, Lutaml::Model::Type::String

      xml do
        root "docidentifier", mixed: true
        map_element "sup", to: :sup

        # for mapping content
        map_content to: :content

        map_attribute "type", to: :type
        map_attribute "scope", to: :scope
        map_attribute "primary", to: :primary
      end
    end
  end
end
andrew2net commented 2 months ago

@HassanAkbar in the doc described same case without mapping content. Please update the doc.

It works, but I get another problem with Separate serialization model

module Relaton
  module Bib
    class Docidentifier
      attr_accessor :type, :scope, :content
      ...

module Relaton
  module Model
    class Docidentifier < Lutaml::Model::Serializable
      model ::Relaton::Bib::Docidentifier
      ...

We need Relaton::Bib::Docidentifier@content to be set to "CIPM 43<sup>e</sup> réunion (1950)" when the docidentifier element is parsed

xml = <<~XML
  <docidentifier language="en" script="Latn" locale="EN-us" type="BIPM" scope="part" primary="true">
    CIPM 43<sup>e</sup> réunion (1950)
  </docidentifier>
XML

docid = Relaton::Model::Docidentifier.from_xml xml

but the Relaton::Model::Docidentifier required to have sup= method in Relaton::Bib::Docidentifier. After adding the sup attribute I got:

docid.content
=> ["\n" + "  CIPM 43", " r\u00E9union (1950)\n"]

docid.sup
=> ["e"]

Is there a way to solve this problem?

ronaldtse commented 2 months ago

Looks like the content attribute needs to be specified as mixed? In general, we have not refined a good way of defining a mixed content model as easily done in RNC grammar. Something we need to work on.

andrew2net commented 2 months ago

@ronaldtse mixed content also doesn't work.

I think we can make our life more easier using Nokogiri's feature that escapes all XML markups inside text nodes.

id = Relaton::Bib::Docidentifier.new(type: "BIPM", content: "CIPM 43<sup>e</sup> réunion (1950)")

xml = Relaton::Model::Docidentifier.to_xml id
=> "<docidentifier type=\"BIPM\">CIPM 43&lt;sup&gt;e&lt;/sup&gt; r\u00E9union (1950)</docidentifier>"

id2 = Relaton::Model::Docidentifier.from_xml xml

puts id2.content
CIPM 43<sup>e</sup> réunion (1950)
ronaldtse commented 2 months ago

@andrew2net Are you asking for an option to treat XML content as text?

What kind of API are you looking for the Bib::Docidentifier and Model::Docidentifier object?

Can you come up with some Ruby code showing what exactly you want to see?

andrew2net commented 2 months ago

@ronaldtse I want to see <sup> and <sub> tags in BIPM IDs.

id = Relaton::Model::Docidentifier.from_xml(xml)

id.class
=> Relaton::Bib::Docidentifier

id.content
=> "CIPM 43<sup>e</sup> réunion (1950)"

But we also have LocalizedMarkedUpString which can contain many other XML tags. A lot of Relaton model's entities inherit from the LocalizedMarkedUpString. I've tried to implement all the elements with lutaml-model and it turned out to be much more difficult than without it.

With treating the LocalizedMarkedUpString content as text we don't need to implement all the TextElement entities. The con is that the marked up text won't be validated. If that's ok then we don't need to do anything in lutaml-model to solve this issue, and can close this issue.

ronaldtse commented 2 months ago

@andrew2net so what should @HassanAkbar do here to fix the problem? I'm a bit confused.

So the problem is when using a separate model, the assigned content is no longer a string but is parsed like XML?

andrew2net commented 2 months ago

@ronaldtse yes, content in the separated model should be CIPM 43<sup>e</sup> réunion (1950) but lutaml-model returns ["\n" + " CIPM 43", " r\u00E9union (1950)\n"]

We can solve this issue by handling these IDs as String. In other words the ID in XML output will be escaped:

<docidentifier type="BIPM">CIPM 43&lt;sup&gt;e&lt;/sup&gt; r\u00E9union (1950)</docidentifier>

Nokogiri can encode and decode such content. I tested it and got desired content in separated model.

So we don't need to handle any tags in text element (there are a lot other tags for content defined in the grammar).

But the escaped tags aren't validated by Nokogiri. If that's ok then we can use this approach

ronaldtse commented 2 months ago

@andrew2net Clearly we need to parse the content as a proper text model that checks its grammar/validity. Handling it as a string is a bad hack.

What is the text model there though?

Supposed it is a class called RichText, then we can just point :content to be casted as RichText?

andrew2net commented 1 month ago

@ronaldtse docidentifier's content is LocalizedMarkedUpString which is collection of TextElement I tried to create TextElement but it doesn't returns string with tags

class TextElement < Lutaml::Model::Serializable
  attribute :sup, :string
  attribute :sub, :string
  attribute :text, :string
  xml do
    root 'text-element', mixed: true
    map_content to: :text
    map_element "sup", to: :sup
    map_element "sub", to: :sub
  end
end

class Id
  attr_accessor :id
end

class Docid < Lutaml::Model::Serializable
  model Id
  attribute :id, TextElement
  xml do
    root 'docid'
    map_content to: :id
  end
end

> docid = Docid.from_xml '<docid>Str<sub>2</sub>text<sup>1</suo>123</docid>'
 => #<Id:0x000000012456c500 @id=["Str", "text", "123"]>

# it returns array of strings without tags
> docid.id
 => ["Str", "text", "123"]

# tags are lost
> Docid.to_xml docid
 => "<docid>Strtext123</docid>" 

What am I doing wrong?

andrew2net commented 1 month ago

@HassanAkbar can you help with this issue?

HassanAkbar commented 1 month ago

@andrew2net Sure, I'm looking into it right now

HassanAkbar commented 1 month ago

@andrew2net To delegate the attributes to another model we need to use the delegate keyword.

require "lutaml/model"
require "lutaml/model/xml_adapter/nokogiri_adapter"

Lutaml::Model::Config.xml_adapter = Lutaml::Model::XmlAdapter::NokogiriAdapter

class TextElement < Lutaml::Model::Serializable
  attribute :sup, :string
  attribute :sub, :string
  attribute :text, :string
  xml do
    root 'text-element', mixed: true
    map_content to: :text
    map_element "sup", to: :sup
    map_element "sub", to: :sub
  end
end

class Id
  attr_accessor :id
end

class Docid < Lutaml::Model::Serializable
  model Id
  attribute :id, TextElement
  xml do
    root 'docid'
    map_content to: :text, delegate: :id
    map_element :sub, to: :sub, delegate: :id
    map_element :sup, to: :sup, delegate: :id
  end
end

> docid = Docid.from_xml '<docid>Str<sub>2</sub>text<sup>1</suo>123</docid>'
 => #<Id:0x0000000108b253f0 @id=#<TextElement:0x0000000108b250f8 @sub="2", @sup="1", @text=["Str", "text", "123"], @validate_on_set=false>> 

But when we do something like Docid.to_xml docid it crashes because of the issue you mentioned here -> https://github.com/lutaml/lutaml-model/issues/83.

I'm currently working on it and will let you know as soon as it is resolved.

andrew2net commented 1 month ago

@HassanAkbar it doesn't work:

> docid = Docid.from_xml '<docid>Str<sub>2</sub>text<sup>1</suo>123</docid>'
gems/ruby-3.3.3/gems/lutaml-model-0.3.10/lib/lutaml/model/serialize.rb:317:in `block in apply_xml_mapping': Attribute 'sub' not found in Docid (RuntimeError)
HassanAkbar commented 1 month ago

@andrew2net I have fixed the issue, it should work after https://github.com/lutaml/lutaml-model/pull/91 is merged.

andrew2net commented 1 month ago
> docid = Docid.from_xml '<docid>Str<sub>2</sub>text<sup>1</suo>123</docid>'
 => #<Id:0x0000000108b253f0 @id=#<TextElement:0x0000000108b250f8 @sub="2", @sup="1", @text=["Str", "text", "123"], @validate_on_set=false>> 

@HassanAkbar the id attribute is set to TextElement instance, but it needs to be String. In this case Str<sub>2</sub>text<sup>1</suo>123. How can we achieve it?

andrew2net commented 1 month ago

@HassanAkbar the TextElement has 20+ elements and is used in dozens cases as mixed content. This solution requires to delegate each TextElement element in each case where it is used:

    map_element :sub, to: :sub, delegate: :id
    map_element :sup, to: :sup, delegate: :id

We also have PureTextElement, NestedTextElement, and some other, similar to TextElement elements. This requires writing thousands delegation lines of code. Can we avoid delegating each element of mixing content?

HassanAkbar commented 1 month ago

@HassanAkbar the id attribute is set to TextElement instance, but it needs to be String. In this case Str<sub>2</sub>text<sup>1</suo>123. How can we achieve it?

@andrew2net this looks like a use case for raw option for attribute which is being added in this PR -> https://github.com/lutaml/lutaml-model/pull/94. Will let you know once this is finalized.

ronaldtse commented 1 month ago

@andrew2net @HassanAkbar

Why not just:

class TextElement < Lutaml::Model::Serializable
  attribute :sup, …
end

class Docid < Lutaml::Model::Serializable
  attribute :my_text_content, TextElement
end
HassanAkbar commented 1 month ago

@andrew2net @HassanAkbar

Why not just:

class TextElement < Lutaml::Model::Serializable
  attribute :sup, …
end

class Docid < Lutaml::Model::Serializable
  attribute :my_text_content, TextElement
end

@ronaldtse the issue with this is that in xml we have different mappings for attributes, elements and content. So currently we can not decide about which of the above we need to send to the my_text_content.

I think what we can do here is, make another tag like map_raw_content or map_all or simply map to map the entire tag's content as string to the attribute. What do you suggest?

ronaldtse commented 1 month ago

the issue with this is that in xml we have different mappings for attributes, elements and content. So currently we can not decide about which of the above we need to send to the my_text_content.

I don't really understand this. Can you provide an example XML with a Lutaml::Model code snippet to illustrate?

HassanAkbar commented 1 month ago

the issue with this is that in xml we have different mappings for attributes, elements and content. So currently we can not decide about which of the above we need to send to the my_text_content.

I don't really understand this. Can you provide an example XML with a Lutaml::Model code snippet to illustrate?

class TextElement < Lutaml::Model::Serializable
  attribute :sup, …
end

class Docid < Lutaml::Model::Serializable
  attribute :my_text_content, TextElement
end
<docid>
  Str <sub>2</sub> text <sup>1</suo> 123
</docid>

@ronaldtse In this example above the xml for docid has

and we need to define the mappings for all elements, attributes and content in the docid to map them to the proper attributes. From what I understand from the @andrew2net example is that this needs to happen for lots of elements that have lots of tags and mapping all those tags has no benefit so mapping them to a string is sufficient.

So, what I'm proposing is adding a tag like map_all to: :raw_text which will map all the content inside the tag to that attribute. e.g

class Docid < Lutaml::Model::Serializable
  attribute :my_text_content, :string

  xml do
    map_all to: : my_text_content
  end
end

docid = Docid.from_xml '<docid>Str<sub>2</sub>text<sup>1</suo>123</docid>'
# <Docid:0x0000000108b253f0 @my_text_content="Str<sub>2</sub>text<sup>1</suo>123", @validate_on_set=false> 

we can rename map_all to something more fitting but this is a rough idea I have in mind for this.

andrew2net commented 1 month ago

@HassanAkbar the last example looks good to me

ronaldtse commented 3 weeks ago

@HassanAkbar Got it, then let's move ahead with map_all. We also need to ensure that the other map_* methods can't be used together with map_all.

HassanAkbar commented 2 weeks ago

We also need to ensure that the other map_* methods can't be used together with map_all.

@ronaldtse Got it, will add a validation for that