relaton / relaton-bib


Some Relaton flavors generate XML that renders `&`, `<` and `>` as blank spaces in properties that contain them #58

Closed ronaldtse closed 2 years ago

ronaldtse commented 2 years ago

In Nokogiri v1.13 the treatment of the &, < and > signs seems to have changed (correctly). These are invalid XML symbols as content.

This is described in: https://github.com/sparklemotion/nokogiri/issues/2483

Right now any text with &, <, >, etc., will be rendered without the symbols.

It is said that the text method should be used to encode text with these invalid XML characters:

require 'nokogiri'

doc = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
  xml.root do
    xml.string do
      # xml.text inserts a text node, so "&" is serialized as "&amp;"
      xml.text("bar & bar")
    end
  end
end
puts doc.to_xml

The to_xml method in some Relaton flavor gems seems to be losing the & character in text (according to @andrew2net ).

Other than this location: https://github.com/relaton/relaton-bib/blob/cae6e9fd7598f560d14ee94623e7f210e2ab7ac1/lib/relaton_bib/bibliographic_item.rb#L325-L333

There are not many places that define def to_xml (108): https://github.com/search?l=Ruby&q=org%3Arelaton+%22def+to_xml%22&type=Code


Even fewer with "abstract" (42): https://github.com/search?l=Ruby&q=org%3Arelaton+%22abstract%22&type=Code


@andrew2net is currently investigating which flavor and which document this problem originated from.

This task is to add a test for that and fix the behavior.

opoudjis commented 2 years ago

I had different issues with Nokogiri 1.13 (it not dealing with idiosyncratic Word HTML), which I resolved through postprocessing: https://github.com/metanorma/html2doc/issues/69

I think you're going to need a preprocessing step, to escape those characters if standalone, before passing them to the to_xml constructor in every field. (And xml.text(...) sounds like such a preprocessor.)
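A minimal sketch of such a preprocessing step, in plain Ruby. The helper name `escape_standalone` and both patterns are hypothetical and are not part of relaton-bib or Nokogiri. It assumes "standalone" means an `&` that does not start an entity reference and a `<` that does not start a tag; `>` is left untouched in this sketch.

```ruby
# Hypothetical helper: escape "&" and "<" only when they stand alone.
# "&" is treated as standalone unless it starts an entity such as
# "&amp;", "&#38;" or "&#x26;".
STANDALONE_AMP = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);)/

# "<" is treated as standalone unless it is followed by a character that
# can begin a tag, a closing tag, a comment or a processing instruction.
STANDALONE_LT = /<(?![A-Za-z\/!?])/

def escape_standalone(text)
  text.gsub(STANDALONE_AMP, "&amp;").gsub(STANDALONE_LT, "&lt;")
end

puts escape_standalone("R&D <b>x</b> 1 < 2 &amp; ok")
```

Existing markup and already-escaped entities pass through unchanged, so the result can still be appended as raw XML. Where no inline markup is expected, xml.text(...) is the simpler option because it escapes everything.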

andrew2net commented 2 years ago

> The to_xml method in some Relaton flavor gems seems to be losing the & character in text (according to @andrew2net ).

It's fixed in relaton-bib v1.11.5. To escape these special symbols, to_xml needs to use the method xml.text(...). RFC allows HTML tags inside an abstract element, so to parse that element the code needs to use the method xml.at('abstract').inner_html.