jorgemanrubia / truncato

A tool for truncating HTML strings efficiently
MIT License

Issues with Nokogiri 1.13.5 and 1.13.6 #20

Closed: Adsidera closed this issue 2 years ago

Adsidera commented 2 years ago

After upgrading Nokogiri to version 1.13.5 (or 1.13.6), Truncato returns malformed output:

Truncato.truncate "<p>some text</p>", max_length: 4
# => "&lt;<p>...</p>"

Truncato.truncate "<p>some text</p>"
# => "&lt;<p>__truncato_root__&gt;</p><p>s...</p>"

Can you please advise?

franzliedke commented 2 years ago

In case this helps: The Nokogiri changelog lists this as a known breakage (part of a security fix):

  • [CRuby] The libxml2 HTML parser in v2.9.14 recovers from some broken markup differently. Notably, the XML CDATA escape sequence <![CDATA[ and incorrectly-opened comments will result in HTML text nodes starting with &lt;! instead of skipping the invalid tag. This behavior is a direct result of the quadratic-behavior fix noted above. The behavior of downstream sanitizers relying on this behavior will also change. Some tests describing the changed behavior are in test/html4/test_comments.rb.

So apparently we're dealing with broken markup here? Is that intended? (I have not looked into the Truncato code yet.)
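
One possible reading of the output in the report is that the wrapper tag named `__truncato_root__` is no longer recognized as a tag and is emitted as text. In the HTML standard's tokenizer, a `<` opens a start tag only when the next character is an ASCII letter. The sketch below applies that rule to a tag name. The helper `opens_html_tag?` is hypothetical (it is not part of Truncato or Nokogiri), and whether libxml2 2.9.14 applies exactly this rule is an assumption, not something confirmed in this thread.

```ruby
# Hypothetical helper, not part of Truncato or Nokogiri.
# HTML standard rule: "<" starts a tag only if followed by an ASCII letter;
# otherwise the "<" is treated as ordinary text.
def opens_html_tag?(name)
  name.match?(/\A[A-Za-z]/)
end

puts opens_html_tag?("p")                 # true
puts opens_html_tag?("__truncato_root__") # false
```

If that assumption holds, any wrapper name starting with a letter would avoid the problem.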

mattyoho commented 2 years ago

FWIW, in a Rails context, tossing this in an initializer will avoid the bug:

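# silence_warnings (ActiveSupport) suppresses the
# "already initialized constant" warning for the reassignment below.
silence_warnings do
  Truncato::ARTIFICIAL_ROOT_NAME = "truncato-artificial-root".freeze
end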

That will override the gem-defined value, which is probably an invalid tag name due to the underscores: https://github.com/jorgemanrubia/truncato/blob/7b93028ce9988810d3f95d513b7bc60f0a8fe7bd/lib/truncato/truncato.rb#L9
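
Outside Rails, where `silence_warnings` is not available, the same override could be sketched in plain Ruby by removing the constant before setting it again. The `Truncato` module defined here is only a stand-in so the example runs on its own, and its initial value is an assumption inferred from the output quoted in the report. In a real application you would `require "truncato"` instead. This also assumes the gem reads the constant at call time rather than capturing it at load time.

```ruby
# Stand-in for the gem's module so the sketch runs without Truncato installed.
# The initial value is assumed from the output quoted earlier in this thread.
module Truncato
  ARTIFICIAL_ROOT_NAME = "__truncato_root__"
end

# Removing the constant first avoids Ruby's "already initialized constant"
# warning, so no silence_warnings wrapper is needed.
Truncato.send(:remove_const, :ARTIFICIAL_ROOT_NAME)
Truncato.const_set(:ARTIFICIAL_ROOT_NAME, "truncato-artificial-root".freeze)

puts Truncato::ARTIFICIAL_ROOT_NAME
```

Like the initializer version, this is a stopgap until the gem itself ships a corrected name.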

PR opened here: https://github.com/jorgemanrubia/truncato/pull/21

jorgemanrubia commented 2 years ago

Fixed via https://github.com/jorgemanrubia/truncato/pull/21.