...which are removed by Nokogiri's XML parser. The existing code parses HTML as XML, so Nokogiri only recognizes XML's much smaller set of entities. To preserve HTML entities, parsing the text as an HTML fragment seems to work much better, and I have not encountered any issues with the change. (However, I have not been able to run the project tests: they hang after the first test.)
I expect there are other cases where valid HTML is not valid XML, so using the parser in XML mode does not seem appropriate here.