gjtorikian / html-pipeline

HTML processing filters and utilities
MIT License
2.26k stars 380 forks source link

Question about original and new SanitizationFilter #406

Open digitalmoksha opened 1 month ago

digitalmoksha commented 1 month ago

In the original SanitizationFilter, which uses Sanitize, you had two transformers

        transformers: [
          # Top-level <li> elements are removed because they can break out of
          # containing markup.
          lambda { |env|
            name = env[:node_name]
            node = env[:node]
            if name == LIST_ITEM && node.ancestors.none? { |n| LISTS.include?(n.name) }
              node.replace(node.children)
            end
          },

          # Table child elements that are not contained by a <table> are removed.
          lambda { |env|
            name = env[:node_name]
            node = env[:node]
            if (TABLE_SECTIONS.include?(name) || TABLE_ITEMS.include?(name)) && node.ancestors.none? { |n| n.name == TABLE }
              node.replace(node.children)
            end
          }
        ].freeze

I don't see these in the new version based on Selma. Are these transformations no longer needed?

I assume that any transformations would need to be written as Selma handlers. I don't see a way to add those to the sanitization config, so I'm guessing these would need to be node filters.

gjtorikian commented 1 month ago

If I remember correctly, at the time, I thought that this seemed unnecessary, since as you said the same action can be accomplished with node filters. As I look at this again, though, I could see a use case for some kind of config to simplify node addition/removal. Something like (spitballing):

remove_unsemantic: ["li", Config::TABLE_CHILD_ELEMENTS]

Since I can imagine some people would use this gem to create well-formed markup, perhaps the gem should just make it much easier to indicate whether you want parents or children of a certain node moved/replaced/deleted/etc.

Does that answer your question, or did I misunderstand?

digitalmoksha commented 1 month ago

I guess initially I was curious if you thought those conditions still needed to be sanitized. If so, it might make sense to include those in the filter, as handlers passed into Selma::Rewriter.new(sanitizer: sanitization_config).

Since I can imagine some people would use this gem to create well-formed markup, perhaps the gem should just make it much easier to indicate whether you want parents or children of a certain node moved/replaced/deleted/etc.

I can see the usefulness of that. Though currently my transformers are more along the lines of remove the class unless the class == 'anchor', or remove the id unless it's a footnote or other parser generated id. I think I will need handlers for those.

For now I plan on using Selma directly for sanitization until I can do the much heavier lift of upgrading html-pipeline