gohugoio / hugo

The world’s fastest framework for building websites.
https://gohugo.io
Apache License 2.0
75.78k stars 7.53k forks source link

Symbol '>' breaks plainify function #9199

Closed goose3228 closed 2 years ago

goose3228 commented 2 years ago

Also affects auto-generated Page.Summary, Page.Plain and so on.

What version of Hugo are you using (hugo version)?

$ hugo version
hugo v0.89.2+extended linux/amd64 BuildDate=unknown

Does this issue reproduce with the latest release?

Yes, with git/master too

Test command (in template):

{{ print "<div data-action='click->my-controller#doThing'>qwe</div>" | plainify }}

Expected output:

qwe

Current output:

my-controller#doThing'qwe

P.S. arrow -> is commonly used in html attributes by stimulus.js framework. The problem is sensible when page starts with interactive shortcode, and summary gets garbage from data attributes.

goose3228 commented 2 years ago

I've had a look at the source, which is quite simple. Is there any reason not to use a library?

khayyamsaleem commented 2 years ago

If nobody has picked this up yet, I'd like to contribute! Though would first clear up the discussion about whether or not to use a library for this, as @goose3228 mentioned.

roointan commented 2 years ago

Working on this. There is a package https://github.com/grokify/html-strip-tags-go that has an extracted version of the unexported stripTags function in html/template/html.go. I've tested it and it works correctly for the html example provided in this issue. But I guess @bep does not agree with adding more dependencies, specially for an extracted code. I'm trying to add code to support the provided example html code in StripHTML(), But if you think using the extracted code is fine, please let me know. (There had been discussion for exporting the stripTags function here: https://github.com/golang/go/issues/5884 with no success)

jmooring commented 2 years ago

There had been discussion for exporting the stripTags function ... with no success

There have been several discussions, including:

Frustrating...

bluemonday is another possibility, previously considered for other tasks:

t := `<div data-action='click->my-controller#doThing'>qwe</div>`
fmt.Println(bluemonday.StrictPolicy().Sanitize(t)) --> qwe
goose3228 commented 2 years ago

But I guess @bep does not agree with adding more dependencies, specially for an extracted code.

Indeed you don't need a library to save 50 lines of code. The point is that current implementation is too simple to be 100% correct.

A few examples:

On the other hand, if you fix my use case it will lead to more "bug triggers". Currently <> symbols are the only trigger i can imagine, but if you fix it, quotes can become a new one. I suggest the library, because html escaping is not that straightforward.

Go's tpl has 3000 lines, and there is reason for such complexity.

roointan commented 2 years ago

I've tried to use bluemonday, but some tests from different packages failed, because of inconsistencies in how space and escaped characters and and some tags are replaced to produce output.

would it be safe if I change tests, so they meet the output of bluemonday? (for example adjusting expected outputs by removing or adding \n )

jmooring commented 2 years ago

I've tried to use bluemonday

I would wait until the project lead approves the inclusion of an additional dependency.

would it be safe if I change tests

That depends on whether the differences are material. Blindly changing the expected results to match the new output is a bit like scoring your own exam.

roointan commented 2 years ago

@jmooring for a matter of discussion, I thought actually trying to use this library and finding any issues in making it work in place of the current stripHtml() function, can help @bep decide easier. And as you've already mentioned above, bluemonday was already being considered as a candidate. I'm trying to move this forward. Though, if bep doesn't agree, it's all good, and respected.

roointan commented 2 years ago

Added a PR including bluemonday library as a proposed solution. I also had to include the "space" characters manipulation logics, and also stripHTMLReplacer. I preferred not to, and have StripHTML() just strip HTML without other manipulations, but that'd break a lot of tests. Maybe this could be refactored another time. Though, I had to tweak one of the old tests a little bit so that it doesn't fail. Added the HTML code from the original post of this issue, and some others as tests.

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.