simonw / strip-tags

CLI tool for stripping tags from HTML
Apache License 2.0
209 stars 6 forks source link

Unexpected behavior for multiple nested tag selectors #21

Open benjamin-kirkbride opened 1 year ago

benjamin-kirkbride commented 1 year ago
$ echo "Hello <b><i>Beautiful</i> World</b>" | strip-tags b i
Beautiful WorldBeautiful

I would expect the output of this to be Beautiful World.

Thoughts?

benjamin-kirkbride commented 1 year ago

To be clear, I'm not asking "why does it do this?"; I'm asking "is this the expected behavior? is this a feature or a bug?"

benjamin-kirkbride commented 1 year ago

A somewhat more dramatic example:

echo "hello <b>world</b>" | strip-tags b b b b b
worldworldworldworldworld
benjamin-kirkbride commented 1 year ago

So.. this is a hard problem. This behavior is actually almost the same as bs4's find_all method:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<body>
<h1>My HTML Document</h1>
<p class="content">This is the first paragraph.</p>
<p class="content">This is the second paragraph.</p>
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ul>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(["p", "li", "ul"])
[<p class="content">This is the first paragraph.</p>, <p class="content">This is the second paragraph.</p>, <ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>, <li>Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

The difference is the redundant selectors appear to get truncated.

This may need to be considered a feature, not a bug...

benjamin-kirkbride commented 1 year ago

Okay I have done some digging and given this a good bit of thought. There are a number of different facets of this. Some of them are (IMO) definitely not intended.

TL:DR: We should require a string that can be parsed by https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property for the "selectors" argument of strip-tags.

HTML Used for All Below Examples

```html My Page

My HTML Document

Hello World!

This is the first paragraph.

This is the second paragraph.

  • Item 1
  • Item 2
  • Item 3
Example Com Example Net ```

BeautifulSoup Used for Examples

```python from pathlib import Path from bs4 import BeautifulSoup soup = BeautifulSoup( (Path(__file__).parent / "simple.html").read_text(), "html5lib", multi_valued_attributes=False, ) # whatever selections/finds against `soup` # print ```

Redundant selector duplication

$ strip-tags b b b b < scratch/simple.html 
World!World!World!World!
$ strip-tags "#item1" "#item1" "#item1" < scratch/simple.html 
Item 1
Item 1
Item 1

I think that this is a bug, but it is also very easy to resolve.

Compare the above behavior to how

Compare the above behavior to the behavior of soup.select("b,b,b,b"):

[<b>World!</b>]

and against soup.find_all(["b","b","b","b"])

[<b>World!</b>]

(note find_all doesn't work with CSS selectors, so it's not a apples to apples comparison, but it is illustrative of how these things are handled, seemingly).

This first one is actually really easy to solve AFAIK; change for selector in selectors to for selector in set(selectors), or something to that effect.

Lack of Whitespace Between Results

There is currently a list of elements that receive a newline, all other elements do not receive any whitespace. This can be problematic:

$ strip-tags a < scratch/simple.html 
Example ComExample Net

I think all elements should be separated by a space at a minimum, but maybe I'm missing something here.

Output Order Dependent on Order of Selectors

strip-tags output depends on the order of the selectors:

$ strip-tags a div < scratch/simple.html 
Example ComExample NetHello World!
$ strip-tags div a < scratch/simple.html 
Hello World!
Example ComExample Net

Compare this to:

print(soup.select("a,div") == soup.select("div,a")) # same for `find_all`
# True

The order appears to be based on the order in the document the elements are found:

print(soup.select("a,div,title,h1,#item1"))
[<title>My Page</title>, <h1>My HTML Document</h1>, <div>Hello <b>World!</b></div>, <li id="item1">Item 1</li>, <a href="https://www.example.com">Example Com</a>, <a href="https://www.example.net">Example Net</a>]

I think that for the intended use of strip-tags we would want the order of the output to both be dependent on the originality of the document, not of the selectors. I think the easiest way to achieve this would probably be to use the select method with a comma separated string of the selectors used.

Having the "selectors" argument of strip-tags just be passed directly to select makes the most sense to solve this IMO.

Duplication of Nested Elements with Nested Results of Selectors

This I think is actually a "feature" not a bug. From my understanding, it's the nature of the beast. The enhanced flexibility that passing a string to the select method will provide us will help quite a bit with this, for advanced users who truly need to be able to do such things.

This is the actual problem that was faced here btw: https://github.com/simonw/strip-tags/pull/18#issuecomment-1600465929

benjamin-kirkbride commented 1 year ago

I am going to start experimenting with an implementation that uses select with a selector string directly hopefully tomorrow. Let me know if you have any opposing or supplemental thoughts or input in the meantime.

benjamin-kirkbride commented 1 year ago

This was not as straight forward as I had thought. select is not a magic bullet for this issue.

It does address:

It does not address:

benjamin-kirkbride commented 1 year ago

I think that "Lack of Whitespace Between Results" and "Duplication of Nested Elements with Nested Results of Selectors" are both not bugs, unfortunately. We just need to discourage users from ending up in that situation with good docs and a FAQ or something.