Open benjamin-kirkbride opened 1 year ago
To be clear, I'm not asking "why does it do this?"; I'm asking "is this the expected behavior? is this a feature or a bug?"
A somewhat more dramatic example:
echo "hello <b>world</b>" | strip-tags b b b b b
worldworldworldworldworld
So.. this is a hard problem. This behavior is actually almost the same as bs4's find_all
method:
from bs4 import BeautifulSoup
html_doc = '''
<html>
<body>
<h1>My HTML Document</h1>
<p class="content">This is the first paragraph.</p>
<p class="content">This is the second paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(["p", "li", "ul"])
[<p class="content">This is the first paragraph.</p>, <p class="content">This is the second paragraph.</p>, <ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>, <li>Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
The difference is the redundant selectors appear to get truncated.
This may need to be considered a feature, not a bug...
Okay I have done some digging and given this a good bit of thought. There are a number of different facets of this. Some of them are (IMO) definitely not intended.
TL:DR: We should require a string that can be parsed by https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property for the "selectors" argument of strip-tags
.
```html
This is the first paragraph.
This is the second paragraph.
```python from pathlib import Path from bs4 import BeautifulSoup soup = BeautifulSoup( (Path(__file__).parent / "simple.html").read_text(), "html5lib", multi_valued_attributes=False, ) # whatever selections/finds against `soup` # print ```
$ strip-tags b b b b < scratch/simple.html
World!World!World!World!
$ strip-tags "#item1" "#item1" "#item1" < scratch/simple.html
Item 1
Item 1
Item 1
I think that this is a bug, but it is also very easy to resolve.
Compare the above behavior to how
Compare the above behavior to the behavior of soup.select("b,b,b,b")
:
[<b>World!</b>]
and against soup.find_all(["b","b","b","b"])
[<b>World!</b>]
(note find_all
doesn't work with CSS selectors, so it's not a apples to apples comparison, but it is illustrative of how these things are handled, seemingly).
This first one is actually really easy to solve AFAIK; change for selector in selectors
to for selector in set(selectors)
, or something to that effect.
There is currently a list of elements that receive a newline, all other elements do not receive any whitespace. This can be problematic:
$ strip-tags a < scratch/simple.html
Example ComExample Net
I think all elements should be separated by a space at a minimum, but maybe I'm missing something here.
strip-tags
output depends on the order of the selectors:
$ strip-tags a div < scratch/simple.html
Example ComExample NetHello World!
$ strip-tags div a < scratch/simple.html
Hello World!
Example ComExample Net
Compare this to:
print(soup.select("a,div") == soup.select("div,a")) # same for `find_all`
# True
The order appears to be based on the order in the document the elements are found:
print(soup.select("a,div,title,h1,#item1"))
[<title>My Page</title>, <h1>My HTML Document</h1>, <div>Hello <b>World!</b></div>, <li id="item1">Item 1</li>, <a href="https://www.example.com">Example Com</a>, <a href="https://www.example.net">Example Net</a>]
I think that for the intended use of strip-tags
we would want the order of the output to both be dependent on the originality of the document, not of the selectors. I think the easiest way to achieve this would probably be to use the select
method with a comma separated string of the selectors used.
Having the "selectors" argument of strip-tags
just be passed directly to select
makes the most sense to solve this IMO.
This I think is actually a "feature" not a bug. From my understanding, it's the nature of the beast. The enhanced flexibility that passing a string to the select
method will provide us will help quite a bit with this, for advanced users who truly need to be able to do such things.
This is the actual problem that was faced here btw: https://github.com/simonw/strip-tags/pull/18#issuecomment-1600465929
I am going to start experimenting with an implementation that uses select
with a selector string directly hopefully tomorrow. Let me know if you have any opposing or supplemental thoughts or input in the meantime.
This was not as straight forward as I had thought. select
is not a magic bullet for this issue.
It does address:
It does not address:
output = soup.select("b,i")
# <p>Hello <b><i>Beautiful</i> World</b></p>
print("".join([node.text for node in output]))
# Beautiful WorldBeautiful
strip-tags
I think that "Lack of Whitespace Between Results" and "Duplication of Nested Elements with Nested Results of Selectors" are both not bugs, unfortunately. We just need to discourage users from ending up in that situation with good docs and a FAQ or something.
I would expect the output of this to be
Beautiful World
.Thoughts?