kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

BeautifulSoup treebuilder not bringing in element classes properly? #8

Closed bradbeattie closed 6 years ago

bradbeattie commented 6 years ago

Maybe I'm mistaken in thinking I can use html5-parser as a faster drop-in replacement for bs4's own parsers. Both assertions in this test fail with html5-parser:

import bs4
import html5_parser

html = """<html><body><table class="table_class"></table></body></html>"""
expected_attrs = {"class": ["table_class"]}

def test_soup(soup):
    table = soup.select("table")[0]
    assert table.attrs == expected_attrs, table.attrs
    table = soup.select("table.table_class")[0]
    assert table.attrs == expected_attrs, table.attrs

test_soup(bs4.BeautifulSoup(html, "html.parser"))
test_soup(html5_parser.parse(html, treebuilder="soup"))

First, soup.select("table.table_class") doesn't find anything with html5-parser. Further, when searching with just soup.select("table"), the class attribute isn't a list, but is instead a space-delimited string.
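To make the string-versus-list difference concrete, here is a minimal sketch of a workaround that splits space-delimited values into lists on a plain attrs dict. The `normalize_attrs` helper and its `multi_valued` parameter are hypothetical names introduced for illustration; they are not part of bs4 or html5-parser. It only addresses the attribute comparison, not the `select("table.table_class")` lookup.

```python
def normalize_attrs(attrs, multi_valued=("class",)):
    """Return a copy of attrs with multi-valued attributes as lists.

    Hypothetical helper for illustration only; `multi_valued` defaults to
    just "class", which is the attribute discussed in this issue.
    """
    normalized = {}
    for name, value in attrs.items():
        if name in multi_valued and isinstance(value, str):
            # "a b" -> ["a", "b"]; split() with no argument splits on any whitespace
            normalized[name] = value.split()
        else:
            # Leave other attributes, and values that are already lists, untouched
            normalized[name] = value
    return normalized


print(normalize_attrs({"class": "table_class"}))  # {'class': ['table_class']}
```

With this, comparing `normalize_attrs(table.attrs)` against `expected_attrs` would pass whether the tree builder stored the class as a string or as a list.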

kovidgoyal commented 6 years ago

I use bs3 myself, so I am not particularly familiar with bs4, but doing:

python3 -c "import bs4; print(bs4.Tag(name='x', attrs={'class':'a b'}).attrs)"
{'class': 'a b'}

That doesn't give a list either, which is the behavior I'd expect from bs3. So I'm not sure what html5-parser is doing wrong here, if anything.

bradbeattie commented 6 years ago

It might well be that the expected use of the Tag constructor is python3 -c "import bs4; print(bs4.Tag(name='x', attrs={'class':['a', 'b']}).attrs)", but not being at my terminal at the moment, I can't verify this suspicion.

kovidgoyal commented 6 years ago

Already fixed.