rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Awful text parsing issue #101

Closed by mhillebrand 1 year ago

mhillebrand commented 1 year ago

When parsing the pairs of <p> and <div> tags below, something odd happens. The first <div>'s text erroneously contains the text for all subsequent <div> tags.

from selectolax.parser import HTMLParser

html = """
<html>
   <body>
      <div class="List">
         <p>3</p>
         <div>tablespoons butter</div>

         <p>1</p>
         <div>cup chopped onion</div>

         <p>6</p>
         <div>large fresh thyme sprigs</div>

         <p>1</p>
         <div>large garlic clove, chopped</div>

         <p>2</p>
         <div>1-pound bags peeled baby carrots</div>

         <p>2</p>
         <div>cups low-salt chicken broth</div>
      </div>
   </body>
</html>
"""

lax = HTMLParser(html)

ingredients = lax.css_first('div.List')

amount_nodes = ingredients.css('p')
amounts = [ing.text() for ing in amount_nodes]

name_nodes = ingredients.css('div')
names = [ing.text() for ing in name_nodes]

print(names[0])

Here's what names looks like:

[image: screenshot of the names list]

However, if I change my parsing logic to the following, everything works as expected:

amount_nodes = lax.css('div.List p')
amounts = [ing.text() for ing in amount_nodes]

name_nodes = lax.css('div.List div')
names = [ing.text() for ing in name_nodes]

Thoughts?

rushter commented 1 year ago

The CSS selector starts from the current node, so you are basically selecting the main div again.

mhillebrand commented 1 year ago

I'm afraid I don't understand. The current node (ingredients) is <div class="List">, correct? Why are <p> tags showing up when I invoke ingredients.css('div')?

rushter commented 1 year ago

Because that's how the CSS selector is applied: with p, you ask for any <p> tags at any level below the current node. When you select div, the first match is the current node itself, and its text includes the text of everything nested inside it.


mhillebrand commented 1 year ago

In the following code, names[0] and names2[0] are identical. I was under the assumption that chaining like lax.css_first('div.List').css('div') was supported. Is that an incorrect assumption?

lax = HTMLParser(html)

names = lax.css_first('div.List').css('div')
names2 = lax.css('div')

rushter commented 1 year ago

It's supported. The current node is included when the CSS selector is executed, so you start at <div class="List">.

mhillebrand commented 1 year ago

Bummer. Okay, I guess I just won't use chaining. Thanks.