Awful text parsing issue

mhillebrand commented 1 year ago

When parsing the pairs of <p> and <div> tags below, something odd happens. The first <div>'s text erroneously contains the text for all subsequent <div> tags.

from selectolax.parser import HTMLParser

html = """
<html>
   <body>
      <div class="List">
         <p>3</p>
         <div>tablespoons butter</div>

         <p>1</p>
         <div>cup chopped onion</div>

         <p>6</p>
         <div>large fresh thyme sprigs</div>

         <p>1</p>
         <div>large garlic clove, chopped</div>

         <p>2</p>
         <div>1-pound bags peeled baby carrots</div>

         <p>2</p>
         <div>cups low-salt chicken broth</div>
      </div>
   </body>
</html>
"""

lax = HTMLParser(html)

ingredients = lax.css_first('div.List')

amount_nodes = ingredients.css('p')
amounts = [ing.text() for ing in amount_nodes]

name_nodes = ingredients.css('div')
names = [ing.text() for ing in name_nodes]

print(names[0])

Here's what names looks like:

However, if I change my parsing logic to the following, everything works as expected:

amount_nodes = lax.css('div.List p')
amounts = [ing.text() for ing in amount_nodes]

name_nodes = lax.css('div.List div')
names = [ing.text() for ing in name_nodes]

Thoughts?

rushter commented 1 year ago

The CSS selector starts from the current node, so you are basically selecting the main div again.

mhillebrand commented 1 year ago

I'm afraid I don't understand. The current node (ingredients) is <div class="List">, correct? Why are <p> tags showing up when I invoke ingredients.css('div')?

rushter commented 1 year ago

Because that’s how CSS selectors work: you ask for any <p> tags at any level. When you select div, the first match is on the current level, so any elements inside it are ignored and not selected.

On Sun, 15 Oct 2023 at 20:47, mhillebrand @.***> wrote:

I'm afraid I don't understand. The current node (ingredients) is <div class="List">, correct? Why are <p> tags showing up when I invoke ingredients.css('div')?


mhillebrand commented 1 year ago

In the following code, names[0] and names2[0] are identical. I was under the assumption that chaining like lax.css_first('div.List').css('div') was supported. Is that an incorrect assumption?

lax = HTMLParser(html)

names = lax.css_first('div.List').css('div')
names2 = lax.css('div')

rushter commented 1 year ago

It's supported. The way it works is that the current node is included when the CSS selector is executed, so you start at <div class="List">.
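
To illustrate the difference between a search that includes the current node and one that only looks below it, here is a minimal sketch using Python's standard-library xml.etree.ElementTree rather than selectolax. This is only an analogy, not selectolax's implementation: Element.iter() yields the starting element itself, which mirrors the behavior described above, while findall(".//div") returns descendants only. The markup is the recipe snippet from the report trimmed to two ingredients, and deep_text is a hypothetical helper introduced for the example.

```python
import xml.etree.ElementTree as ET

# The recipe markup from the report, trimmed to two ingredients. It happens to
# be well-formed XML, so the standard library can parse it.
DOC = """
<html>
   <body>
      <div class="List">
         <p>3</p>
         <div>tablespoons butter</div>

         <p>1</p>
         <div>cup chopped onion</div>
      </div>
   </body>
</html>
"""

root = ET.fromstring(DOC)
ingredients = root.find(".//div[@class='List']")

# Element.iter() starts at the element itself, so the outer <div> is the first hit.
self_inclusive = list(ingredients.iter("div"))

# findall(".//div") only looks below the element, so the outer <div> is excluded.
descendants_only = ingredients.findall(".//div")


def deep_text(element):
    """Hypothetical helper: join all text under an element, collapsing whitespace."""
    return " ".join("".join(element.itertext()).split())


print(deep_text(self_inclusive[0]))
# 3 tablespoons butter 1 cup chopped onion
print([deep_text(el) for el in descendants_only])
# ['tablespoons butter', 'cup chopped onion']
```

The first result is the outer <div> itself, so its text is the concatenation of every <p> and <div> nested inside it, which matches the symptom in the original report. In selectolax, the descendant-only form that was reported to work in this thread is lax.css('div.List div').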

mhillebrand commented 1 year ago

Bummer. Okay, I guess I just won't use chaining. Thanks.

rushter / selectolax

Awful text parsing issue #101