philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

nth-child selector not working as expected #550

Closed bcardarella closed 3 months ago

bcardarella commented 6 months ago

Floki: v0.36.0

I ran into unexpected behavior when using the nth-child selector today:

header = [
  {"tr",
   [
     {"class", "headerRow"},
     {"data-colval", "A"},
     {"data-coldb", "fleeta"},
     {"data-flt", "A"},
     {"data-chid", "631"}
   ],
   [
     {"td",
      [{"class", "fleet1"}, {"style", "text-align:left"}, {"colspan", "40"}],
      [
        {"span",
         [{"style", "display: flex; justify-content: space-between; gap:8px;"}],
         [
           {"span", [{"class", "headRowStart"}], ["Class: A "]},
           "Start: 18:45:00",
           {"span", [],
            ["Race Len.: ", {"span", [{"class", "raceLength"}], ["3.42"]}]},
           {"span", [], ["Course Desc: 22-19-17-HB"]},
           {"span", [], [{"span", [], ["Rating Type:"]}, "RANDOM LEG - Light"]},
           {"span", [],
            [
              "# of Racers: 9 ",
              {"input",
               [{"type", "hidden"}, {"name", "fleet_row"}, {"value", "A"}], []}
            ]},
           {"span", [],
            [
              "   # of Entries: 14    ",
              {"input",
               [{"type", "hidden"}, {"name", "fleet_row"}, {"value", "A"}], []}
            ]}
         ]}
      ]}
   ]}
]

Floki.find(header, "td span span:nth-child(1)")

This results in:

[
  {"span", [{"class", "headRowStart"}], ["Class: A "]},
  {"span", [{"class", "raceLength"}], ["3.42"]},
  {"span", [], ["Rating Type:"]}
]

However, in the browser, if I run a similar selector on the same fragment:

$0.querySelectorAll('td span span:nth-child(1)')

results correctly in:

<span class="headRowStart">
  <span class="condense" data-dbname="fleeta" data-hideval="A">−</span>
  <span class="expand" data-dbname="fleeta" data-hideval="A" style="display:none">+</span>
  Class: A
</span>

I realize Floki is using Mochi under the hood. But before I go too far down the rabbit hole, I wanted to check whether this behavior is expected for Floki. If not, I will continue to dig and isolate where the issue is.

bcardarella commented 6 months ago

:first-child is producing the same result, which I would expect, but I wanted to confirm.

bcardarella commented 6 months ago

Using the child combinator (>) works:

Floki.find(header, "td > span > span:nth-child(1)")

and I believe that Floki is actually correct. So... is Chrome wrong?

ypconstante commented 6 months ago

Can you double-check whether this is the actual HTML the browser is receiving? This is the raw HTML for the example you shared:

> header |> Floki.raw_html(pretty: true) |> IO.puts
<tr class="headerRow" data-colval="A" data-coldb="fleeta" data-flt="A" data-chid="631">
  <td class="fleet1" style="text-align:left" colspan="40">
    <span style="display: flex; justify-content: space-between; gap:8px;">
      <span class="headRowStart">
        Class: A
      </span>
      Start: 18:45:00
      <span>
        Race Len.:
        <span class="raceLength">
          3.42
        </span>
      </span>
      <span>
        Course Desc: 22-19-17-HB
      </span>
      <span>
        <span>
          Rating Type:
        </span>
        RANDOM LEG - Light
      </span>
      <span>
        # of Racers: 9
        <input type="hidden" name="fleet_row" value="A"/>
      </span>
      <span>
        # of Entries: 14
        <input type="hidden" name="fleet_row" value="A"/>
      </span>
    </span>
  </td>
</tr>

Putting this HTML in the browser and running the queries above gives the same results in Floki, Firefox, and Chrome. Since <span class="raceLength">3.42</span> and <span>Rating Type:</span> are the first children of their parents, they are expected to be in the find response.

bcardarella commented 6 months ago

Unfortunately, yes, the HTML is from an actual site, and it is horrible.

bcardarella commented 6 months ago

Source: https://regattaman.com/results.php?yr=2023&race_id=378&rnum=0&eid=378&sort=0&ssort=12&sdir=true&ssdir=true

I guess a case could be made for not supporting poorly written markup.

ypconstante commented 6 months ago

I think you're checking a different element from the one you shared.

The HTML structure changes between tables: depending on the displayed data, there are nested spans. But comparing the same entries in Floki and Firefox, the results are the same.

You'll need to use td > span > span:nth-child(1) to avoid issues when there are nested spans
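
A minimal sketch of the difference, using a reduced, hypothetical tree rather than the real page markup (the class names `first` and `nested` and the `classes` helper are made up for illustration, and it assumes the script is run standalone with network access so `Mix.install` can fetch Floki):

```elixir
# Assumption: run as a standalone script, e.g. `elixir sketch.exs`.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical tree: an outer span with two child spans,
# the second of which wraps a nested span after some text.
tree = [
  {"td", [],
   [
     {"span", [],
      [
        {"span", [{"class", "first"}], ["Class: A"]},
        {"span", [], ["Race Len.: ", {"span", [{"class", "nested"}], ["3.42"]}]}
      ]}
   ]}
]

# Descendant combinator: any span that is the first element child of its
# parent and sits somewhere below a span that sits somewhere below a td.
descendant = Floki.find(tree, "td span span:nth-child(1)")

# Child combinator: only spans that are direct children of a span that is
# itself a direct child of the td.
child_only = Floki.find(tree, "td > span > span:nth-child(1)")

# Illustration-only helper: collect the class attribute of each matched node.
classes = fn nodes ->
  for {_tag, attrs, _children} <- nodes, {"class", value} <- attrs, do: value
end

IO.inspect(classes.(descendant), label: "descendant")
IO.inspect(classes.(child_only), label: "child only")
```

Going by the results reported earlier in this thread, the descendant query should return both the `first` and the `nested` span, because each is the first element child of its own parent, while the child-combinator query should return only `first`.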

bcardarella commented 6 months ago

Yes I noted that above https://github.com/philss/floki/issues/550#issuecomment-1975140099

philss commented 3 months ago

Sorry for not being active here. And thank you @ypconstante for the research and replies! ❤️

@bcardarella thanks for the info as well! I believe there is nothing to do here, right? I'm closing now.