philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Unexpected behaviour when the source uses tabs #210

Closed Gurimarukin closed 5 years ago

Gurimarukin commented 5 years ago

Hi!

I came across a strange case where selectors weren't working as expected:

def scrap do
  # The \t escapes stand in for the literal tab characters in the original page.
  body = """
    <div class="a\t\t">a</div>
  """

  body |> Floki.find(".a")
end
# => []

If we first collapse every run of whitespace (tabs included) into a single space:

def scrap do
  # The \t escapes stand in for the literal tab characters in the original page.
  body = """
    <div class="a\t\t">a</div>
  """

  Regex.replace(~r/\s+/, body, " ") |> Floki.find(".a")
end
# => [{"div", [{"class", "a "}], ["a"]}]

I admit the class attribute isn't well formatted in this case. But that's how it appears in the page I'm scraping, so I thought it would be worth sharing here.

philss commented 5 years ago

@Gurimarukin thank you for opening the issue, and sorry for the delay.

I think it's related to how we split the attribute value here: https://github.com/philss/floki/blob/master/lib/floki/selector/attribute_selector.ex#L41
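
To illustrate the kind of splitting difference that could produce this behaviour, here is a minimal sketch. It is not Floki's actual code: the module and function names are hypothetical, and it only assumes that class matching works by splitting the attribute value into tokens. Splitting on a literal space leaves the tabs attached to the class name, while `String.split/1` splits on any whitespace.

```elixir
defmodule ClassMatchSketch do
  # Hypothetical helper, not part of Floki: splits only on the literal space character.
  def has_class_naive?(attr_value, class_name) do
    attr_value |> String.split(" ") |> Enum.member?(class_name)
  end

  # Hypothetical helper: String.split/1 splits on any whitespace (tabs included)
  # and ignores leading and trailing whitespace.
  def has_class_whitespace?(attr_value, class_name) do
    attr_value |> String.split() |> Enum.member?(class_name)
  end
end

ClassMatchSketch.has_class_naive?("a\t\t", "a")
# => false
ClassMatchSketch.has_class_naive?("a ", "a")
# => true
ClassMatchSketch.has_class_whitespace?("a\t\t", "a")
# => true
```

This would match what the report shows: the selector finds the node once the tabs are replaced by a space, but not while they are still tabs.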

fcapovilla commented 5 years ago

I'll try to fix this issue. :)

philss commented 5 years ago

@fcapovilla Thank you for the PR! :purple_heart:

I'm going to test it this week. It looks like it fixes the problem :)

Gurimarukin commented 5 years ago

It seems that HTTPoison changed something in the meantime, because I can no longer reproduce the bug on the body it returns. (The scraped page still has the weird tabs.)
Anyway, thanks for the answer!

philss commented 5 years ago

Closed by #226. Thank you!