philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.07k stars 156 forks source link

:not pseudo-class attribute selector breaks Floki.find/2 with comma-separated selectors #441

Closed florish closed 1 year ago

florish commented 1 year ago

Description

When using Floki.find/2, I found an issue with :not in combination with attribute selectors (e.g. :not([class]) or :not([rel="nofollow"])) and using multiple selectors as a comma-separated string (e.g. a:not([rel="nofollow"]), link[href]).

Floki.find/2 stops returning matches for all CSS selectors after the first :not([attr]) pseudo-class selector in the list of CSS selectors.

To Reproduce

Steps to reproduce the behaviour:

Expected behaviour

I would expect the last Floki.find/2 call to return both the br and hr elements:

> Floki.parse_fragment!("<br/><hr/>") |> Floki.find("br:not([class]), hr")
[{"br", [], []}, {"hr", [], []}]

To be sure: I'm certainly willing to open a PR, but I'd need a bit of guidance as I haven't worked with the Floki source code before.

philss commented 1 year ago

Hey @florish :wave: Thanks for open the issue!

Yeah, this is really strange. For some reason the parsing of the :not pseudo-class is also including the hr.

You can observe that by inspecting the value of pseudo_class here: https://github.com/philss/floki/blob/57c07e00d0983928546de0853d007e16827fbe8d/lib/floki/selector.ex#L197

I think the problem may be at the selector parser, around the do_parse_pseudo_not/2 function: https://github.com/philss/floki/blob/57c07e00d0983928546de0853d007e16827fbe8d/lib/floki/selector/parser.ex#L71

There are tests for this module in here: https://github.com/philss/floki/blob/57c07e00d0983928546de0853d007e16827fbe8d/test/floki/selector/parser_test.exs

Please let me know if you need more guidance. (I also can handle this later this week if you prefer).

florish commented 1 year ago

Hi @philss! Thanks for the quick and detailed response.

After a quick look into the linked code, my conclusion for now is that "I also can handle this later this week" is indeed what I prefer 😄. I'm afraid it'll take me longer than that before I have the time to get this fixed.

This being said, I think I've pinpointed the exact test assertion that should be changed:

https://github.com/philss/floki/blob/57c07e00d0983928546de0853d007e16827fbe8d/test/floki/selector/parser_test.exs#L311

This test is asserting the return value for Parser.parse("a.foo:not([style*=crazy], .bar)") and currently (wrongly!) expects the value to be a list that only contains a %Selector{} for type: "a". In reality, there should also be a second selector matching .bar.

I'm not entirely sure what this .bar %Selector{} should contain, but I hope this saves you some research time.

Thanks again for getting back so quickly, greatly appreciated!

philss commented 1 year ago

This test is asserting the return value for Parser.parse("a.foo:not([style*=crazy], .bar)") and currently (wrongly!) expects the value to be a list that only contains a %Selector{} for type: "a". In reality, there should also be a second selector matching .bar.

In this case, the second selector is going to be inside the "not" pseudo-class - we should have two selectors inside this pseudo-class, and one outside.

@florish thanks for the tips and for opening the issue! :purple_heart:

I should release a new version soon.