philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Floki.parse differs when using html5ever #236

Open andyleclair opened 4 years ago

andyleclair commented 4 years ago

Description

With the default mochiweb parser, Floki produces different output than it does with html5ever: with html5ever, the output of Floki.parse is wrapped in <html><head></head><body>...</body></html>.

To Reproduce

Steps to reproduce the behavior:

defmodule TestCases do
  # Each entry is {input_html, expected_sanitized_output}.
  # Elixir does not accept trailing commas in tuples or lists.
  @test_cases [
    {
      ~s[<a href="javascript:alert('XSS');">Click here</a>],
      ~s[<a href="#">Click here</a>]
    },
    {
      ~s[<a href="whatever" onclick="alert('XSS');">Click here</a>],
      ~s[<a href="whatever">Click here</a>]
    },
    {
      ~s[<body onload="alert('XSS')"><p>Hello</p></body>],
      ~s[<body><p>Hello</p></body>]
    },
    {
      ~s[<img src="javascript:alert('XSS');">],
      ~s[<img src="#"/>]
    },
    {
      ~s[<script>alert('XSS');</script>],
      ~s[]
    },
    {
      ~s[<body background="javascript:alert('XSS');"><p>Hello</p></body>],
      ~s[<body background="#"><p>Hello</p></body>]
    },
    {
      ~s[<style>body { background-image: expression('alert("XSS")'); }</style>],
      ~s[<style>body { background-image: removed_by_strip_js('alert("XSS")'); }</style>]
    },
    {
      ~s[<style>body { background-image: url('javascript:alert("XSS")'); }</style>],
      ~s[<style>body { background-image: url('removed_by_strip_js:alert("XSS")'); }</style>]
    },
    {
      ~s[<style><script>alert('XSS')</script></style>],
      ~s[<style><script>alert('XSS')</script></style>]
    },
    {
      ~s[<style> h1 > a { color: red; } </style>],
      ~s[<style> h1 > a { color: red; } </style>]
    },
    {
      ~s[<],
      ~s[&lt;]
    },
    {
      ~s[>],
      ~s[&gt;]
    },
    {
      ~s[],
      ~s[]
    }
  ]

  def test_cases, do: @test_cases
end

TestCases.test_cases |> Enum.map(fn {ins, _outs} -> Floki.parse(ins) end)

[
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [{"a", [{"href", "javascript:alert('XSS');"}], ["Click here"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [
          {"a", [{"href", "whatever"}, {"onclick", "alert('XSS');"}],
           ["Click here"]}
        ]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [{"onload", "alert('XSS')"}], [{"p", [], ["Hello"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [], [{"img", [{"src", "javascript:alert('XSS');"}], []}]}
     ]}
  ],
  [
    {"html", [],
     [{"head", [], [{"script", [], ["alert('XSS');"]}]}, {"body", [], []}]}
  ],
...
]

Expected behavior

I'd expect the output to match the output of calling this without the html5ever parser, that is, just the fragments themselves.

philss commented 4 years ago

@andyleclair Thank you for opening the issue.

This is a problem we have because we don't treat parsing fragments as something different, when we should. html5ever parses fragments as full documents because we (Floki) don't distinguish between the two when calling it.

I'm planning to add a Floki.parse_fragment, separate from the standard Floki.parse, because the HTML spec treats them as different algorithms. With this we can call the correct functions on html5ever's side.

This should be fixed once I finish the work on the internal parser (#204).

andyleclair commented 4 years ago

I see that this report got closed. Was there any resolution? We are currently handling the specific case of a fragment wrapped in the default wrapper, but I'd love to tear that code out.
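
For readers hitting the same behavior, a workaround of the kind described above could look roughly like the following. This is only a minimal sketch, not the code andyleclair's project uses: the module name FragmentUnwrap and its unwrap/1 function are hypothetical and not part of Floki. It assumes the wrapper always has the exact shape shown in the output earlier in this issue (one "html" node containing "head" and "body").

```elixir
defmodule FragmentUnwrap do
  @moduledoc """
  Hypothetical helper (not part of Floki): strips the
  <html><head></head><body>...</body></html> wrapper that appears
  around a fragment parsed through the html5ever backend.
  """

  # Matches only the wrapper shape shown in this issue: a single "html"
  # node whose children are exactly a "head" node and a "body" node.
  def unwrap([{"html", _html_attrs, [{"head", _, head}, {"body", _, body}]}]) do
    # Elements such as <script> end up under <head>, so keep both
    # child lists, head children first.
    head ++ body
  end

  # Any other tree (for example, mochiweb output) is returned untouched.
  def unwrap(tree), do: tree
end

wrapped = [
  {"html", [],
   [
     {"head", [], []},
     {"body", [], [{"a", [{"href", "whatever"}], ["Click here"]}]}
   ]}
]

IO.inspect(FragmentUnwrap.unwrap(wrapped))
```

Note the limits of this approach: attributes on the body node (as in the onload test case) are dropped, and an input that really was a full document cannot be told apart from a wrapped fragment.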

philss commented 4 years ago

@andyleclair It was not fixed. It's a known issue. I kept the issue pinned in the issues list, but I will leave it open too.

Matsa59 commented 1 year ago

Is this really a problem in Floki? After reading the code, I'm starting to think it comes from html5ever_elixir.