philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Not all meta tags are found #349

Closed by rouven-uncap 3 years ago

rouven-uncap commented 3 years ago

Description

When parsing the HTML of a tweet page, Floki returns only the first 11 <meta> tags. In the document from twitter.com below, several <script> and <link> tags follow those, and then additional <meta> tags appear. Floki does not find these later tags.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

All <meta> tags should be returned by the query.

philss commented 3 years ago

@rouven-uncap from what I can see, they are all returned:

[Screenshot: floki-twitter-meta]

Maybe Twitter returns a different response when the page is downloaded with HTTPoison. I downloaded the file with curl and then opened it in iex. Please try again.

rouven-uncap commented 3 years ago

If you look further down, below the link and script tags, there are more meta tags, which are not returned in your screenshot:

[Screenshot]

rouven-uncap commented 3 years ago

You are correct about curl. Maybe those extra meta tags are injected into the DOM via JS after the page is loaded, which completely defeats the purpose of meta tags...
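
The JS-injection explanation can be checked with a small sketch. It assumes a hypothetical page (not the actual twitter.com markup) where two <meta> tags are in the served HTML and a third is only created by an inline script. The script uses `Mix.install/1` to fetch Floki, so it needs network access on first run.

```elixir
Mix.install([:floki])

# Hypothetical page, for illustration only: two <meta> tags are part of the
# served markup, and a third would be created by the inline script if a
# browser executed it.
html = """
<html>
  <head>
    <meta charset="utf-8">
    <script>
      var m = document.createElement("meta");
      m.name = "description";
      document.head.appendChild(m);
    </script>
    <link rel="stylesheet" href="style.css">
    <meta name="viewport" content="width=device-width">
  </head>
  <body></body>
</html>
"""

# Parse the static markup and select every <meta> node, including the one
# that comes after the <script> and <link> tags.
{:ok, document} = Floki.parse_document(html)
metas = Floki.find(document, "meta")

IO.inspect(length(metas), label: "meta tags found")
IO.inspect(Floki.attribute(metas, "name"), label: "name attributes")
```

Floki parses the markup it is given and does not execute JavaScript, so the script-created tag never appears in the parsed tree, while the <meta> tag after the <script> and <link> tags does. If tags seen in a browser's DOM inspector are missing from the Floki result, comparing against the raw response body (for example the file saved by curl) is a quick way to tell whether they were in the served HTML at all.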