j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Article content inside a javascript file #254

Closed mart-e closed 3 years ago

mart-e commented 3 years ago

Hi,

I am trying to fetch the content of the following page, https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle, but the page source is pretty minimal:

<!doctype html><html lang="fr">
...
<body><div id="app"></div><script src="/index.js"></script><script src="javascript.js"></script></body></html>

The whole article is actually inside the minified javascript.js file, alongside all the other articles. I am not sure why it was built this way, but it means the JavaScript has to be evaluated to retrieve the content. I guess this changes a lot how graby works...

Edit: apparently, it's a React website: https://github.com/laem/blog
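
For anyone who wants to detect this kind of page before handing it to an extractor, here is a rough sketch in Python (standard library only). It is not part of graby; the class name, the function name and the 200-character threshold are all assumptions made for illustration. It simply counts the text found outside script and style tags and flags pages that load scripts but contain almost no text.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count characters of text that sit outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader could actually see.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_like_js_shell(html, min_text_chars=200):
    """Heuristic (hypothetical threshold): scripts present, almost no text."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.script_tags > 0 and parser.text_chars < min_text_chars
```

Fed the body shown above (an empty `<div id="app">` followed by two script tags), `looks_like_js_shell` returns `True`. A page with a long article in its markup returns `False`. This is only a heuristic and the threshold would need tuning on real pages.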

j0k3r commented 3 years ago

> I guess this changes a lot how graby works...

You're right. For now, we only use the HTML returned by the server to find the content. To find the content in the evaluated HTML, we would need a headless browser, and that means a lot of changes to graby.

mart-e commented 3 years ago

Yeah, I guess that won't happen in the near future. I will try to find alternatives, like an RSS feed or something else machine-readable.

laem commented 3 years ago

Thanks @mart-e for the mention, it's my blog :angel:

There may be a simple solution. The blog is hosted on Netlify, which runs something like prerender.io (it needs to be activated in the options, which I did).

This means Netlify serves a prerendered version of the HTML to some user agents.

You can check this simply by running these two commands:

curl https://kont.me/éloge-décroissance-individuelle
curl -A twitterbot https://kont.me/éloge-décroissance-individuelle

The output should be a couple of lines for the first one, a whole HTML article for the second one.
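
The same check can be scripted. Below is a minimal sketch in Python (standard library only) that sends the request twice with different User-Agent headers and compares the response sizes. The helper names and the `curl/7.68.0` string (standing in for curl's default user agent) are assumptions for illustration, and the result depends on the site still having prerendering enabled.

```python
import urllib.request

# Percent-encoded form of the article URL, as it appears earlier in this thread.
URL = "https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle"


def build_request(url, user_agent):
    """Build a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_size(url, user_agent, timeout=10):
    """Return the size in bytes of the response body for this user agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return len(response.read())


def compare(url=URL):
    """Print the response size for a plain client and for twitterbot."""
    # "curl/7.68.0" is a hypothetical stand-in for a non-crawler user agent.
    for agent in ("curl/7.68.0", "twitterbot"):
        print(agent, fetch_size(url, agent), "bytes")
```

Calling `compare()` performs the two network requests. If prerendering is active, the second size should be much larger than the first.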

Netlify documents this behaviour here: https://answers.netlify.com/t/support-guide-understanding-and-debugging-prerendering/150

In particular:

There are a few dozen user agents that a request can present that will cause a prerendered response, for instance Twitterbot and facebookexternalhit/1.0 . All crawlers that we are aware of which need prerendering receive this treatment automatically, but if yours seems to be missing, please mention it below so we can potentially add it to the list!

So the solution might just be to add Wallabag or Graby's user agent to the list :)

mart-e commented 3 years ago

Interesting, thanks @laem for the tip!

In my logs, I see

[2021-04-08 13:42:12] graby.INFO: Trying using method "get" on url "https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle" {"method":"get","url":"https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle"} []
[2021-04-08 13:42:12] graby.INFO: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle"} []
[2021-04-08 13:42:12] graby.INFO: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://kont.me/%C3%A9loge-d%C3%A9croissance-individuelle"} []

@j0k3r do you support a custom user-agent? I see it can be done in https://help.fivefilters.org/full-text-rss/site-patterns.html#pattern-format

I tested it on https://f43.me/feed/test with the custom config

http_header(User-agent): twitterbot

and the content was retrieved correctly. I will submit a PR to https://github.com/fivefilters/ftr-site-config

Kdecherf commented 3 years ago

@mart-e yes, http_header(User-agent) is supported by graby :)

mart-e commented 3 years ago

Great, thanks everybody for the quick reactions and tips!