j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Graby ignores "http_header(user-agent)" in the site config of the target domain when "rewrite_url" is configured #210

Open shunf4 opened 5 years ago

shunf4 commented 5 years ago

Graby configuration (in services.yml on a Wallabag instance):

    wallabag_core.graby:
        class: Graby\Graby
        arguments:
            -
                http_client:
                    ua_browser: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
                    rewrite_url:
                        'weibo.com':
                            'weibo.com': 'weibo.cn'
                error_message: '%wallabag_core.fetching_error_message%'
                error_message_title: '%wallabag_core.fetching_error_message_title%'
            - "@wallabag_core.http_client"
            - "@wallabag_core.graby.config_builder"

Note that I use the Googlebot UA as the default and add a rewrite rule from "weibo.com" to "weibo.cn".

In site config weibo.cn.txt:

http_header(user-agent): Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
http_header(Cookie): SUB=***;

prune: no

author:string(//*[@id="M_"]/div[1]/a[1]/text())
date:string(//span[@class="ct"]/text())
title:string(//*[@id="M_"])

Note that I specify a normal browser UA here.

Graby logs when extracting https://weibo.com/5210843684/I2M8O5eCs :

2019-08-18 12:41:35
Graby is ready to fetch
2019-08-18 12:41:35
. looking for site config for weibo.com in primary folder
weibo.com
2019-08-18 12:41:35
Appending site config settings from global.txt
2019-08-18 12:41:35
. looking for site config for global in primary folder
global
2019-08-18 12:41:35
... found site config global.txt
global.txt
2019-08-18 12:41:35
Cached site config with key: weibo.com
weibo.com
2019-08-18 12:41:35
. looking for site config for global in primary folder
global
2019-08-18 12:41:35
... found site config global.txt
global.txt
2019-08-18 12:41:35
Appending site config settings from global.txt
2019-08-18 12:41:35
Cached site config with key: global
global
2019-08-18 12:41:35
Cached site config with key: weibo.com.merged
weibo.com.merged
2019-08-18 12:41:35
Fetching url: https://weibo.com/5210843684/I2M8O5eCs
https://weibo.com/5210843684/I2M8O5eCs
2019-08-18 12:41:35
Trying using method "get" on url "https://weibo.cn/5210843684/I2M8O5eCs"
get
https://weibo.cn/5210843684/I2M8O5eCs
2019-08-18 12:41:35
Use default user-agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" for url "https://weibo.cn/5210843684/I2M8O5eCs"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
https://weibo.cn/5210843684/I2M8O5eCs
2019-08-18 12:41:35
Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://weibo.cn/5210843684/I2M8O5eCs"
http://www.google.co.uk/url?sa=t&source=web&cd=1
https://weibo.cn/5210843684/I2M8O5eCs
2019-08-18 12:41:35
Data fetched: [array]
........

Graby used the default UA (Googlebot) instead of the one specified in the weibo.cn.txt site config.
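
To make the suspected ordering problem concrete, here is a minimal Python sketch of what the logs suggest. It is only a model of the behavior, not Graby's actual (PHP) code. The names `rewrite`, `pick_user_agent`, and the `lookup_after_rewrite` flag are hypothetical, and treating rewrite_url as a plain string replacement on the URL is my assumption. The logs show the site config being looked up for weibo.com while the request goes to weibo.cn, so the weibo.cn.txt headers never seem to be consulted.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for the settings quoted in this issue.
DEFAULT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
REWRITE_URL = {"weibo.com": {"weibo.com": "weibo.cn"}}
SITE_CONFIGS = {
    "weibo.cn": {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    },
}


def rewrite(url):
    """Apply the rewrite rules registered for the URL's host (assumed: string replace)."""
    host = urlsplit(url).hostname
    for search, replace in REWRITE_URL.get(host, {}).items():
        url = url.replace(search, replace)
    return url


def pick_user_agent(url, lookup_after_rewrite):
    """Return (url actually fetched, user-agent sent) under the chosen lookup order."""
    target = rewrite(url)
    # The observed behavior corresponds to looking up the config by the ORIGINAL host.
    config_host = urlsplit(target if lookup_after_rewrite else url).hostname
    headers = SITE_CONFIGS.get(config_host, {})
    return target, headers.get("user-agent", DEFAULT_UA)


url = "https://weibo.com/5210843684/I2M8O5eCs"
observed = pick_user_agent(url, lookup_after_rewrite=False)  # Googlebot UA, as in the logs
expected = pick_user_agent(url, lookup_after_rewrite=True)   # browser UA from weibo.cn.txt
```

In this model, `observed` carries the default Googlebot UA while `expected` carries the browser UA, and both fetch the weibo.cn URL. What I would expect from Graby is the second behavior: the headers of the rewritten domain's site config should apply.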