j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Content extraction working sometimes, and failing sometimes for the same link #260

Closed: frankhubrepo closed this issue 3 years ago

frankhubrepo commented 3 years ago

Hello, I've run into this issue a couple of times now. Some URLs extract fine with Graby on one attempt and fail on another. Here is one such URL: https://www.sinchew.com.my/content/content_2469670.html

I have turned on the logs and here are both the fail log and the success log:

(Fail):

[2021-05-06 00:52:04] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 00:52:04] graby.INFO: . looking for site config for {host} in primary folder {"host":"sinchew.com.my"} []
[2021-05-06 00:52:04] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 00:52:04] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 00:52:04] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 00:52:04] graby.INFO: Cached site config with key: {key} {"key":"sinchew.com.my"} []
[2021-05-06 00:52:04] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 00:52:04] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 00:52:04] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 00:52:04] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 00:52:04] graby.INFO: Cached site config with key: {key} {"key":"sinchew.com.my.merged"} []
[2021-05-06 00:52:04] graby.INFO: Fetching url: {url} {"url":"https://www.sinchew.com.my/pad/con/content_2469670.html"} []
[2021-05-06 00:52:04] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.sinchew.com.my/pad/con/content_2469670.html"} []
[2021-05-06 00:52:04] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.sinchew.com.my/pad/con/content_2469670.html"} []
[2021-05-06 00:52:04] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.sinchew.com.my/pad/con/content_2469670.html"} []
[2021-05-06 00:52:07] graby.INFO: Meta refresh redirect found (http-equiv="refresh"), new URL: https://www.sinchew.com.my/pad/con/content_2469670.html?PageSpeed=noscript [] []
[2021-05-06 00:52:07] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.sinchew.com.my/pad/con/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 00:52:07] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.sinchew.com.my/pad/con/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 00:52:07] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.sinchew.com.my/pad/con/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 00:52:12] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.sinchew.com.my/pad/con/content_2469670.html?PageSpeed=noscript","body":"(only length for debug): 117555","headers":{"date":"Wed, 05 May 2021 22:52:08 GMT","content-type":"text/html; charset=UTF-8","transfer-encoding":"chunked","connection":"keep-alive","x-frame-options":"ALLOW-FROM http://newmedia.sinchew.com.my,http://www.sinchew.com.my","x-mod-pagespeed":"1.13.35.2-0","vary":"Accept-Encoding","cache-control":"max-age=0, no-cache, s-maxage=10","strict-transport-security":"max-age=31536000","cf-cache-status":"DYNAMIC","cf-request-id":"09e05448d1000002c6a2882000000001","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"64ad8987bd3f02c6-MIA"},"status":200}} []
[2021-05-06 00:52:12] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 00:52:12] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 00:52:12] graby.INFO: Returning cached and merged site config for {host} {"host":"sinchew.com.my"} []
[2021-05-06 00:52:12] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 00:52:12] graby.INFO: Attempting to extract content [] []
[2021-05-06 00:52:12] graby.INFO: Returning cached and merged site config for {host} {"host":"sinchew.com.my"} []
[2021-05-06 00:52:12] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 00:52:12] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 00:52:12] graby.INFO: Body size after Readability: {length} {"length":71670} []
[2021-05-06 00:52:12] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":{"og_image":"https://cdnpuc.sinchew.com.my/pad/pic/2021-05/01/t2_(2X17X440X269)fc24bb7c-ac17-4f7c-bbbb-554a5d01f29388d283fa-cc53-4c09-b668-6ff70ed61f33.jpg","og_image_width":"1200","og_image_height":"630","og_url":"https://www.sinchew.com.my/content/content_2469670.html","og_type":"article","og_title":"下周三开始提供接种 · AZ疫苗明起开放登记","og_description":"科学、工艺及革新部长凯里宣布,阿斯利康(AstraZeneca)疫苗将于明日开始,开放给自愿接种此疫苗的人士登记接种。"}} []
[2021-05-06 00:52:12] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 00:52:12] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 00:52:12] graby.INFO: title matched: {title} {"title":"下周三开始提供接种 · AZ疫苗明起开放登记"} []
[2021-05-06 00:52:12] graby.INFO: ...XPath match: {pattern} ["pattern","//meta[@property=\"og:title\"]/@content"] []
[2021-05-06 00:52:12] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 00:52:12] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 00:52:12] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 00:52:12] graby.INFO: Stripping {length} elements with inline display:none or visibility:hidden style {"length":1} []
[2021-05-06 00:52:12] graby.INFO: Using Readability [] []
[2021-05-06 00:52:12] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 00:52:12] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 00:52:12] graby.INFO: Extract failed [] []

Success:

[2021-05-06 11:59:17] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 11:59:17] graby.INFO: . looking for site config for {host} in primary folder {"host":"sinchew.com.my"} []
[2021-05-06 11:59:17] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 11:59:17] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 11:59:17] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 11:59:17] graby.INFO: Cached site config with key: {key} {"key":"sinchew.com.my"} []
[2021-05-06 11:59:17] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 11:59:17] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 11:59:17] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 11:59:17] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 11:59:17] graby.INFO: Cached site config with key: {key} {"key":"sinchew.com.my.merged"} []
[2021-05-06 11:59:17] graby.INFO: Fetching url: {url} {"url":"https://www.sinchew.com.my/content/content_2469670.html"} []
[2021-05-06 11:59:17] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.sinchew.com.my/content/content_2469670.html"} []
[2021-05-06 11:59:17] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.sinchew.com.my/content/content_2469670.html"} []
[2021-05-06 11:59:17] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.sinchew.com.my/content/content_2469670.html"} []
[2021-05-06 11:59:20] graby.INFO: Meta refresh redirect found (http-equiv="refresh"), new URL: https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript [] []
[2021-05-06 11:59:20] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 11:59:20] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 11:59:20] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript"} []
[2021-05-06 11:59:23] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript","body":"(only length for debug): 88129","headers":{"date":"Thu, 06 May 2021 09:59:20 GMT","content-type":"text/html; charset=UTF-8","transfer-encoding":"chunked","connection":"keep-alive","x-frame-options":"ALLOW-FROM http://newmedia.sinchew.com.my,http://www.sinchew.com.my","x-mod-pagespeed":"1.13.35.2-0","vary":"Accept-Encoding","cache-control":"max-age=0, no-cache, s-maxage=10","strict-transport-security":"max-age=31536000","cf-cache-status":"DYNAMIC","cf-request-id":"09e2b726f4000012aff1185000000001","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"64b15aeb2c6212af-MIA"},"status":200}} []
[2021-05-06 11:59:23] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 11:59:23] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 11:59:23] graby.INFO: Returning cached and merged site config for {host} {"host":"sinchew.com.my"} []
[2021-05-06 11:59:23] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 11:59:23] graby.INFO: Attempting to extract content [] []
[2021-05-06 11:59:23] graby.INFO: Returning cached and merged site config for {host} {"host":"sinchew.com.my"} []
[2021-05-06 11:59:23] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 11:59:23] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 11:59:23] graby.INFO: Body size after Readability: {length} {"length":46106} []
[2021-05-06 11:59:23] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":{"og_image":"https://cdnpuc.sinchew.com.my/pic/2021-05/01/t2_(2X17X440X269)fc24bb7c-ac17-4f7c-bbbb-554a5d01f29388d283fa-cc53-4c09-b668-6ff70ed61f33.jpg","og_image_secure_url":"https://cdnpuc.sinchew.com.my/pic/2021-05/01/t2_(2X17X440X269)fc24bb7c-ac17-4f7c-bbbb-554a5d01f29388d283fa-cc53-4c09-b668-6ff70ed61f33.jpg","og_image_width":"600","og_image_height":"315","og_url":"https://www.sinchew.com.my/content/content_2469670.html","og_type":"article","og_title":"下周三开始提供接种 · AZ疫苗明起开放登记","og_description":"科学、工艺及革新部长凯里宣布,阿斯利康(AstraZeneca)疫苗将于明日开始,开放给自愿接种此疫苗的人士登记接种。"}} []
[2021-05-06 11:59:23] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 11:59:23] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 11:59:23] graby.INFO: title matched: {title} {"title":"下周三开始提供接种 · AZ疫苗明起开放登记"} []
[2021-05-06 11:59:23] graby.INFO: ...XPath match: {pattern} ["pattern","//meta[@property=\"og:title\"]/@content"] []
[2021-05-06 11:59:23] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 11:59:23] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 11:59:23] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 11:59:23] graby.INFO: Stripping {length} elements with inline display:none or visibility:hidden style {"length":1} []
[2021-05-06 11:59:23] graby.INFO: Using Readability [] []
[2021-05-06 11:59:23] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 11:59:23] graby.INFO: Detecting body [] []
[2021-05-06 11:59:23] graby.INFO: Pruning content [] []
[2021-05-06 11:59:23] graby.INFO: Success ? {is_success} {"is_success":true} []
[2021-05-06 11:59:23] graby.INFO: Filtering HTML to remove XSS [] []
[2021-05-06 11:59:23] graby.INFO: Returning data (most interesting ones): {data} {"data":{"html":"(only length for debug): 2527","status":200,"title":"下周三开始提供接种 · AZ疫苗明起开放登记","language":null,"date":null,"authors":[],"url":"https://www.sinchew.com.my/content/content_2469670.html?PageSpeed=noscript","image":"https://cdnpuc.sinchew.com.my/pic/2021-05/01/t2_(2X17X440X269)fc24bb7c-ac17-4f7c-bbbb-554a5d01f29388d283fa-cc53-4c09-b668-6ff70ed61f33.jpg","native_ad":false,"headers":{"date":"Thu, 06 May 2021 09:59:20 GMT","content-type":"text/html; charset=UTF-8","transfer-encoding":"chunked","connection":"keep-alive","x-frame-options":"ALLOW-FROM http://newmedia.sinchew.com.my,http://www.sinchew.com.my","x-mod-pagespeed":"1.13.35.2-0","vary":"Accept-Encoding","cache-control":"max-age=0, no-cache, s-maxage=10","strict-transport-security":"max-age=31536000","cf-cache-status":"DYNAMIC","cf-request-id":"09e2b726f4000012aff1185000000001","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"64b15aeb2c6212af-MIA"}}} []

The fail log doesn't contain much helpful information. It just reports "is_success":false without saying why it failed. The only difference I could make out between the two is that in the success case it starts detecting the body after trying to get the date, while in the fail case it never logs "Detecting body".
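One way to make the difference between two runs easier to spot is to strip the timestamps and context payloads and compare the remaining message sequences. The following is a minimal Python sketch, assuming the log lines follow the layout shown above (`[timestamp] graby.INFO: message {context} []`). The names `messages` and `missing_steps` are hypothetical helpers, not part of Graby, and the sample text is the tail of the two logs above.

```python
import re

# "[2021-05-06 00:52:04] graby.INFO: <message and context>"
LINE_RE = re.compile(r"^\[[^\]]+\]\s+graby\.\w+:\s+(.*)$")
# Assumption: the context payload starts at the first ' {"' or ' [' after the message.
CONTEXT_RE = re.compile(r'\s(?:\{"|\[)')


def messages(log_text):
    """Return each line's message template, without timestamp or context."""
    result = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a graby log line
        rest = match.group(1)
        cut = CONTEXT_RE.search(rest)
        result.append(rest[:cut.start()] if cut else rest)
    return result


def missing_steps(success_log, fail_log):
    """Messages logged by the successful run that the failed run never logged."""
    seen = set(messages(fail_log))
    return [m for m in messages(success_log) if m not in seen]


FAIL_TAIL = """\
[2021-05-06 00:52:12] graby.INFO: Using Readability [] []
[2021-05-06 00:52:12] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 00:52:12] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 00:52:12] graby.INFO: Extract failed [] []
"""

SUCCESS_TAIL = """\
[2021-05-06 11:59:23] graby.INFO: Using Readability [] []
[2021-05-06 11:59:23] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 11:59:23] graby.INFO: Detecting body [] []
[2021-05-06 11:59:23] graby.INFO: Pruning content [] []
[2021-05-06 11:59:23] graby.INFO: Success ? {is_success} {"is_success":true} []
[2021-05-06 11:59:23] graby.INFO: Filtering HTML to remove XSS [] []
"""

if __name__ == "__main__":
    print(missing_steps(SUCCESS_TAIL, FAIL_TAIL))
    # → ['Detecting body', 'Pruning content', 'Filtering HTML to remove XSS']
```

This only shows which steps were skipped, not why. It compares message templates and ignores the context values, so a difference in the fetched URL would not show up here.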

I also tried creating a site config file for this site (and also for businesstimes.com.ng) as described here: https://doc.wallabag.org/en/user/errors_during_fetching.html. But the API seems to just ignore it; it doesn't make any difference.

j0k3r commented 3 years ago

It looks like it's not the same URL.
The first one is: https://www.sinchew.com.my/pad/con/content_2469670.html
The second one is: https://www.sinchew.com.my/content/content_2469670.html
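A check like this can be automated before comparing two runs. The following is a minimal Python sketch, assuming the "Fetching url" line keeps the format seen in the logs above, with the URL in a JSON context object. `fetched_url` is a hypothetical helper, not a Graby function.

```python
import json
import re
from urllib.parse import urlsplit

# Captures the JSON context of a "Fetching url: {url} {...} []" log line.
FETCH_RE = re.compile(r"Fetching url: \{url\} (\{.*\}) \[\]\s*$")


def fetched_url(log_text):
    """Return the URL from the first 'Fetching url' line, or None if absent."""
    for line in log_text.splitlines():
        match = FETCH_RE.search(line)
        if match:
            return json.loads(match.group(1))["url"]
    return None


FAIL_LINE = (
    '[2021-05-06 00:52:04] graby.INFO: Fetching url: {url} '
    '{"url":"https://www.sinchew.com.my/pad/con/content_2469670.html"} []'
)
SUCCESS_LINE = (
    '[2021-05-06 11:59:17] graby.INFO: Fetching url: {url} '
    '{"url":"https://www.sinchew.com.my/content/content_2469670.html"} []'
)

if __name__ == "__main__":
    fail_url = fetched_url(FAIL_LINE)
    success_url = fetched_url(SUCCESS_LINE)
    print(urlsplit(fail_url).path)     # → /pad/con/content_2469670.html
    print(urlsplit(success_url).path)  # → /content/content_2469670.html
    print(fail_url == success_url)     # → False
```

If the two URLs differ, the runs are not comparable, and the difference in outcome may simply come from the two pages having different markup.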

frankhubrepo commented 3 years ago

> It looks like it's not the same url. The first one is: https://www.sinchew.com.my/pad/con/content_2469670.html The second one is: https://www.sinchew.com.my/content/content_2469670.html

That really is my bad.

However, I am having trouble with the config files for another site (businesstimes.com.ng). Should I open another issue for it?

j0k3r commented 3 years ago

Yep, open another issue.