j0k3r / graby

Graby helps you extract article content from web pages
MIT License
363 stars, 73 forks

Site config file not working #261

Open frankhubrepo opened 3 years ago

frankhubrepo commented 3 years ago

I am trying to fetch the content from this article: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

Since that didn't work, I tried adding a site config file as described here: https://doc.wallabag.org/en/user/errors_during_fetching.html

This is the code within the config file:

title://body//h1[@class="headline"]

body://body//div[contains(@class, "field-type-text-with-summary")]

test_url: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

Even with the config file I still don't get the content, and I know the queries are right because they match in the browser console:

[two screenshots of the browser console]

Also here is the log:

[2021-05-06 19:40:53] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"businesstimes.com.sg.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg.merged"} []
[2021-05-06 19:40:53] graby.INFO: Fetching url: {url} {"url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:41:02] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report","body":"(only length for debug): 152622","headers":{"alt-svc":"clear","cache-control":"no-cache, no-store, must-revalidate","content-type":"text/html; charset=UTF-8","date":"Thu, 06 May 2021 17:40:53 GMT","expires":"0","istl-response":"1","pragma":"no-cache","referrer-policy":"no-referrer-when-downgrade, no-referrer-when-downgrade","server":"ECD (sgb/C7A3)","via":"1.1 google","x-ion-hop":"true","x-vmg-version":"v2.3.21","content-length":"152622"},"status":200}} []
[2021-05-06 19:41:02] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 19:41:03] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 19:41:03] graby.INFO: Attempting to extract content [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 19:41:03] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 19:41:03] graby.INFO: Body size after Readability: {length} {"length":96} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//body//h1[@class=\"headline\"]"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for body (content length: {content_length}) {"pattern":"//body//div[contains(@class, \"field-type-text-with-summary\")]","content_length":96} []
[2021-05-06 19:41:03] graby.INFO: Using Readability [] []
[2021-05-06 19:41:03] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 19:41:03] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 19:41:03] graby.INFO: Extract failed [] []

Any insight on what could be happening here or something I'm missing?

hwiorn commented 3 years ago

I recently tried to write site configs for a wallabag server and ran into an XPath problem like this one. You should check log/html.log. Graby uses php-readability to process the HTML, which strips and flattens many tags. This means the XPath expressions that work in a browser won't necessarily match the processed HTML, so you can't use them directly in a site config.

In my case, I wanted to extract the "real" author and the "real" title from an article on a website, but I got nothing after processing, even though the XPath expressions I used work correctly in Chrome and Firefox. I couldn't use https://siteconfig.fivefilters.org/ because it didn't show the CSS and XPath bar at the bottom when I tested those websites.

Add the debug settings to your test script (for example some-graby-test.php) and run it:

use Graby\Graby;

// Enable debug mode with verbose logging.
$graby = new Graby([
    'debug' => true,
    'log_level' => 'debug',
]);

Then inspect the log/html.log file.

j0k3r commented 2 years ago

The problem is that Graby is retrieving this HTML: response.html.txt. It is definitely not the HTML you are querying from your browser console.

Maybe the request needs a cookie. I've tried a few without success.
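
One way to check whether a site config pattern can match at all is to run it against the HTML the server actually returned, rather than the DOM shown in the browser. The following is a minimal sketch using PHP's DOM extension. `countXPathMatches` is a hypothetical helper, not part of Graby, and both HTML strings are made-up stand-ins, since the real response is not reproduced in this thread.

```php
<?php
// Hypothetical helper (not part of Graby): count the nodes that an XPath
// expression matches in a raw HTML string.
function countXPathMatches(string $html, string $xpath): int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);

    // query() returns false for a malformed expression.
    return $nodes === false ? 0 : $nodes->length;
}

// Made-up stand-in for the DOM seen in the browser.
$browserHtml = '<html><body><h1 class="headline">Title</h1>'
    . '<div class="field field-type-text-with-summary"><p>Article text</p></div>'
    . '</body></html>';
// Made-up stand-in for a response that lacks the article markup.
$fetchedHtml = '<html><body><div>Placeholder page</div></body></html>';

// The body pattern from the site config above.
$bodyXPath = '//body//div[contains(@class, "field-type-text-with-summary")]';

$browserMatches = countXPathMatches($browserHtml, $bodyXPath);
$fetchedMatches = countXPathMatches($fetchedHtml, $bodyXPath);

echo "browser DOM: $browserMatches match(es)\n";
echo "fetched HTML: $fetchedMatches match(es)\n";
```

To test a real response, the sample strings could be replaced with `file_get_contents()` on a saved copy of the fetched HTML. The file name and location are up to you. A count of zero there would suggest the pattern is fine but the fetched markup differs from what the browser renders.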