j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Graby.php fails with PREG_JIT_STACKLIMIT_ERROR #223

Closed janxbar closed 4 years ago

janxbar commented 4 years ago

Hi,

Graby fails on Oracle blog entries, for example this blog entry. The reason is that line 1138 is too long for JITed PCRE. Graby.php calls preg_replace() at line 315, and the call fails with PREG_JIT_STACKLIMIT_ERROR. You should check the error code with preg_last_error(). The problem can be worked around with ini_set('pcre.jit', false), but there might be a better solution.
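A minimal sketch of the check described above, not Graby's actual fix. The helper name safePregReplace is hypothetical. It assumes PHP 7.0 or later (where PREG_JIT_STACKLIMIT_ERROR exists) and that changing pcre.jit at runtime takes effect for the retried call.

```php
<?php

/**
 * Hypothetical helper (not part of Graby): run preg_replace() and retry
 * once with the PCRE JIT disabled if the JIT stack limit was hit.
 */
function safePregReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    // preg_replace() returns null on error; preg_last_error() says why.
    if (null === $result && PREG_JIT_STACKLIMIT_ERROR === preg_last_error()) {
        $previous = ini_get('pcre.jit');

        // Assumption: the runtime setting applies to the next call.
        ini_set('pcre.jit', '0');
        $result = preg_replace($pattern, $replacement, $subject);

        // Restore the earlier setting so other code keeps using the JIT.
        if (false !== $previous) {
            ini_set('pcre.jit', $previous);
        }
    }

    if (null === $result) {
        throw new RuntimeException('preg_replace() failed with error code ' . preg_last_error());
    }

    return $result;
}
```

The point of the wrapper is that a null result is never passed on silently as empty content; it is either recovered by the retry or reported as an exception.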

Thank you for your support. I am using Graby in the Wallabag application and I cannot extract the blog text.

Kind regards, Jan

j0k3r commented 4 years ago

Could you please share a stack trace, your PHP version, etc.? I cannot reproduce the bug on my machine.

janxbar commented 4 years ago

Thank you for your time. I am using app.wallabag.it. Wallabag cannot store the blog entry above (in fact, most Oracle blog entries) and says "wallabag can't retrieve contents for this article". So I followed the troubleshooting guide and tried the blog entry with https://f43.me/feed/test; the data is lost after the "HTML after regex empty nodes stripping" step. Then I checked the code, tested possible causes on my hosting, and found the bug above. The PHP version there is listed at https://www.webhosting-c4.cz/php71info (ano = yes, zapnuto = on). Unfortunately, phpinfo() is disabled.

Kind regards, Jan

j0k3r commented 4 years ago

You were right! See https://3v4l.org/YWq85

I'll fix it

janxbar commented 4 years ago

Thank you. I hope I will see the fix on wallabag.it soon. Jan

nicosomb commented 4 years ago

@j0k3r Could you please release a new version?

j0k3r commented 4 years ago

@nicosomb https://github.com/j0k3r/graby/releases/tag/2.2.0

nicosomb commented 4 years ago

wallabag.it was updated.

But I have this error:

[2020-04-22 16:41:15] app.ERROR: Error: cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received when sending request: GET https://blogs.oracle.com/dave/java-contended-annotation-to-help-reduce-false-sharing 1.1 {"request":"[object] (GuzzleHttp\Psr7\Request: {})","exception":"[object] (Http\Client\Exception\NetworkException(code: 0): cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received at /var/www/wallabag.it/app2019/vendor/php-http/guzzle5-adapter/src/Client.php:116, GuzzleHttp\Exception\ConnectException(code: 0): cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received at /var/www/wallabag.it/app2019/vendor/guzzlehttp/guzzle/src/Exception/RequestException.php:49, GuzzleHttp\Ring\Exception\ConnectException(code: 0): cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received at /var/www/wallabag.it/app2019/vendor/guzzlehttp/ringphp/src/Client/CurlFactory.php:126)","milliseconds":10004} []

janxbar commented 4 years ago

Thank you for the quick response and update. I don't understand what the problem could be; the blog post opens in Chrome without problems. I checked with Chrome developer tools: the HTML returned is OK and arrives within 100 ms. What headers does your cURL pass with the GET request? Maybe Oracle blocks you? Could you test command-line curl on the wallabag machine? Is the machine (your provider) blocked? I checked other blog entries; they fail too. Have a nice day, Jan

nicosomb commented 4 years ago

curl -L -I https://blogs.oracle.com/dave/java-contended-annotation-to-help-reduce-false-sharing

HTTP/2 403 ...

And the webpage content:

       <p align="justify"><font face="Arial, Helvetica, sans-serif">This site https://blogs.oracle.com/dave/java-contended-annotation-to-help-reduce-false-sharing >            >            </font><font face="Arial, Helvetica, sans-serif"> is experiencing technical difficulty. We are aware of the issue and are working as quick as possible to correct the issue. <br />
       <br />
       We apologize for any        inconvenience this may have caused. <br />
       <br />
       To speak with an Oracle sales representative: 1.800.ORACLE1.<br />
       <br />
       To contact Oracle Corporate Headquarters from anywhere in the world: 1.650.506.7000.<br />
       <br />
       To get technical support in the United States: 1.800.633.0738. </font><br />

😢

janxbar commented 4 years ago

It seems wallabag is served from a different Oracle server zone than I am. I can see the blog entry without problems. I will retest later and let you know. Thank you for your time.

Jan

nicosomb commented 4 years ago

My web server is located in Germany.

janxbar commented 4 years ago

Just curious: from my PC, blogs.oracle.com is served by e870.dscx.akamaiedge.net [23.9.1.182], and I am in Prague. BTW, I miss tag autocomplete in the wallabag web interface. It works in the Android app and the Chrome plugin, but not on the web itself. Do you plan to implement it? I also miss full-text search. I would like to help with development, but PHP is not my cup of tea; I do Java for a living and for fun. Jan

janxbar commented 4 years ago

Hi, for me this works:

curl 'https://blogs.oracle.com/dave/java-contended-annotation-to-help-reduce-false-sharing' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'accept-language: en-US,en;q=0.9,cs;q=0.8' \
  --compressed \
  -L

Please note the missing -I and the added --compressed. It seems the server does some header checks.

Jan
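For reference, the command-line call in the comment above could be mirrored in PHP roughly as follows. This is only a sketch: it assumes the cURL extension is loaded, the function names are hypothetical, and nothing here is confirmed to be how Graby or wallabag builds its requests. Whether these headers are what the server actually checks is also unverified.

```php
<?php

/**
 * Hypothetical helper: cURL options that mirror the curl command above
 * (browser-like headers, redirects followed, compressed responses accepted).
 */
function browserLikeCurlOptions(string $url): array
{
    return [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // same as -L
        CURLOPT_ENCODING => '',         // same as --compressed: accept all supported encodings
        CURLOPT_HTTPHEADER => [
            'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
            'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-language: en-US,en;q=0.9,cs;q=0.8',
        ],
    ];
}

/**
 * Hypothetical helper: fetch a page with the options above.
 * Returns the body, or null if the transfer failed.
 */
function fetchLikeBrowser(string $url): ?string
{
    $handle = curl_init();
    curl_setopt_array($handle, browserLikeCurlOptions($url));
    $body = curl_exec($handle);

    return false === $body ? null : $body;
}
```

Calling fetchLikeBrowser() with the blog URL issues a normal GET rather than the HEAD request that -I produces, which matches the difference pointed out in the comment.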