s9e / TextFormatter

Text formatting library that supports BBCode, HTML and other markup via plugins. Handles emoticons, censors words, automatically embeds media and more.
MIT License

phpbb fb.watch #165

Closed elchio closed 3 years ago

elchio commented 3 years ago

Hi. I am using the latest files from this repo and can't figure out how to get fb.watch URLs to parse. URLs like facebook.com... are parsed OK.

All caches are cleared. phpBB is 3.3.2 and has the latest media embed plugin installed.

JoshyPHP commented 3 years ago

What URL did you try?

elchio commented 3 years ago

No URL of this kind works: https://fb.watch/4vjVYg3uQX/

JoshyPHP commented 3 years ago

I can confirm that it works fine at the library level.

$configurator = new s9e\TextFormatter\Configurator;
$configurator->MediaEmbed->add('facebook');

extract($configurator->finalize());

$text = 'https://fb.watch/4vjVYg3uQX/';
$xml  = $parser->parse($text);

die("$xml\n");

<r><FACEBOOK id="471206254237838" type="v" user="razproza">https://fb.watch/4vjVYg3uQX/</FACEBOOK></r>

If it doesn't work for you inside of phpBB, it could be for any number of other reasons. Maybe your server isn't configured to allow scraping, maybe something inside of phpBB or your extension prevents it from working, maybe your installation is using outdated files. Make sure that your web host has enabled cURL in PHP or ask them about it. If nothing works, ask in the phpBB extension's forums.
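
As a rough way to check the cURL point above, the following sketch reports which HTTP transports PHP can use on the host. It is only an illustration: the `allow_url_fopen` and OpenSSL checks are assumptions that matter only if a stream-based fallback is in use, and a "yes" here does not prove that outbound requests are allowed by the host's firewall.

```php
<?php
// Report which HTTP transports this PHP installation can use.
// Labels and the choice of checks are illustrative assumptions.
$checks = [
    'curl extension loaded'    => extension_loaded('curl'),
    // curl_init() can be missing if it is listed in disable_functions
    'curl_init() available'    => function_exists('curl_init'),
    'allow_url_fopen enabled'  => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
    'openssl extension loaded' => extension_loaded('openssl'),
];

foreach ($checks as $label => $ok)
{
    printf("%-26s %s\n", $label, $ok ? 'yes' : 'no');
}
```

Run it through the web server rather than the command line if possible, since the two can load different php.ini files.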

If you use the original Facebook URL instead of the fb.watch short link, it may work better for you too.

elchio commented 3 years ago

Thanks - I will check it further then.

EDIT: All other URLs parse fine (YouTube, facebook.com, all streaming, etc.).

elchio commented 3 years ago

You were right - the problem is definitely only with content that needs to be scraped - other URLs that require scraping don't work either. cURL is enabled, the cache directory is writable and there are no errors in the Apache/PHP logs. I made a clean new phpBB installation and still no luck - same error.

I checked that the URL is in fact properly detected by the parser, but no data comes back from the external source.

I know that this is a server issue, but if you have any ideas about what I could check I would be really grateful.

If not, thanks anyway for your help.

JoshyPHP commented 3 years ago

You can try running the following script on your server, and if you post the output I'll look at it for clues.

$handle = curl_init();
curl_setopt($handle, CURLOPT_ENCODING,       '');
curl_setopt($handle, CURLOPT_FAILONERROR,    true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_TIMEOUT,        10);
curl_setopt($handle, CURLOPT_HTTPGET,        true);
curl_setopt($handle, CURLOPT_HTTPHEADER,     ['User-agent: PHP (not Mozilla)']);
curl_setopt($handle, CURLOPT_URL,            'https://fb.watch/4vjVYg3uQX/');

foreach ([false, true] as $v)
{
    var_dump(microtime(true));
    curl_setopt($handle, CURLOPT_HEADER, $v);
    var_dump(curl_exec($handle));
}
var_dump(microtime(true));

elchio commented 3 years ago

float(1625831396.8337)
string(253637) "

and then:

float(1625831131.6066)
string(255354) "HTTP/1.1 302 Found
Location: https://www.facebook.com/watch/?v=471206254237838
x-fb-rlafr: 0
Pragma: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
Expires: Sat, 01 Jan 2000 00:00:00 GMT
X-Content-Type-Options: nosniff
X-XSS-Protection: 0
X-Frame-Options: DENY
Strict-Transport-Security: max-age=31536000; preload; includeSubDomains
Content-Type: text/html; charset="utf-8"
X-FB-Debug: RxHHRJ6pKLeSckiXLhfwham7NBjHHZ/CVZDy3aZl5xc00gK7MgKjV8Z6kwIQkw6EGJ1PlH+Gaxu0CUBp7veuzA==
Date: Fri, 09 Jul 2021 11:45:31 GMT
Alt-Svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
Connection: keep-alive
Content-Length: 0

HTTP/1.1 302 Found
Location: https://www.facebook.com/login/?next=https%3A%2F%2Fwww.facebook.com%2Fwatch%2F%3Fv%3D471206254237838
Strict-Transport-Security: max-age=15552000; preload
Content-Type: text/html; charset="utf-8"
X-FB-Debug: CvWQUD0lRrJAxAvc1qgHdT8qPuj+jwU6ePJ7iFkA7m4NmCrbX13WP7mOg3VKfrZfPz5GXHFqBwyhPIOMiD3tsQ==
Date: Fri, 09 Jul 2021 11:45:31 GMT
Alt-Svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
Connection: keep-alive
Content-Length: 0

HTTP/1.1 200 OK
Vary: Accept-Encoding
Content-Encoding: gzip
x-fb-rlafr: 0
Pragma: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
Expires: Sat, 01 Jan 2000 00:00:00 GMT
X-Content-Type-Options: nosniff
X-XSS-Protection: 0
X-Frame-Options: DENY
Strict-Transport-Security: max-age=15552000; preload
Content-Type: text/html; charset="utf-8"
X-FB-Debug: IxBcFTZKXQMv2LX/ogZwEhv/dm3qavyDJTR0FjdCsTcnOmu5VP0Xm2jNVsLBZ/qPg+1cP33bkXzQuss5YjcjaA==
Date: Fri, 09 Jul 2021 11:45:32 GMT
Priority: u=3,i
Alt-Svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
Transfer-Encoding: chunked
Connection: keep-alive

and last:

float(1625831399.1287)

JoshyPHP commented 3 years ago

I assume you skipped the HTML parts. It looks like Facebook is redirecting you along a different path from mine. Yours goes to a login page while mine doesn't, and that's probably why the short link can't be resolved to a canonical one on your server.
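
To make the redirect path easier to read in a header dump like the one above, a small helper can list each response's status and `Location` header. This is a minimal sketch: `extractRedirectChain()` is a hypothetical name, not part of the library, and the sample input is abbreviated from the posted output.

```php
<?php
/**
 * Collect the status code and Location header of every response found in a
 * raw header dump, such as cURL output with CURLOPT_HEADER enabled.
 * Hypothetical helper for illustration only.
 */
function extractRedirectChain(string $rawHeaders): array
{
    $chain = [];
    foreach (preg_split('/\R/', $rawHeaders) as $line)
    {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m))
        {
            // A status line starts a new hop
            $chain[] = ['status' => (int) $m[1], 'location' => null];
        }
        elseif ($chain && preg_match('#^Location:\s*(\S+)#i', $line, $m))
        {
            // Attach the redirect target to the most recent hop
            $chain[count($chain) - 1]['location'] = $m[1];
        }
    }

    return $chain;
}

// Abbreviated from the output posted above
$dump = "HTTP/1.1 302 Found\n"
      . "Location: https://www.facebook.com/watch/?v=471206254237838\n"
      . "\n"
      . "HTTP/1.1 302 Found\n"
      . "Location: https://www.facebook.com/login/?next=https%3A%2F%2Fwww.facebook.com%2Fwatch%2F%3Fv%3D471206254237838\n"
      . "\n"
      . "HTTP/1.1 200 OK\n";

foreach (extractRedirectChain($dump) as $hop)
{
    $note = (strpos((string) $hop['location'], '/login/') !== false) ? '  <- login page' : '';
    echo $hop['status'], ' ', $hop['location'] ?? '(final response)', $note, "\n";
}
```

With the sample input, the second hop is the one flagged as a login page.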

You can try applying the following modification to the file in vendor/s9e/text-formatter, it will probably fix it in this specific case. It has some potential side effects, so I'm still evaluating whether to apply it to the library.

diff --git a/src/Plugins/MediaEmbed/Parser.php b/src/Plugins/MediaEmbed/Parser.php
index 0c1ee9714..d99046be9 100644
--- a/src/Plugins/MediaEmbed/Parser.php
+++ b/src/Plugins/MediaEmbed/Parser.php
@@ -243,7 +243,8 @@ protected static function scrape(array &$attributes, $url, array $config, $cache
        protected static function wget($url, $cacheDir, $config)
        {
                $options = [
-                       'headers' => (isset($config['header'])) ? (array) $config['header'] : []
+                       'headers'       => (isset($config['header'])) ? (array) $config['header'] : [],
+                       'returnHeaders' => true
                ];

                return @self::getHttpClient($cacheDir)->get($url, $options);

elchio commented 3 years ago

Yeah! That solved my problem for now :) Could you write more about potential side effects?

And one more question - could you show me your output?

JoshyPHP commented 3 years ago

I'm not aware of any specific side effects, but since it alters the data used when scraping there's a chance it could change the behaviour.
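
One way to picture how extra data can change a scrape: if response headers are part of the scraped text, a video ID that appears only in a `Location` header becomes findable even when the final page is a login form. The sketch below is not the library's scraping code; `findVideoId()` and its pattern are hypothetical stand-ins, and the HTML body is a placeholder.

```php
<?php
/**
 * Look for a numeric video ID in scraped text.
 * The pattern is a hypothetical stand-in, not the library's own.
 */
function findVideoId(string $scraped): ?string
{
    return preg_match('#facebook\.com/watch/\?v=(\d+)#', $scraped, $m) ? $m[1] : null;
}

// First redirect from the output posted earlier in the thread
$headers = "HTTP/1.1 302 Found\nLocation: https://www.facebook.com/watch/?v=471206254237838\n\n";
// Placeholder standing in for a login page that contains no video ID
$body    = '<html><title>Log in to Facebook</title></html>';

var_dump(findVideoId($body));            // NULL
var_dump(findVideoId($headers . $body)); // string(15) "471206254237838"
```

The same mechanism is why side effects are conceivable: any pattern that previously saw only the body would now also see header text.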

The difference between your output and mine is in the "Location" redirects. I'll post the relevant part below.

HTTP/2 302 
location: https://www.facebook.com/watch/?v=471206254237838
...

HTTP/2 302 
location: https://www.facebook.com/razproza/videos/471206254237838/
...

HTTP/2 200

elchio commented 3 years ago

Ok, many thanks for your help.