j0k3r / php-readability

A fork of https://bitbucket.org/fivefilters/php-readability
Apache License 2.0

not working in cronjob #47

Closed khalednatour closed 5 years ago

khalednatour commented 5 years ago

Do you have any idea why getting the content does not work when the script runs from a cron job?

j0k3r commented 5 years ago

Do you have an example of how you implemented it?

khalednatour commented 5 years ago

Sure.

$url = "https://www.bbc.com/news/uk-politics-47352446";
$article_content = GetArticleContent($url);

mysql_query("INSERT INTO articles 
    (url, content) VALUES 
    ('$url', '$article_content' )");

function GetArticleContent($url=null){
    require_once '../lib/readability/Readability.php';

    $html = file_get_contents($url);

    // Note: PHP Readability expects UTF-8 encoded content.
    // If your content is not UTF-8 encoded, convert it 
    // first before passing it to PHP Readability. 
    // Both iconv() and mb_convert_encoding() can do this.

    // If we've got Tidy, let's clean up input.
    // This step is highly recommended - PHP's default HTML parser
    // often does a terrible job and results in strange output.
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($html, array(), 'UTF8');
        $tidy->cleanRepair();
        $html = $tidy->value;
    }

    // give it to Readability
    $readability = new Readability($html, $url);
    // print debug output? 
    // useful to compare against Arc90's original JS version - 
    // simply click the bookmarklet with FireBug's console window open
    $readability->debug = false;
    // convert links to footnotes?
    $readability->convertLinksToFootnotes = true;
    // process it
    $result = $readability->init();
    // does it look like we found what we wanted?
    if ($result) {
        $content = $readability->getContent()->innerHTML;
        // if we've got Tidy, let's clean it up for output
        if (function_exists('tidy_parse_string')) {
            $tidy = tidy_parse_string($content, array('indent'=>true, 'show-body-only' => true), 'UTF8');
            $tidy->cleanRepair();
            $content = $tidy->value;
        }
        return trim(strip_tags($content));
    } else {
        return "";
    }
}

j0k3r commented 5 years ago

Instead of $content = $readability->getContent()->innerHTML; try:

$content = $readability->getContent()->ownerDocument->saveXML($readability->getContent());

khalednatour commented 5 years ago

Thank you! Both work fine when the script is requested normally through a web page; my issue only happens when it runs from a cron job.

j0k3r commented 5 years ago

I've no idea then. What's your error? How do you define your cronjob?

khalednatour commented 5 years ago

I didn't find any error :D

Question: does your code depend on anything outside the local server?

j0k3r commented 5 years ago

Nope.

j0k3r commented 5 years ago

This issue is fairly old and there hasn't been much activity on it. Closing, but please re-open if it still occurs.
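
The thread ends without a diagnosis. One common cause of a script that works through the web server but not under cron, offered here as an assumption and not something confirmed above, is the relative path in require_once '../lib/readability/Readability.php': a path starting with ../ is resolved against the current working directory, and cron usually starts the script from a different directory than the web server does. The CLI binary may also load a different php.ini, so allow_url_fopen (needed by file_get_contents($url)) or the Tidy extension can differ between the two. A minimal sketch of how one might check both; pathFrom() is a hypothetical helper, not part of php-readability:

```php
<?php
// Sketch only: pathFrom() is a hypothetical helper, not part of php-readability.

// Report everything so a failing cron run leaves a trace.
error_reporting(E_ALL);
ini_set('log_errors', '1');

/**
 * Join a relative path onto an explicit base directory, so the result
 * does not depend on the current working directory.
 */
function pathFrom(string $baseDir, string $relative): string
{
    return rtrim($baseDir, '/') . '/' . ltrim($relative, '/');
}

// __DIR__ is the directory containing this file, regardless of where
// cron (or the web server) started the process.
$readabilityFile = pathFrom(__DIR__, '../lib/readability/Readability.php');

if (is_file($readabilityFile)) {
    require_once $readabilityFile;
} else {
    error_log("Readability.php not found at {$readabilityFile}");
}

// Settings that can differ between the web server and the CLI binary.
error_log('php.ini in use: ' . var_export(php_ini_loaded_file(), true));
error_log('allow_url_fopen: ' . var_export(ini_get('allow_url_fopen'), true));
error_log('tidy available: ' . var_export(function_exists('tidy_parse_string'), true));
```

When run from the command line, error_log() writes to stderr unless error_log is configured otherwise, so appending something like >> /tmp/cron.log 2>&1 (an example path) to the crontab command keeps the output for inspection.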