j0k3r / graby

Graby helps you extract article content from web pages
MIT License

title detection not working on manager-magazin.de #224

Closed by Strubbl 4 years ago

Strubbl commented 4 years ago

Hi j0k3r, in wallabag the title is not extracted correctly for any article from manager-magazin.de.

I get the title: a-1305306

But I expect the following title: Sperrzone im Norden wegen Coronavirus: Das müssen Italien-Reisende jetzt wissen

A concrete example, which I tested with f43.me: https://www.manager-magazin.de/lifestyle/reise/coronavirus-was-italien-reisende-jetzt-wissen-muessen-a-1305306.html

The only hint about where the wrong title comes from is in the f43 debug log, which contains these two lines:

title matched from JsonLd: {Sperrzone im Norden wegen Coronavirus: Das müssen Italien-Reisende jetzt wissen}
title matched from JsonLd: {a-1305306}

The latter is exactly the title I get.

Do you have any idea why it fails in wallabag?

Kdecherf commented 4 years ago

Hello @Strubbl,

Here is the JSON-LD data for your link:

        "name":             "a-1305306",
        "headline":         "Sperrzone im Norden wegen Coronavirus: Das müssen Italien-Reisende jetzt wissen",

Graby treats name as the most important value to keep, so it wins over headline. See the order here: https://github.com/j0k3r/graby/blob/c27bcc8ab462a9c1a6dd07a35d5de279f7799ef2/src/Extractor/ContentExtractor.php#L1316-L1322

and here: https://github.com/j0k3r/graby/blob/c27bcc8ab462a9c1a6dd07a35d5de279f7799ef2/src/Extractor/ContentExtractor.php#L1347-L1354
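To make the effect of that ordering concrete, here is a minimal Python sketch. It is not Graby's code (Graby is written in PHP): `pick_title` and `TITLE_KEYS` are hypothetical names, and the "last match wins" loop is an assumption based only on the two debug log lines quoted in this issue.

```python
import json

# Hypothetical priority list: a key later in the list overrides an earlier one,
# which mirrors the behaviour reported in this thread (name beats headline).
TITLE_KEYS = ["headline", "name"]


def pick_title(json_ld_text, keys=TITLE_KEYS):
    """Return the title chosen from a JSON-LD object, or None if no key matches."""
    data = json.loads(json_ld_text)
    title = None
    for key in keys:
        value = data.get(key)
        if isinstance(value, str) and value.strip():
            title = value.strip()  # a later key overwrites an earlier match
    return title


sample = json.dumps({
    "name": "a-1305306",
    "headline": "Sperrzone im Norden wegen Coronavirus: "
                "Das müssen Italien-Reisende jetzt wissen",
})

print(pick_title(sample))                        # a-1305306
print(pick_title(sample, ["name", "headline"]))  # the headline text
```

With the order reversed, the sketch returns the headline, which is the title expected for this site.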

A quick workaround for this website is to ignore its JSON-LD data.

Strubbl commented 4 years ago

Nice to know. Thank you.

Is it possible to ignore JSON-LD data with a site-config file? If not, how do I apply the workaround?

Kdecherf commented 4 years ago

> Is it possible to ignore JSON-LD data with a site-config file?

Yeah, you can add skip_json_ld: true to the relevant site-config file to do that.
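As a rough illustration, the site-config file could look like the fragment below. The file name manager-magazin.de.txt is an assumption based on the usual one-file-per-hostname convention for site configs; only the skip_json_ld line comes from this thread.

```
# manager-magazin.de.txt (assumed file name)
# Ignore JSON-LD so that "name" (a-1305306) is not picked up as the title.
skip_json_ld: true
```

With JSON-LD skipped, the title should come from the other detection methods rather than from the name field.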