j0k3r / graby-site-config

Graby site config files

Update abonnes.lemonde.fr.txt #28

Closed etiess closed 5 years ago

etiess commented 6 years ago

When an article has several authors, they show up as duplicates:

Other remaining issue: the last modified date is wrong. It grabs 05 Dec 2017 instead of 25 Nov 2017.

Same as mediapart: as there's a paywall, I don't know where to push the modification.

j0k3r commented 6 years ago

All the authors come from the JSON-LD script:

{
    "@context": "http:\/\/schema.org",
    "@type": "NewsArticle",
    "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "http:\/\/www.lemonde.fr\/economie\/article\/2017\/11\/24\/la-zone-euro-renoue-avec-une-croissance-solide_5219697_3234.html"
    },
    "headline": "L\u2019embellie \u00e9conomique de la zone euro, en cinq questions",
    "dateCreated": "2017-11-24T10:51:57+0100",
    "datePublished": "2017-11-24T10:52:04+0100",
    "publisher": {
        "@type": "Organization",
        "name": "Le Monde",
        "logo": {
            "@type": "ImageObject",
            "url": "http:\/\/s1.lemde.fr\/medias\/web\/1.2.705\/img\/elements_lm\/logo_lm_print.png",
            "width": "240",
            "height": "42"
        }
    },
    "dateModified": "2017-11-25T06:36:42+0100",
    "description": "Apr\u00e8s des ann\u00e9es de vaches maigres, les bons indicateurs se multiplient, la confiance est au plus haut en Europe. Mais \u00e0 long terme, les d\u00e9fis structurels demeurent",
    "author": {
        "@type": "Person",
        "name": "Marie Charrel, Marie de Verg\u00e8s et Elise Barthet"
    },
    "image": {
        "@type": "ImageObject",
        "url": "http:\/\/img.lemde.fr\/2017\/11\/24\/64\/0\/2681\/1340\/696\/348\/60\/0\/f1c9953_28568-5tbas4.2x482.jpg",
        "width": "696",
        "height": "348"
    },
    "isAccessibleForFree": "false",
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": "false",
        "cssSelector": ".teaser_article .js_teaser_article"
    }
}

By default, graby adds information to what it already found. Try removing everything related to author in the siteconfig and try again.

About the date (which is also retrieved from the JSON-LD): graby uses the published date in preference to the modified date.
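
To illustrate the date behaviour described here, a minimal Python sketch follows. It is not graby's actual PHP code: the helper name `pick_date` and the trimmed JSON are assumptions made for the example. It reads a JSON-LD block like the one above and prefers `datePublished`, falling back to `dateModified` only when no published date is present.

```python
import json
from datetime import datetime

# Trimmed copy of the JSON-LD above; only the fields used here are kept.
JSONLD = r'''
{
    "@type": "NewsArticle",
    "datePublished": "2017-11-24T10:52:04+0100",
    "dateModified": "2017-11-25T06:36:42+0100",
    "author": {
        "@type": "Person",
        "name": "Marie Charrel, Marie de Verg\u00e8s et Elise Barthet"
    }
}
'''


def pick_date(data):
    """Hypothetical helper: the published date wins over the modified date."""
    raw = data.get("datePublished") or data.get("dateModified")
    if raw is None:
        return None
    # The offset is written as +0100 (no colon), which %z accepts.
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")


data = json.loads(JSONLD)
date = pick_date(data)
author = data["author"]["name"]
print(date.date().isoformat())
print(author)
```

Note that the author arrives as one combined string in a single `Person` entry, which is why it cannot be matched one-to-one against individual names found elsewhere.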

etiess commented 6 years ago

Removing everything related to author in the siteconfig gives only "Marie Charrel, Marie de Vergès et Elise Barthet". It works, but that's less clean than the result of author: //a[@class='auteur'], which gives an array with the 3 authors.

If the siteconfig does find something for authors, why does graby add information? Wouldn't it be possible to add information ONLY if the siteconfig doesn't find anything?
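
The "fallback only" behaviour asked about here could be sketched as follows in Python. This is purely illustrative: `split_names`, `merge_authors` and the rule itself are assumptions, not how graby currently behaves. JSON-LD authors are used only when the siteconfig found nothing, and a combined name string is split so that both sources yield a list.

```python
import re


def split_names(combined):
    """Split a string such as 'A, B et C' into individual names."""
    parts = re.split(r"\s*,\s*|\s+et\s+", combined)
    return [p for p in parts if p]


def merge_authors(siteconfig_authors, jsonld_author):
    """Use the JSON-LD author only when the siteconfig found none."""
    if siteconfig_authors:
        return list(siteconfig_authors)
    if jsonld_author:
        return split_names(jsonld_author)
    return []


jsonld = "Marie Charrel, Marie de Vergès et Elise Barthet"
from_xpath = ["Marie Charrel", "Marie de Vergès", "Elise Barthet"]

print(merge_authors(from_xpath, jsonld))  # siteconfig result is kept as-is
print(merge_authors([], jsonld))          # JSON-LD is used as a fallback
```

Splitting on commas and the French "et" is a heuristic and would break on a name that itself contains " et ".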

About the date, I'm not sure I understand what you're saying. We have:

j0k3r commented 6 years ago

The last-modified value in the data comes from the HTTP request; it has nothing to do with the content 🙂
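
As a rough illustration of that distinction, here is a Python sketch with a made-up header value (only the date 05 Dec 2017 comes from this thread; the time of day is invented). The `Last-Modified` HTTP header and the JSON-LD `dateModified` field are read from different places and need not agree.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical response header: the date matches the one reported above,
# the time of day is invented for the example.
headers = {"Last-Modified": "Tue, 05 Dec 2017 00:00:00 GMT"}

# Value taken from the JSON-LD script embedded in the page content.
jsonld_modified = "2017-11-25T06:36:42+0100"

http_date = parsedate_to_datetime(headers["Last-Modified"])
content_date = datetime.strptime(jsonld_modified, "%Y-%m-%dT%H:%M:%S%z")

print(http_date.date(), content_date.date())
```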

I don't know about overriding information; maybe you're right. What do you think, @nicosomb @tcitworld? Should we override values when we find a JSON-LD script (see my previous comment above) or not? For example: we find a date in the content (using the siteconfig or not) and a JSON-LD script also defines a date (which might be different). Which one should we choose?

etiess commented 6 years ago

The last-modified value in the data comes from the HTTP request; it has nothing to do with the content 🙂

OK ;-) But then, why doesn't wallabag show the last modification date? (2017-11-25)

And I'm not sure I understand all the challenges there, but my idea of overriding was more about "specific siteconfig file" versus "standard graby".

j0k3r commented 6 years ago

Because graby overrides it: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L1074-L1081

etiess commented 6 years ago

OK, I understand now. I think it's the right thing to do.

Is there no way to display both? It would be great to show both on mouse hover (like when we hover over the original link and see the original title).

What do we do about authors?

techexo commented 5 years ago

Following the revamp of Le Monde's website, the abonnes.lemonde.fr URLs are no longer used. I am currently working on a new lemonde.fr.txt that I will submit as a PR to the fivefilters repository.

j0k3r commented 5 years ago

Closing