etiess closed this issue 5 years ago
All of the authors come from the JSON-LD script:
{
"@context": "http:\/\/schema.org",
"@type": "NewsArticle",
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "http:\/\/www.lemonde.fr\/economie\/article\/2017\/11\/24\/la-zone-euro-renoue-avec-une-croissance-solide_5219697_3234.html"
},
"headline": "L\u2019embellie \u00e9conomique de la zone euro, en cinq questions",
"dateCreated": "2017-11-24T10:51:57+0100",
"datePublished": "2017-11-24T10:52:04+0100",
"publisher": {
"@type": "Organization",
"name": "Le Monde",
"logo": {
"@type": "ImageObject",
"url": "http:\/\/s1.lemde.fr\/medias\/web\/1.2.705\/img\/elements_lm\/logo_lm_print.png",
"width": "240",
"height": "42"
}
},
"dateModified": "2017-11-25T06:36:42+0100",
"description": "Apr\u00e8s des ann\u00e9es de vaches maigres, les bons indicateurs se multiplient, la confiance est au plus haut en Europe. Mais \u00e0 long terme, les d\u00e9fis structurels demeurent",
"author": {
"@type": "Person",
"name": "Marie Charrel, Marie de Verg\u00e8s et Elise Barthet"
},
"image": {
"@type": "ImageObject",
"url": "http:\/\/img.lemde.fr\/2017\/11\/24\/64\/0\/2681\/1340\/696\/348\/60\/0\/f1c9953_28568-5tbas4.2x482.jpg",
"width": "696",
"height": "348"
},
"isAccessibleForFree": "false",
"hasPart": {
"@type": "WebPageElement",
"isAccessibleForFree": "false",
"cssSelector": ".teaser_article .js_teaser_article"
}
}
By default, graby adds information to what it has already found. Try removing everything related to the author in the siteconfig, then try again.
About the date (which is also retrieved from the JSON-LD): graby uses the published date over the modified date.
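To illustrate the date rule described above, here is a minimal Python sketch. It is not graby's actual code (graby is written in PHP), and the function name `pick_date` is hypothetical; it only shows the idea of preferring `datePublished` over `dateModified` when both appear in a JSON-LD payload.

```python
import json


def pick_date(jsonld_text):
    """Return datePublished if present, otherwise dateModified, otherwise None.

    Hypothetical helper: it mirrors the rule described in this thread,
    not graby's real implementation.
    """
    data = json.loads(jsonld_text)
    return data.get("datePublished") or data.get("dateModified")


# Values copied from the JSON-LD script quoted above.
sample = (
    '{"datePublished": "2017-11-24T10:52:04+0100", '
    '"dateModified": "2017-11-25T06:36:42+0100"}'
)
print(pick_date(sample))
```

With the Le Monde payload, this picks the 2017-11-24 value, which is why the 2017-11-25 modification date does not surface.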
Removing everything related to the author in the siteconfig gives only "Marie Charrel, Marie de Vergès et Elise Barthet". It works, but that's less clean than the result of author: //a[@class='auteur'], which gives an array with the 3 authors.
If the siteconfig does find something for authors, why does graby add information? Wouldn't it be possible to add information ONLY if the siteconfig doesn't find anything?
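The difference between the two behaviours can be sketched in Python. This is only an illustration under assumptions: `append_authors` is a simplified model of the additive behaviour described in this thread, `fallback_authors` is the alternative proposed here, and neither name exists in graby.

```python
def append_authors(siteconfig_authors, jsonld_authors):
    """Simplified model of the additive behaviour: keep everything found."""
    return list(siteconfig_authors) + list(jsonld_authors)


def fallback_authors(siteconfig_authors, jsonld_authors):
    """Proposed behaviour: use JSON-LD only when the siteconfig found nothing."""
    if siteconfig_authors:
        return list(siteconfig_authors)
    return list(jsonld_authors)


# What //a[@class='auteur'] yields versus the single JSON-LD "name" string.
from_siteconfig = ["Marie Charrel", "Marie de Vergès", "Elise Barthet"]
from_jsonld = ["Marie Charrel, Marie de Vergès et Elise Barthet"]

print(append_authors(from_siteconfig, from_jsonld))    # 4 entries, with a duplicate string
print(fallback_authors(from_siteconfig, from_jsonld))  # only the 3 siteconfig authors
print(fallback_authors([], from_jsonld))               # JSON-LD used as a fallback
```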
About the date, I'm not sure I understand what you're saying. We have:
Data fetched: [array]
"last-modified": "Tue, 05 Dec 2017 03:40:22 GMT"
The last-modified value in the data comes from the HTTP request; it has nothing to do with the content :slightly_smiling_face:
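To make the distinction concrete, here is a small Python sketch that parses the two timestamps quoted in this thread: the HTTP `Last-Modified` header and the `dateModified` field from the JSON-LD. The variable names are illustrative only; nothing here reflects graby's internals.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# HTTP response header: when the server says the resource last changed.
http_last_modified = parsedate_to_datetime("Tue, 05 Dec 2017 03:40:22 GMT")

# JSON-LD field: when the publisher says the article was last edited.
jsonld_modified = datetime.strptime(
    "2017-11-25T06:36:42+0100", "%Y-%m-%dT%H:%M:%S%z"
)

print(http_last_modified.date())  # 2017-12-05
print(jsonld_modified.date())     # 2017-11-25

# Both values are timezone-aware, so they can be compared directly.
print(http_last_modified - jsonld_modified)
```

The two values come from different layers (transport versus content), so there is no reason to expect them to match.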
I don't know about overriding information; maybe you're right. What do you think, @nicosomb @tcitworld? :arrow_right: Should we override values when we find a JSON-LD script (see my previous comment above) or not? For example: we find a date in the content (using the siteconfig or not) and the JSON-LD also defines a date (which might be different). Which one should we choose?
> The last-modified value in the data comes from the HTTP request; it has nothing to do with the content 🙂
OK ;-) But then, why doesn't wallabag show the last modification date (2017-11-25)?
And I'm not sure I understand all the challenges here, but my idea of overriding was more about "specific siteconfig file" versus "standard graby".
Because graby overrides it: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L1074-L1081
OK, I understand now. I think it's the right thing to do.
And is there no way to display both? It would be great to show both when the mouse hovers over the date (like when we hover over the original link, we see the original title).
What do we do about authors?
Following the revamp of Le Monde's website, the URLs in abonnes.lemonde.fr are no longer used. I am currently working on a new lemonde.fr.txt that I will submit as a PR to the fivefilters repository.
Several authors: there are duplicates:
- author: //span[@id='publisher'] was not necessary (and it's not present in lemonde.fr.txt)
- author: //a[@class='auteur'] ?
Other remaining issue: the last modified date is not the right one: it grabs 05 Dec 2017 instead of 25 Nov 2017.
Same as mediapart: as there's a paywall, I don't know where to push the modification.