oscarotero / Embed

Get info from any web service or page
MIT License
2.08k stars 312 forks source link

Doesn't extract PublishedTime from datePublished field #385

Closed paulhennell closed 3 years ago

paulhennell commented 3 years ago

I've taken a quick run at this as it seems pretty fixable, but can't work out where it's going wrong. You can see an example on the online demo here: https://oscarotero.com/embed/demo/index.php?url=https%3A%2F%2Fwww.bbc.com%2Fnews%2Fuk-england-london-54204929

The BBC article has a 'datePublished' & 'dateModifed' set in the "All data collected" array, but nothing is extracted to the 'publishedTime' field.

Looking at the PublishedTime detector it is looking for a 'datepublished' option, but I haven't followed the logic to work out why it's failing.

oscarotero commented 3 years ago

Mmm, seems like the detector is only looking for pagePublished (not datePublished or dataModified) in the linkedData store: https://github.com/oscarotero/Embed/blob/master/src/Detectors/PublishedTime.php#L26

Could you test if including the other two keys fixes the problem? And create a pull request? Many thanks!

paulhennell commented 3 years ago

I've tried changing the code in Published time but getting some varied results.

Running with test in PR: #386

//Returns date as expected & passes test
?: $ld->time('datePublished')

//Seems like it should work based on exisiting code, but always returns null
?: $ld->time('pagePublished', 'datePublished')

//Throws and error!
//Error: Call to undefined method ML\JsonLD\TypedValue::getProperty()
// From LinkedData.php L71
?: $ld->time('datePublished', 'pagePublished')

I attempted to delve a bit deeper, but got a bit lost in what it's trying to do within linkeddata get.

oscarotero commented 3 years ago

Ok, I see. The problem is that linkedData does not work exactly like other containers like meta. In meta, you can pass multiple arguments and it returns the first value found. In linkedData, the arguments are to retrieve properties inside other properties. For example:

$ld->time('pagePublished', 'datePublished'); //Returns the value pagePublished->datePublished

In this case, you should to this:

$ld->time('pagePublished') ?: $ld->time('datePublished');

Maybe the ld container should be changed at some point for a similar behavior as others. For example, using a dot notation if we want to get inner values $ld->time('article.datePublished');

paulhennell commented 3 years ago

Dot notation would make more sense IMO, having methods with the same names and arguments working in such different ways isn't very clear.

I have now added the 'double call' solution to pr #386 - which fixes the test included.

Thanks for your guidance on this - working on a project where I'm using this to get info on various submited links, so hope to be able to add further PRs should I find other sites where the current extractors miss something!

oscarotero commented 3 years ago

Great. Thanks for your contribution!