j0k3r / graby

Graby helps you extract article content from web pages
MIT License

site_config's author definition is ignored if present in json #306

Closed mutschler closed 1 year ago

mutschler commented 1 year ago

Is there any reason for this? All other fields can be overridden by the site_config, but authors cannot: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L213

IMHO there shouldn't be a difference between what can be overridden and what cannot. Simply removing the if (empty($this->authors)) check should make authors behave like all other tags.

j0k3r commented 1 year ago

Because authors can be extracted from JSON-LD data a few lines earlier, in extractDefinedInformation.

I assume information from JSON-LD (if properly extracted) is likely to be more accurate than what the authors rule in a siteconfig provides.

As we can't determine whether the JSON-LD information is actually more accurate than the result of the authors rule, the decision was made to trust JSON-LD.

mutschler commented 1 year ago

Ah, makes sense. Setting skip_json_ld=true solved it. Thanks!
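
For readers following along, the precedence described in this thread can be sketched roughly as below. This is a minimal illustration in Python, not Graby's actual PHP code: the function name resolve_authors and its parameters are hypothetical, and the only behavior taken from the discussion is that JSON-LD authors win when present, the site config authors rule is used only when no authors were found, and skip_json_ld makes the extractor ignore JSON-LD.

```python
from typing import List


def resolve_authors(
    json_ld_authors: List[str],
    site_config_authors: List[str],
    skip_json_ld: bool = False,
) -> List[str]:
    """Hypothetical sketch of the author precedence discussed above."""
    # With skip_json_ld enabled, JSON-LD data is not consulted at all.
    authors = [] if skip_json_ld else list(json_ld_authors)

    # Mirrors the empty-authors guard from the thread: the site config
    # rule only applies when nothing has been extracted so far.
    if not authors:
        authors = list(site_config_authors)

    return authors
```

Under these assumptions, a page whose JSON-LD names one author and whose site config rule matches another would return the JSON-LD author by default, and the site config author once skip_json_ld is set.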