spekulatius / PHPScraper

A universal web-util for PHP.
https://phpscraper.de
GNU General Public License v3.0

Parsing structured data (ld+json) #16

Open spekulatius opened 4 years ago

spekulatius commented 4 years ago

It would make sense to parse the structured data (JSON-LD) that some sites provide within the head tag. This way, the information already read from the meta tags could be made more robust and possibly extended later on.

Ref: https://developers.google.com/search/docs/data-types/article

spekulatius commented 4 years ago

Context: https://json-ld.org/

eposjk commented 1 year ago

Some thoughts:

A website can contain multiple JSON-LD blocks. It seems possible to combine them (https://stackoverflow.com/a/48295719); we should probably use the array notation:

[
  {
    "@context": "http://schema.org",
    "@type": "Organization"
  },
  {
    "@context": "http://schema.org",
    "@type": "BreadcrumbList"
  }
]

Would it make sense to always return an array, even if the page contains only one JSON-LD block? (Probably yes.)
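
A minimal sketch of the "always return an array" idea, assuming each block has already been decoded with `json_decode($text, true)`. The name `normalizeJsonLdBlocks` is hypothetical and not part of PHPScraper; it only illustrates that a single object and the combined array notation can end up in the same flat list:

```php
<?php

/**
 * Hypothetical helper (not PHPScraper's API): flattens decoded JSON-LD
 * blocks into one list of nodes, so callers always receive an array.
 *
 * @param array $blocks One json_decode($text, true) result per script tag.
 * @return array List of JSON-LD nodes (possibly empty).
 */
function normalizeJsonLdBlocks(array $blocks): array
{
    $nodes = [];
    foreach ($blocks as $block) {
        if (!is_array($block) || $block === []) {
            continue; // invalid JSON (null), a scalar, or an empty block
        }
        // A top-level JSON array decodes to a list with keys 0..n-1;
        // a single JSON object decodes to an array with string keys.
        $isList = array_keys($block) === range(0, count($block) - 1);
        foreach ($isList ? $block : [$block] as $node) {
            if (is_array($node)) {
                $nodes[] = $node;
            }
        }
    }
    return $nodes;
}

// One page with a single object, one combined array, and one broken block.
$single = json_decode('{"@context": "http://schema.org", "@type": "Organization"}', true);
$combined = json_decode('[{"@type": "Organization"}, {"@type": "BreadcrumbList"}]', true);

$nodes = normalizeJsonLdBlocks([$single, $combined, null]);
echo count($nodes), PHP_EOL; // prints 3
```

With this shape, a page with one block simply yields a one-element array, so calling code never has to branch on the number of blocks.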

spekulatius commented 1 year ago

Hey @eposjk,

Good point on the multiple ld+json blocks.

Yeah, if data exists in multiple positions we should go for an array. It might contain only one element, but at least it's future-proof. Merging blocks into one might be an option too.

Cheers, Peter

joshua-bn commented 1 year ago

This is what I'm using:

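        // $dom is assumed to be a DOMDocument that already holds the page's HTML.
        $jsonLd = [];
        foreach ($dom->getElementsByTagName('script') as $script) {
            if ($script->getAttribute('type') === 'application/ld+json') {
                // Strip /* ... */ comments (the "s" flag lets them span lines), then flatten line breaks.
                $json_txt = preg_replace('@/\*.*?\*/@s', '', $script->textContent);
                $json_txt = preg_replace("/\r|\n/", ' ', trim($json_txt));
                $schema = json_decode($json_txt, true);
                if (!is_array($schema)) {
                    continue; // not valid JSON, skip this block
                }
                if (isset($schema['@graph'])) {
                    // array_merge appends; "+=" would drop @graph entries whose numeric keys already exist.
                    $jsonLd = array_merge($jsonLd, $schema['@graph']);
                } else {
                    $jsonLd[] = $schema;
                }
            }
        }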
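
For anyone who wants to try this end to end, here is a rough, self-contained sketch of the same loop wrapped in a function. The name `extractJsonLd` and the sample HTML are made up for illustration and are not PHPScraper's API; it assumes the `dom` extension is available and skips the comment stripping for brevity:

```php
<?php

/**
 * Hypothetical wrapper (not PHPScraper's API): collects every JSON-LD node
 * found in an HTML string and always returns a list.
 */
function extractJsonLd(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $jsonLd = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        if ($script->getAttribute('type') !== 'application/ld+json') {
            continue;
        }
        $schema = json_decode(trim($script->textContent), true);
        if (!is_array($schema)) {
            continue; // skip blocks that are not valid JSON
        }
        if (isset($schema['@graph']) && is_array($schema['@graph'])) {
            // Append the @graph entries one after another.
            $jsonLd = array_merge($jsonLd, array_values($schema['@graph']));
        } else {
            $jsonLd[] = $schema;
        }
    }
    return $jsonLd;
}

// Sample page: one plain block and one block that uses @graph.
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">{"@context": "http://schema.org", "@type": "Organization"}</script>
<script type="application/ld+json">{"@context": "http://schema.org", "@graph": [{"@type": "BreadcrumbList"}, {"@type": "Article"}]}</script>
</head><body></body></html>
HTML;

$types = [];
foreach (extractJsonLd($html) as $node) {
    $types[] = $node['@type'];
}
echo implode(', ', $types), PHP_EOL;
```

The second block shows why the merge matters: both `@graph` entries should end up in the result next to the `Organization` node from the first block.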