jkphl / micrometa

A meta parser for extracting micro information out of web documents, currently supporting Microformats 1+2, HTML Microdata, RDFa Lite 1.1, JSON-LD and Link Types, written in PHP
http://micrometa.jkphl.is
MIT License

JSON-LD parser only finds the first item #16

Open jkphl opened 7 years ago

jkphl commented 7 years ago

On 20.03.2017 at 13:59, Claas Kalwa wrote:

Hi Joschi,

I'm having trouble extracting multiple JSON-LD items with the Micrometa V1 parser. It only recognizes the first item, regardless of whether the items are grouped with @graph or appear separately in their own script elements.

I've attached an example that I think should actually work.

Do you have any idea where the problem might be?

Example source:

<!DOCTYPE html>

<html>
    <head>
        <title>TODO supply a title</title>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@graph": [
        {
          "name": "Google Inc.",
          "@type": "LocalBusiness",
          "address": {
            "@type": "PostalAddress",
            "addressCountry": "United States",
            "streetAddress": "1600 Amphitheatre Parkway",
            "addressLocality": "Mountain View",
            "addressRegion": "CA",
            "postOfficeBoxNumber": null,
            "postalCode": "94043",
            "telephone": "+1 650-253-0000",
            "faxNumber": "+1 650-253-0001"
          }
        },
        {
          "name": "Google Ann Arbor",
          "@type": "LocalBusiness",
          "address": {
            "@type": "PostalAddress",
            "addressCountry": "United States",
            "streetAddress": "201 S. Division St. Suite 500",
            "addressLocality": "Ann Arbor",
            "addressRegion": "MI",
            "postOfficeBoxNumber": null,
            "postalCode": "48104",
            "telephone": "+1 734-332-6500",
            "faxNumber": "+1 734-332-6501"
          }
        }
      ]
    }
    </script>

    </head>
    <body>
        <div>TODO write content</div>

    </body>
</html>
rvanlaak commented 5 years ago

The commit that closed this issue does not fix it entirely. The JSON-LD implementation still fails to find multiple items when the value of @graph contains more than one root item (read: is an array with several entries).

Why? Because \Jkphl\Micrometa\Infrastructure\Parser\JsonLD::parseRootNode only returns the first node it finds. This is probably the specific framing implementation that the class docblock mentions (?)

Did you ever consider adding some sort of "filter" option, so that users can specify the type from which building up the graph should start? That way, returning only one node would still be possible.

I will try to write a test demonstrating that only the graph of the first node gets returned.

{
  "@context": "http://schema.org",
  "@graph": [
    {
      "@type": "Article",
      "@id": "/articles/foobar",
      "comment": [
        {"@id": "/articles/foobar#comment-1"},
        {"@id": "/articles/foobar#comment-2"}
      ]
    },
    {
      "@type": "Comment",
      "@id": "/articles/foobar#comment-1"
    },
    {
      "@type": "Comment",
      "@id": "/articles/foobar#comment-2"
    }
  ]
}
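
To make the expected behaviour concrete, here is a minimal plain-PHP sketch, independent of micrometa's actual classes, that collects every top-level item of @graph and optionally filters by @type. The function name collectGraphItems and its $typeFilter parameter are hypothetical and only illustrate the "filter" idea; the sketch assumes the input is a single JSON object and works on raw decoded JSON, not on expanded JSON-LD nodes as the real parser does.

```php
<?php

/**
 * Hypothetical helper (not part of micrometa): return every top-level
 * item of a JSON-LD document, optionally restricted to one @type.
 */
function collectGraphItems(string $json, ?string $typeFilter = null): array
{
    $root = json_decode($json, true);
    if (!is_array($root)) {
        return []; // invalid JSON or a scalar value
    }

    // With @graph, every array entry is a root item; without it,
    // the document itself is treated as the only item.
    $items = $root['@graph'] ?? [$root];

    return array_values(array_filter(
        $items,
        static function ($item) use ($typeFilter) {
            if (!is_array($item)) {
                return false;
            }
            if ($typeFilter === null) {
                return true;
            }
            // @type may be a single string or a list of strings.
            return in_array($typeFilter, (array) ($item['@type'] ?? []), true);
        }
    ));
}

$json = <<<'JSON'
{
  "@context": "http://schema.org",
  "@graph": [
    {"@type": "Article", "@id": "/articles/foobar"},
    {"@type": "Comment", "@id": "/articles/foobar#comment-1"},
    {"@type": "Comment", "@id": "/articles/foobar#comment-2"}
  ]
}
JSON;

$all = collectGraphItems($json);
$comments = collectGraphItems($json, 'Comment');
echo count($all), ' items, ', count($comments), " comments\n"; // prints "3 items, 2 comments"
```

Without a filter all three root items come back; with a filter, a caller who only wants one node can still narrow the result down.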
jkphl commented 5 years ago

@rvanlaak Re-opening ... looking forward to any constructive suggestion! :+1:

rvanlaak commented 5 years ago

For now, we have added a custom JSON-LD parser that decorates the library's parser to support named graphs.

Our domain depends on filtering by @type, so that filter is embedded in the parser, because the constructor on ParserInterface does not allow us to inject it nicely.

When $jsonLDRoot does not match the named-graph shape (read: it lacks @graph or @context), the regular JsonLD behavior is used.

class JsonLDFilteredParser extends JsonLD
{
    public const FORMAT = 32;

    protected function parseRootNode($jsonLDRoot)
    {
        // Only handle named graphs here (both @graph and @context present);
        // everything else falls back to the library's default behavior
        if (!isset($jsonLDRoot->{'@graph'}, $jsonLDRoot->{'@context'})) {
            return parent::parseRootNode($jsonLDRoot);
        }

        try {
            $jsonLDDocument = JsonLDParser::getDocument($jsonLDRoot, ['documentLoader' => $this->contextLoader]);

            /** @var GraphInterface $graph */
            $graph = $jsonLDDocument->getGraph();

            // Walk the prioritized types and parse the first type
            // that matches exactly one node in the graph
            foreach (FilterTypes::types as $type) {
                $nodes = $graph->getNodesByType('http://schema.org/'.$type);

                if (1 === \count($nodes)) {
                    $node = current($nodes);

                    return $this->parseNode($node);
                }
            }
        } catch (JsonLdException $exception) {
            $this->logger->error($exception->getMessage(), ['exception' => $exception]);
        }

        // No type matched unambiguously, or parsing failed
        return null;
    }
}
Sarke commented 5 years ago

Same problem, here's an example: https://www.macobserver.com/news/apple-changes-testing-ios-14/

@rvanlaak Where is the FilterTypes class from in your example? I'm inferring that JsonLDParser is ML\JsonLD\JsonLD.

rvanlaak commented 5 years ago

FilterTypes::types is one of our local constants. It is just an array of type names, ordered by which node type we want to find first.
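
A minimal sketch of what such a prioritized constant could look like, using plain arrays instead of graph nodes. Everything here is assumed for illustration: the type list, its order, and the helper pickByPriority are hypothetical and not taken from the actual code base. It mirrors the parser above in accepting a type only when exactly one node of that type exists.

```php
<?php

// Hypothetical stand-in for the project-local FilterTypes class;
// the entries and their order are assumptions for illustration.
final class FilterTypes
{
    public const types = ['Article', 'LocalBusiness', 'Comment'];
}

/**
 * Return the single node of the highest-priority type, or null.
 * $nodesByType maps a type name to the list of nodes of that type.
 */
function pickByPriority(array $nodesByType): ?array
{
    foreach (FilterTypes::types as $type) {
        $nodes = $nodesByType[$type] ?? [];

        // Only an unambiguous match (exactly one node) counts.
        if (count($nodes) === 1) {
            return current($nodes);
        }
    }

    return null;
}

$picked = pickByPriority([
    'Comment' => [['@id' => '#comment-1'], ['@id' => '#comment-2']],
    'Article' => [['@id' => '/articles/foobar']],
]);
echo $picked['@id'], "\n"; // prints "/articles/foobar"
```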