lanthaler / JsonLD

JSON-LD processor for PHP
MIT License

Parsing for some sites is broken, maybe a schema.org change: Loading http://schema.org failed #96

Closed greggh closed 5 years ago

greggh commented 5 years ago

Code that has been working fine for a few months started breaking today. Any time I feed in a site with JSON-LD to be checked for structured data, I get:

Message: Loading http://schema.org failed Code: loading remote context failed

I can visit schema.org just fine with a web browser, so it's not some sort of block in place.

It is also happening to other people using another library that relies on this one: https://github.com/jkphl/micrometa/issues/34

It's this section of the processContext function in Processor.php (vendor/ml/json-ld/Processor.php) that's throwing the error.

                try {
                    $remoteContext = $this->loadDocument($remoteContext);
                } catch (JsonLdException $e) {
                    throw new JsonLdException(
                        JsonLdException::LOADING_REMOTE_CONTEXT_FAILED,
                        "Loading $remoteContext failed",
                        null,
                        null,
                        $e
                    );
                }

gabriele-carbonai commented 5 years ago

Did you find a solution?

RichardWallis commented 5 years ago

Hi lanthaler/JsonLD folks.

What I think you are/were seeing here is a defect that appeared when Schema.org v3.5 was released. Under certain circumstances the JSON-LD context document did not include a CORS header in its response (see the public-json-ld-wg thread for details).

This problem was addressed in a fix I applied to the site on April 5th.

If this is still a problem, could you raise an issue in the schemaorg/schemaorg repo?

gabriele-carbonai commented 5 years ago

I have the same extension and the same code on two different websites; one works and the other does not.

RichardWallis commented 5 years ago

Monitoring the server logs for the current version I can see many calls from "lanthaler JsonLD" clients successfully requesting the context file without errors.

@gomonkey Do you have debug traces of the failing calls (preferably with HTTP request/response values) that can help identify what might be your particular issue?

As a matter of interest, from the logs I can see client requests to schema.org from a single IP address requesting the context file again and again (often within milliseconds of its last request; one 5-second snapshot revealed 72 requests from a single AWS-hosted IP). Has any thought been put towards caching such requests, potentially using HTTP response headers such as Last-Modified and Cache-Control: max-age?

This I believe would be good for the performance of clients using this processor and general traffic reduction (especially for cloud based systems paying for bytes transferred).
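
To make the caching idea concrete, here is a minimal sketch of an in-memory cache placed in front of whatever code fetches the context. It is not part of ml/json-ld: the class name CachingFetcher, the callable it wraps, and the 600-second default lifetime are all assumptions chosen for illustration, and a real client would probably want to honor the server's Cache-Control header rather than a fixed lifetime.

```php
<?php
// Hypothetical helper, NOT part of ml/json-ld: memoizes fetched documents per URL.
final class CachingFetcher
{
    /** @var callable function (string $url): string */
    private $fetch;
    /** @var int lifetime of a cache entry in seconds (assumed value) */
    private $ttl;
    /** @var array<string, array{body: string, expires: int}> */
    private $cache = [];

    public function __construct(callable $fetch, int $ttlSeconds = 600)
    {
        $this->fetch = $fetch;
        $this->ttl = $ttlSeconds;
    }

    // $now is injectable so the expiry logic can be exercised without waiting.
    public function get(string $url, ?int $now = null): string
    {
        $now = $now ?? time();
        if (isset($this->cache[$url]) && $this->cache[$url]['expires'] > $now) {
            return $this->cache[$url]['body']; // still fresh: no network call
        }
        $body = ($this->fetch)($url);
        $this->cache[$url] = ['body' => $body, 'expires' => $now + $this->ttl];
        return $body;
    }
}
```

Wrapping the fetch in a callable keeps the sketch independent of how the document is actually loaded; repeated requests for the same context URL within the lifetime are then served from memory.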

gabriele-carbonai commented 5 years ago

The debug trace is the same as @greggh's.

private function loadDocument($input)
{
   if (false === is_string($input)) {
      // Return as is - it has already been parsed
      return $input;
   }
   $document = $this->documentLoader->loadDocument($input);
   return $document->document;
} 

$input is not false. Maybe $document is empty, so $document->document is never returned.

If I put die() before the return, it still prints the error.

Something is happening here:

$document = $this->documentLoader->loadDocument($input);

RichardWallis commented 5 years ago

@gomonkey I concur with your conclusion that the problem is somewhere in the code that results from $this->documentLoader->loadDocument($input);.

I have no experience with the code in the JsonLD Processor, and my PHP is very rusty.

Unfortunately, without details of the HTTP request (including headers) and the HTTP status and response returned by that call, it will be exceedingly difficult to identify the cause.

I am interested that you say you have one instance operating correctly and one that fails. What is the difference between them: network, hosting, caching architecture, firewalls, etc.?

gabriele-carbonai commented 5 years ago

Everything is the same (code, server, etc.); only the customer and products change. The code on the site where it works was simply pasted from the site where it is failing today. But I am going to check; could it be an empty variable?

RichardWallis commented 5 years ago

Without an understanding of the code I am not able to predict.

All we should get at schema.org is an HTTP request which contains this header: Accept: application/ld+json

Perhaps someone with an understanding of the low-level php code could help with your diagnosis.

greggh commented 5 years ago

Hey @RichardWallis, thanks for showing up here! I just retried the code, and it's still doing the same thing. I admit it is good timing with the change/fix on the 5th, but it looks like that didn't solve it.

Did you recently change the http / https functionality? http://schema.org now is a hard 301. Was it always? That seems to be where the code is dying, right after that 301.

greggh commented 5 years ago

@RichardWallis it looks like I am getting 2 responses from the server, and both are text/html, including the one that should be JSON. At the very least, that second one for the .json file should be one of: application/ld+json, application/json. That is where the code is dying.

$http_response_header;
array (
  0 => 'HTTP/1.0 301 Moved Permanently',
  1 => 'Location: https://schema.org/',
  2 => 'X-Cloud-Trace-Context: b35637c48325fd44f8a7e63de27a549b',
  3 => 'Date: Mon, 08 Apr 2019 15:05:44 GMT',
  4 => 'Content-Type: text/html',
  5 => 'Server: Google Frontend',
  6 => 'Content-Length: 0',
  7 => 'HTTP/1.0 302 Found',
  8 => 'Content-Type: text/html; charset=utf-8',
  9 => 'Access-Control-Allow-Origin: *',
  10 => 'Location: https://schema.org/docs/jsonldcontext.json',
  11 => 'Vary: Accept, Accept-Encoding',
  12 => 'X-Cloud-Trace-Context: 543787090044a453c6bd1eaafdb08e8a',
  13 => 'Date: Mon, 08 Apr 2019 14:57:23 GMT',
  14 => 'Server: Google Frontend',
  15 => 'Content-Length: 0',
  16 => 'Cache-Control: public, max-age=600',
  17 => 'Age: 501',
  18 => 'Alt-Svc: quic=":443"; ma=2592000; v="46,44,43,39"',
  19 => 'HTTP/1.0 200 OK',
  20 => 'Access-Control-Allow-Origin: *',
  21 => 'Vary: Accept, Accept-Encoding',
  22 => 'ETag: 6c732607a47aae095f1e5d2dcfd39846',
  23 => 'Last-Modified: Mon, 08 Apr 2019 09:09:19 GMT',
  24 => 'Content-Type: text/html; charset=utf-8',
  25 => 'X-Cloud-Trace-Context: 614fc00a3b7b58f328a983ef7f384777',
  26 => 'Date: Mon, 08 Apr 2019 14:57:23 GMT',
  27 => 'Server: Google Frontend',
  28 => 'Content-Length: 139274',
  29 => 'Age: 502',
  30 => 'Cache-Control: public, max-age=600',
  31 => 'Alt-Svc: quic=":443"; ma=2592000; v="46,44,43,39"',

Because it's coming back as text/html, this is failing at line 104 in FileGetContentsLoader.

                if ('application/ld+json' === $remoteDocument->mediaType) {
                    $remoteDocument->contextUrl = null;
                } elseif (('application/json' !== $remoteDocument->mediaType) &&
                    (0 !== substr_compare($remoteDocument->mediaType, '+json', -5))) {
                    throw new JsonLdException(
                        JsonLdException::LOADING_DOCUMENT_FAILED,
                        'Invalid media type',
                        $remoteDocument->mediaType
                    );
                }
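
To check a trace like the one above without reading it by eye, something along these lines could pull out the media type of the final response and apply the same acceptance rule as the quoted check. This is only a sketch: finalMediaType and isJsonMediaType are hypothetical helper names, not functions from ml/json-ld, and it assumes the headers arrive as a flat list in which each redirect hop starts with its own HTTP/ status line, as in the dump above.

```php
<?php
// Hypothetical helper: returns the media type of the LAST response in a
// $http_response_header-style array (headers of every redirect hop are appended).
function finalMediaType(array $headers): ?string
{
    $mediaType = null;
    foreach ($headers as $line) {
        if (0 === strncmp($line, 'HTTP/', 5)) {
            // A new status line starts a new response; forget earlier hops.
            $mediaType = null;
        } elseif (0 === stripos($line, 'Content-Type:')) {
            // Drop parameters such as "; charset=utf-8".
            $value = trim(substr($line, strlen('Content-Type:')));
            $mediaType = strtolower(trim(explode(';', $value)[0]));
        }
    }
    return $mediaType;
}

// Same acceptance rule as the quoted check: application/ld+json,
// application/json, or any media type ending in "+json".
function isJsonMediaType(?string $mediaType): bool
{
    if (null === $mediaType) {
        return false;
    }
    return 'application/ld+json' === $mediaType
        || 'application/json' === $mediaType
        || '+json' === substr($mediaType, -5);
}
```

Fed the header dump above, the last hop (200 OK) reports text/html, which the acceptance rule rejects; that is consistent with the "Invalid media type" failure.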

RichardWallis commented 5 years ago

@greggh The hard 301 redirect from http://schema.org to https://schema.org has been in the code since the last version (3.4, released June 2018).

... your trace has just come in I'll check it out...

gabriele-carbonai commented 5 years ago

I no longer have the problem, and I didn't change any code. What happened?

RichardWallis commented 5 years ago

@greggh Thanks for the input; it helped me track down an obscure issue, now fixed:

wallisr$ curl -v -s -L --header "Accept: application/ld+json" http://schema.org  1> /dev/null
...
...
< HTTP/1.1 301 Moved Permanently
< Location: https://schema.org/
...
...
* Issue another request to this URL: 'https://schema.org/'
...
...
< HTTP/2 302 
...
< location: https://schema.org/docs/jsonldcontext.jsonld
...
...
* Issue another request to this URL: 'https://schema.org/docs/jsonldcontext.jsonld'
...
> GET /docs/jsonldcontext.jsonld HTTP/2
...
< content-type: application/ld+json; charset=utf-8
...
< content-length: 139274

Let me know if it is now OK for you.

~Richard

RichardWallis commented 5 years ago

@gomonkey Looks like my fix worked

greggh commented 5 years ago

@RichardWallis so glad I could help, and immensely happy you showed up in the thread. Thanks for all the help!

lanthaler commented 5 years ago

Thanks @RichardWallis. Closing this issue now. Please re-open if there are still issues.