GraphQL returns an error, causing elemental to fail to initialise in live mode (sometimes)

edwilde commented 1 year ago

When the user opens a page within the CMS the elemental area gets stuck indefinitely on the loading screen, causing the spinning Silverstripe logo of death.

Screen Shot 2022-08-18 at 1 48 50 PM

Response when requesting /admin/graphql/ is Unknown graphql type: Query instead of the expected data.

I have seen this in a few different environments when Silverstripe 4.11.2 is deployed with Silverstripe Cloud and php 8.0.

I have not seen it locally and I do not know if it can be reproduced with a defined set of steps. It is perhaps linked to many changes being made to page content, as it has typically manifested just before site launches. But that is just a hunch.

This can be resolved by running /dev/graphql/build to rebuild the schema, however this will not fix it permanently only for an indeterminate amount of time; I think the minimum time was about 1.5 days.

Apparently this can also be resolved by generating and committing the graphql cache files to the repo and deploying those, however I have not confirmed that.

Acceptance criteria

[ ] We have identified what triggers this behaviour
[x] This behaviour never occurs.

PRs

michalkleiner commented 1 year ago

Might be the difference between dev-built (dev mode) GraphQL schema and non-dev-built (test/live mode) schema where each uses different obfuscator when generating the classes. Once your schema is built in dev mode it doesn't work in test/live mode and the other way round. Also when deployed if the webserver/PHP doesn't have write access to the .graphql-generated folder to be created next to the code it would fail.

Best option is to generate the graphql schema files using the vendor/bin/sake /dev/graphql/build clear=1 flush=1 schema=default and vendor/bin/sake /dev/graphql/build clear=1 flush=1 schema=admin commands and commit the product of that to the repo for a release. That works for us every time. If you delete the .graphql and .graphql-generated folders and run vendor/bin/sake dev/build flush=1 command that should also regenerate the schema files.

The issue here that I think would be worth checking is that it seems the system doesn't use existing dev-mode generated schema when in non-dev mode and the other way round. Same issue applies when trying to visit the assets admin that also relies on graphql queries.

edwilde commented 1 year ago

@GuySartorelli this has been observed in a single server environment too (1/1/1).

Thanks @michalkleiner I guess I am hoping for a solution which doesn't require the schema files to be committed to the repo. To do this reliably when many devs are working on a project feels like it would require another automated process to be run before deployment.

michalkleiner commented 1 year ago

@edwilde can you confirm that the environment type and the generated schema class name obfuscation is the problem here?

If it is, that's one thing, another thing is to have your servers/hosting correctly configured so that the folders and generated schema files can be written during dev/build on the server (docs), and that, of course, you don't have the automatic dev/build schema generation turned off.

There's also the multi-server env consideration mentioned in the docs.

edwilde commented 1 year ago

@michalkleiner it is Silverstripe Cloud, so I would hope it is properly configured. I don't really know how to confirm whether class name obfuscation is the problem, I also currently don't have any stacks exhibiting the issue.

michalkleiner commented 1 year ago

@edwilde You'd get the same issue locally just by changing the env mode from dev to test or live.

I'm seeing the same issue in Silverstripe Cloud relating to having the schema built locally in dev mode, committed and deployed, and then not working in UAT, whereas if I build it in test mode and commit and deploy, it works in UAT.

The permissions and that the build only builds it on a single instance in a multi-server environment is a known problem in SC with multiple servers and the recommendation to us was to commit the files and deploy them rather than having them auto-build on deployment.

UndefinedOffset commented 1 year ago

This is not just related to having multiple servers or Silverstripe cloud. I've been seeing this on sites in live mode (not switching between environment types) even after just running dev/build.

emteknetnz commented 1 year ago

Just a note that may be relevant during my investigations on this - .graphql-generated/__graphql-schema.php file needs to be present for the "Unknown graphql type: Query" to show. It's likely that the .graphql-generated/Q66c1b4c7f3dc385b68a9fa903ccd016d.php (File for the Query class obfuscated with HashNameObfuscator) file is missing for whatever reason.

however this will not fix it permanently only for an indeterminate amount of time; I think the minimum time was about 1.5 days.

My theory is webserver is randomly deleting Q66c1b4c7f3dc385b68a9fa903ccd016d.php file after a period of time, but not the __graphql-schema.php file, nor the .graphql-generated dir itself. Without any access to the webserver myself though I have no way of confirming if this is actually what's happening.

Either that, or the entire .graphql-generated dir gets deleted for whatever reason (this does seems more likely than individual files), and for whatever reason, the dev only HybridObfuscator was used to re-generate the schema and filenames such as Query_66c1b4c7f3dc385b68a9fa903ccd016d are used - but then when HTTP requests HashNameObfuscator is used again, hence the files are not found. However this seems highly unlikely.

michalkleiner commented 1 year ago

I don't see why a webserver would delete any files at all.. a missing file plays more towards the multi-server situation where files generated on one machine are not available on the other ones unless committed to the repo. We probably need to distinguish between the first request schema auto-generation and the schema being committed in the codebase with the same or different obfuscator. From that perspective, I think using the HashNameObfuscator as the default everywhere unless explicitly changed e.g. for debugging is a sensible move.

GuySartorelli commented 1 year ago

The documentation both for CMS and for Silverstripe Cloud recommend committing only for special circumstances - using HashNameObfuscator as the default everywhere is a good step but it doesn't resolve the problem for people following the documented advice to just let it build on demand in most cases.

emteknetnz commented 1 year ago

I don't see why a webserver would delete any files at all.. a missing file plays more towards the multi-server situation where files generated on one machine are not available on the other ones

Agree, the "files being randomly deleted" theory is honestly a pretty poor theory.

The multi-server situation where the .graphql-generated dir is not replicating, I would expect to see a different symptom. There I would expect to see elemental work in some requests, and not in others. I would not expect it to not work at all which I believe what was observed.

Also, the .graphql-generated/__graphql-schema.php file needs to be present to get the particular "Unknown graphql type: Query" error reported. __graphql-schema.php conains a class class Types extends AbstractTypeRegistry. This doesn't align with ".graphql-generated dir not available at all"

__graphql-schema.php

class Types extends AbstractTypeRegistry
{
    protected static function getSourceDirectory(): string
    {
        return __DIR__;
    }
    // ...
    public static function Query() { return static::get('Query'); }

AbstractTypeRegistry.php

abstract class AbstractTypeRegistry
{
    public static function get(string $typename)
    {
        return static::fromCache($typename);
    }

    // ...
    protected static function fromCache(string $typename)
    {
    // ...
            $file = static::getSourceDirectory() . DIRECTORY_SEPARATOR . $obfuscatedName . '.php';
            if (file_exists($file ?? '')) {
                // if the file exists, then it won't throw the Exception below
    // ...      
            throw new Exception("Unknown graphql type: " . $typename);

This to me is what indicates that the __graphql-schema.php file is present, but that Q66c1b4c7f3dc385b68a9fa903ccd016d.php is not present. This aligns with michals observation that you can trigger this same behaviour by committing a schema generated with on dev HybridObfuscator (filename = Query_66c1b4c7f3dc385b68a9fa903ccd016d.php) to the repo, which doesn't match the file name searched for on test/live with HashNameObfuscator (filename = Q66c1b4c7f3dc385b68a9fa903ccd016d.php)

However, as mentioned above, this only makes sense for when you actually generated the schema locally and committed it to he repo. In the original report, the schema was auto-generated on test/live with HashNameObfuscator

elliot-sawyer commented 1 year ago

Just adding a data point to this:

I came across this error on the Files admin of a fresh install of 5.0.0-beta1. require section of composer.json looks like this

{
...
    "require": {
        "php": "^8",
        "silverstripe/recipe-cms": "5.0.0-beta1",
        "guzzlehttp/guzzle": "^7"
    },
...

I originally set this up to test out Docker on a fresh Ubuntu install, which I did using this project: https://github.com/sprintcube/docker-compose-lamp

Not sure if it did this fresh out of the box, but after I did a dev/build (again, probably my 3rd time) and it worked again.

My dev/build step also uses a flush= and occasionally gets followed up with a chown -R www-data:www-data public/assets command, but I may have done the same here with the .generated-graphql folder. I'd guess it's related to a filesystem permissions issue with .generated-graphql, but I haven't looked at it further since resolving it

emteknetnz commented 1 year ago

Spent some time forcefully replicating this issue on a multi-server setup (Silverstripe Cloud) where I had a script to manually delete the schema files on one instance only, though retain files such as .graphql-generated/admin/__graphql-schema.php - I simply had a filter to only delete files from the schema folder where the file length was greater than 30 characters. As the environment was using the HashObfustcator this targetted the correct files to delete.

I used a combination of graylog and echoing out the contents of /sys/devices/virtual/dmi/id/board_asset_tag to see which instance I was being served requests from.

The elemental area worked fine on the instance where I hadn't manually deleted the schema files. It replicated the issue of infinite spinner for editing an elemental page on the instance where I had deleted the files.

If I run dev/graphql/build on the instance that did have the schema, there didn't seem to be any rsync in place so the instance that had the broken schema did not get rebuild.

I only tested this locally, though if I delete the entire folder, including __graphql-schema.php, then the "fallback" will kick in and the schema will be rebuild on demand.

michalkleiner commented 1 year ago

Good work, must be both fun and frustrating at the same time debugging this. I think we came to a similar conclusion in multi-env setup from external observation of SCP amongst other devs from various agencies, the question we couldn't resolve externally was how it can happen that the files could be missing. Was it a new instance automatically spinning up due to an increased load that would not have the files, was it the first instance after a full infrastructure rebuild, was it a collision between the different obfuscators in different environments or perhaps even something else.

One solution I could think of back then was to make the fallback mechanism more robust in the way it would try to iterate over all the registered obfuscators to try to find the one being used before saying no files are generated if the .graphql-generated/admin/__graphql-schema.php was present, or perhaps we can even code that info into that file for which class was used to generate it.

GuySartorelli commented 1 year ago

I have a theory that provides one scenario where this could happen. Assume a simple setup where there are two servers. Server One is where most of the deployment stuff happens. This is the server sake dev/build is run on. Server Two is only used to handle traffic. It has the code deployed to it and either shares a db with Server One or has a db that is a read-only clone of Server One's db. There is no shared filesystem, or if there is, the .graphql-generated/ is not stored in that shared filesystem.

On first deployment, Server One has dev/build run, which generates its schema.
At some point, an HTTP request gets through to Server Two. Because there is no schema here yet, it builds it.
Developers implement a new elemental block. This gets deployed. Server One has de/build run, which re-generates its schema.
At some point, an HTTP request gets through to Server Two. If has the old schema, but has no way to know it's old so it doesn't rebuild it. The request needs schema for the new elemental block, which this server doesn't have.

This explains why this is resolved by including graphql schema in version control and including it as part of the deployment - both servers get a copy of the new schema as part of the deployment in that case.

This assumes a new elemental block class would have its own separate graphql schema file, which I haven't taken the time to validate.

For silverstripe cloud specifically, if this is the problem, using the silverstripe-cache directory as proposed in https://github.com/silverstripe/silverstripe-graphql/issues/465 would resolve it. Though we can't do that for v4 and have probably missed the window to do this for v5.

emteknetnz commented 1 year ago

I just tried something similar on Silverstripe Cloud, both instances had a freshly build graphql schema. I added and deployed a new block type and associated graphql yml so that new schema files would be created.

After deployment, both instances had the correct schema i.e. they both added the 5 new schema files that were to be expected. So in the case of Silverstripe Cloud at least, after deployment all the instances appear to have the correct schemas.

emteknetnz commented 1 year ago

I've created a PR to simply rebuild the schema if a requested file that is supposed to be there is missing https://github.com/silverstripe/silverstripe-graphql/pull/521

This still doesn't diagnose WHY this issue was occurring, though I hope that it'll at least solve this issue well enough, and possibly also fix some other future edge cases.

GuySartorelli commented 1 year ago

This has been patched now. The patch is released as 4.1.1 and 4.2.2

maxime-rainville commented 1 year ago

There's some problems left to address.

maxime-rainville commented 1 year ago

For anyone who applied the 4.1.2/4.2.2 patch, we just rolled those back because they introduced a DDOS vulnerability.

From the Packagist stats, it doesn't look like many people got around to install the vulnerable version, but we issue CVE-2023-28104 just to be on the safe side.

You can find the notice on our website at https://www.silverstripe.org/download/security-releases/cve-2023-28104

We'll have another go at this and aim to fix the issue ... "sans-DDOS" this time.

GuySartorelli commented 1 year ago

Lets try this again. Merged the PR - the new (rate limited) fix will be released with CMS 4.13

vinstah commented 10 months ago

Hi,

Client is getting 404 on {$URL}/_graphql/admin.types.graphql. Rebuilding schema using dev/task fixed but this isn't the first time, was first noticed when in asset admin, possibly (but not confirmed) after a deployment

emteknetnz commented 10 months ago

@vinstah Please raise as a new issue

GuySartorelli commented 10 months ago

FYI I asked that Vinnie add this context here, as it seems to be the same issue. A new issue would likely be closed as we already have a workaround in place for this in the latest version.

silverstripe / silverstripe-graphql

GraphQL returns an error, causing elemental to fail to initialise in live mode (sometimes) #500

Acceptance criteria

PRs