laravel / ideas

Issues board used for Laravel internals discussions.
939 stars 28 forks source link

PostgreSQL extensions behave erratically due to how Laravel sets PostgreSQL's "search_path" #916

Open themsaid opened 6 years ago

themsaid commented 6 years ago

Original post by @cbj4074

https://github.com/laravel/framework/issues/22333

Description

The author is of the opinion that Laravel manipulates PostgreSQL's search_path recklessly (though, with the best of intentions, I'm sure), given that it is possible to reference objects (e.g., tables) in other schemas in PostgreSQL.

For a primer regarding schemas, which are essentially "namespaces" (a concept that MySQL, for example, lacks), see:

https://www.postgresql.org/docs/current/static/ddl-schemas.html

The manipulation in question occurs at the following location in the Laravel Framework (Illuminate) source:

https://github.com/laravel/framework/blob/5.5/src/Illuminate/Database/Connectors/PostgresConnector.php#L95

The specific manner in which Extensions are implemented in PostgreSQL compounds the problem even further.

The manner in which Laravel manipulates the search_path has far-reaching implications for every aspect of the DBAL and ORM; anyone who understands Extensions in PostgreSQL can imagine what happens when Laravel paves-over the effective search_path any time it opens a new database Connection.

Imagine the following scenario:

The Laravel application uses a PostgreSQL database, in which TEXT data types are compared case-sensitively (this is PostgreSQL's default behavior).

A user registers with the application, using the email address JDoe@example.com.

To prevent a user's inability to authenticate using a variation in case, with respect to his email address, e.g., jdoe@example.com, the application developer makes use of PostgreSQL's case-insensitive text comparison extension, citext.

For implementation details regarding extensions in PostgreSQL, see:

https://www.postgresql.org/docs/current/static/sql-createextension.html

The case for using citext instead of LOWER() on both sides of the comparison operator, for example, is articulated clearly in the following post:

https://hashrocket.com/blog/posts/working-with-email-addresses-in-postgresql

Now, imagine that the application developer is using more than one schema in the application (after all, this capability is one of PostgreSQL's strengths over MySQL, for example). Imagine further that the developer defines Eloquent Models whose relationships extend across schema boundaries; that is, relationships are defined between models whose respective connections specify different schemas.

In this scenario, Eloquent (and Query Builder, for that matter) break-down completely with respect to how they interact with any PostgreSQL extensions that may be in use on the database in question.

More specifically, in many cases, extensions are enabled on the default schema (usually public), and as long as public is in the search_path, the database connection is able to take advantage of those extensions, regardless of the source and target objects' respective schema(s) (see Key Concepts, below, for a more detailed explanation). This is important because if Laravel paves-over PostgreSQL's effective search_path when establishing any connection (which is the current behavior), the database will perform completely differently where extensions are concerned, e.g., case-insensitive text comparisons can become case-sensitive in crucial scenarios. Imagine the havoc this could wreak in certain use-cases.

To continue with the aforementioned email address example, imagine further that a second (and malicious) user registers with the application as jdoe@example.com. Because the citext extension is not effective, PostgreSQL will now treat the first and second users' email addresses as distinct. While the users would have distinct id values, there are a number of junctures at which this behavior could pose non-trivial problems (e.g., password resets, SSO implementations, etc.).

Yes, there are other fail-safes that should be employed to prevent this type of scenario (e.g., unique constraint), but not everybody is so shrewd. More than anything, my intention is to demonstrate that Laravel's current behavior can cause legitimate problems that are extremely difficult to debug.

Key Concepts

1.) Crucially, just because the extension is enabled on a particular schema, it will not be used unless that schema is also in the effective search_path.

In other words, if the effective search path is, for example, "$user", public, the extension will not be used when processing this query:

SELECT "email" FROM "platform"."users";

2.) Equally crucial is that an extension is "accessible" by virtue of existing ("being created") in the effective search_path. This means that an extension can be enabled on the public schema, but be used to effect change on a different schema. Consider the following scenario.

$ sudo -H -u postgres /bin/sh -c "psql -d www -c 'CREATE EXTENSION IF NOT EXISTS citext;'"
CREATE EXTENSION

We did not specify a schema in which to create the extension, explicitly, so the default schema, public, is used (I believe this default schema to be specified in template1, which ships with PostgreSQL).

And even though the platform schema is not in the effective search_path, we are able to use the extension against it (of course, this requires specifying the schema prefix when addressing the table):

www=# SHOW search_path;
   search_path
-----------------
 "$user", public
(1 row)

www=# ALTER TABLE "platform"."users" ALTER COLUMN "email" TYPE citext;
ALTER TABLE

Likewise, we are able to execute queries that leverage the extension against schemas in which the extension does not exist:

www=# SELECT "email" FROM "platform"."users" WHERE "email" = 'bjohnson@example.com';
           email
----------------------------
 BJOHNSON@example.com
(1 row)

Possible Solutions

1.) Do less; don't manipulate the search_path at all, and make it the developer's responsibility to set the value appropriately for his specific use-case, whether he decides to do that in PostgreSQL, or within his Laravel application.

2.) At the very minimum, Laravel should never overwrite the search_path completely, primarily because the search_path can be configured in the PostgreSQL configuration in a way that Laravel cannot replicate easily (if at all). If anything, Laravel should only add schemas to the search_path, as needed; it should never remove schemas (and again, that includes overwriting them). A more elegant approach might be to have two separate connection properties for schema and search_path. I haven't experimented with that approach as yet, but it seems to have merit, and may solve this problem entirely.

3.) While not ideal, it did occur to me to add every schema in my database to the schema value in the PostgreSQL connection configuration, e.g.

'pgsql' => [
    'driver' => 'pgsql',
    'host' => env('DB_HOST', '127.0.0.1'),
    'port' => env('DB_PORT', '5432'),
    'database' => env('DB_DATABASE', 'forge'),
    'username' => env('DB_USERNAME', 'forge'),
    'password' => env('DB_PASSWORD', ''),
    'charset' => 'utf8',
    'prefix' => env('DB_PREFIX', ''),
    'schema' => 'public, mine', // <- This line
    'sslmode' => 'prefer',
],

but this is not a feasible solution for at least two reasons:

This causes the entire string to be quoted, which renders the search_path effectively invalid. For example, the php artisan tinker output demonstrates the problem:

>>> DB::select('show search_path');
=> [
     {#734
       +"search_path": ""public, mine"",
     },
   ]

Furthermore, the binding precludes the use of dynamic values that PostgreSQL understands; a textbook example is the default search_path value: "$user", public, which PostgreSQL evaluates as homestead, public, for example.

Conclusion

I'm fully aware that this problem affects a tiny fraction of Laravel users. I did ask myself:

How many Laravel users use PostgreSQL? And of those, how many are using more than one schema? And of those, how many are using PostgreSQL extensions? And of those, how many are using them in more than one schema? And of those, how many are not using the default schema, public?

A back-of-the-napkin calculation yields about 1 in 1,000,000, which explains why nobody else seems to have documented nor addressed this limitation in Laravel's PostgreSQL implementation.

Nonetheless, it seems that if Laravel is to be regarded as "Enterprise-capable", this limitation must be addressed.

Thank you for your time and consideration.

cbj4074 commented 6 years ago

Thanks for moving this to the appropriate location, @themsaid !

I should have a PR cooked-up pretty shortly. It looks like the nature of the required changes will be very minor in scope.

I will make every effort to maintain backward-compatibility, but that may come at the cost of a function whose name is a misnomer; that is, configureSchema() when it really should be configureSearchPath(). We could add a new function (two, actually) with a more apt name, but we would still have to change the calling reference to it (or implement some unsightly conditional logic).

In any case, I'll ensure that the PR is editable, so any prudent changes can be made.

Thanks again!

cbj4074 commented 6 years ago

It looks like I'll be able to implement this in a way that is mostly backward-compatible (see Caveats, below), and without any specific dependencies between https://github.com/laravel/laravel and https://github.com/laravel/framework .

As such, it seems to make sense to submit this PR first, and once (and if) it's merged-in, adjust the connection configuration property name from schema to searchpath.

Basically, if the user's framework version includes this change and the key name is still schema, the value will be ignored completely, in which case the effective search_path will fall-back to PostgreSQL's chosen value (likely the default, which is "$user", public).

Given that nobody has ever raised the concern that is the subject of this Issue, the vast majority of Laravel users who are using PostgreSQL are using the default (public) schema, in which case this change will have no unintended consequences.

Caveats

Potential Impact: Minor

If the database does in fact have a schema named "$user", which evaluates to forge, for example, in a default Laravel installation. In such cases, it would be possible for object name ambiguity to arise (most likely a table name), in which case PostgreSQL will throw an appropriate exception. The stop-gap solution would be to prefix the table name(s) with the schema name(s), whether in the Model definition or elsewhere, thus preventing the ambiguity.

Potential Impact: Major

If the user is relying on one or more extensions registered on a schema other than those defined in the default search_path ('"$user", public'), and specifies that schema as the connection's schema value.

Upon receiving this update, the schema value will be ignored (unless the user changes its name to searchpath), thereby causing the extension(s) to become ineffective for that connection.

As I described exhaustively in my OP, this could have grave consequences for a given use-case that relies on one or more extensions.

Accordingly, I suggest weighing the BC-factor carefully before tagging/releasing, or instead merging this into a release that has other BC-breaks, and including accompanying documentation regarding the specific nature of the change (i.e., a link to this Issue).

Other Thoughts

1.) My intention is to implement support for variable references in the search_path, such as the style that is used in PostgreSQL's default, "$user". Notice that the value must be quoted, whereas quoting is optional for static references. I don't see any immediate risk arising from this capability, given that these are core configuration values, and to modify them requires access to the effective user's environment, .env, or config/database.php. That said, I will make every effort to prevent unintended variable expansion, given that $ is, of course, of significance in PHP.

2.) In their native syntax, search_path values are comma-separated, with spaces between items. It seems to make sense to retain the same format (with the spaces being optional) when specifying their values in Laravel. Using a string (instead of an array) to define the values also is conducive to specifying the values within the .env file (which does not support array-style notation). If the intention is to allow for spaces (or any amount of whitespace, for that matter) within the declared values, at the Laravel layer, then some form of basic parsing/validation seems appropriate. To be explicit, my intention is not to change the existing if (is_array($searchPath)) { ... logic, so that it is, in effect, the user's responsibility to call implode() on the searchpath string, e.g., in config/database.php:

'searchpath' => !empty(env('DB_SEARCH_PATH')) ? explode(',', env('DB_SEARCH_PATH')) : 'public',

If that doesn't seem ideal for any reason, please let me know.

Initial PR soon to follow.

cbj4074 commented 6 years ago

It appears that PostgreSQL accepts the search_path string happily, with or without spaces, and then adds spaces to its own internal representation.

Also, the existing formatting method quotes the entire string (if the input is a string), which is not valid. For example

'schema' => 'public, mine',

yields

>>> DB::select('show search_path');
=> [
     {#735
       +"search_path": ""public,mine"",
     },
   ] 

which is not valid, and so will never be used (despite the fact that PostgreSQL does not complain). In other words, passing more than one schema in the string never worked correctly to begin with.

It seems likely that the intention was indeed to quote the entire string, to prevent any malicious or unintended SQL-related problems, but unless the user is aware that he must supply an array to use more than one schema in the search_path, it's easy to "do it wrong" and use a string, in which case PostgreSQL basically ignores the value silently. Personally, I would rather experience a SQL syntax violation than a silent failure.

Passing an array into the same function, for example

'searchpath' => ['$user', 'public', 'mine'],

works as it should, because PostgreSQL handles the parsing, and interprets the value as such:

>>> DB::select('show search_path');
=> [
     {#735
       +"search_path": ""$user", public, mine",
     },
   ]

So, at a minimum, it is necessary to remove the outer quotes when the supplied value is a string.

At a maximum, if the intention is to support "$user"-style notation, in strings, too, further work is required (otherwise, a SQL syntax violation occurs).

My opinion regarding the matter is that the user should simply change the searchpath value to an array if he needs to use "$user"-style notation. I hesitate to get into the ugly business of string-parsing and manipulation in such a sensitive context; it feels as though there is too much go awry.

In fact, I'm inclined to err on the side of caution and leave the outer quotes intact when the value is supplied as a string. Maybe it is better to have a silent failure than any risk of a SQL-injection. I'll leave that bit alone. ;)

cbj4074 commented 6 years ago

I apologize for the lengthy monologue here; more than anything, I needed to talk myself through the reasons for which I became so tripped-up.

In essence, my frustration and confusion stem directly from the fact that the existing code seems to conflate PostgreSQL's "schema" and "search_path" concepts. As tempting as it may be to write-off the conflation as a "mundane detail", it feels worthwhile to correct.

@themsaid I think it makes the most sense to close this Issue (at your convenience, of course), and I will open a new one with a more accurate subject and more relevant details.

Should I open it here (on laravel/internals), too?

Thanks in advance!

Talkless commented 3 years ago

Sorry @cbj4074 for bringing up old task, but it seems similar question.

We are currently researching different PHP frameworks to migrate to, but we have issue with "not mainstream" database usage:

So, the question is:

Thanks!

mfn commented 3 years ago

Unrelated, @cbj4074 it seems to be that this search_path issue will be fixed in 9.x via (your) https://github.com/laravel/framework/pull/35530 ?

cbj4074 commented 3 years ago

@Talkless No apology needed. :) I'm not a proponent of closing or archiving threads (on any platform), nor do I find reviving them to constitute "poor etiquette", as long as the "bump" is relevant, and in your case, it most certainly is.

To answer your questions:

  1. Yes. All Laravel configuration options may be specified at runtime (including database connections). It also bears mention that any number of connections may be specified in Laravel's database configuration, i.e., you are not limited to a single connection in the applicable configuration array. Given the use-case you describe, it sounds as though you would need to authenticate the user before knowing which PostgreSQL credentials to use (correct?), so, you might specify a single connection in the database configuration array, through which you're able to query your user database. Once the user is authenticated, you could store the relevant data in the session and then define the appropriate connection(s) on each subsequent request using middleware. There are probably other ways to implement this, but the point is that it is, in theory, possible, and not particularly complicated.
  2. Yes, you can certainly omit the search_path on the connection, in which case PostgreSQL's underlying value will be used. However, there are some small number of methods in which the search_path is inferred if not set on the connection and not specified explicitly when calling the method, specifically because those cases require a value for it. As a specific example, Laravel's PostgreSQL Builder implementation defines a hasTable() method, which queries PostgreSQL's information_schema table to determine whether a specific table exists. Such a query is infeasible without knowing which schema value to use in the WHERE clause. Accordingly, sensible assumptions are made in these cases. While I'm not sure this will affect your use-case, and even if it does, you should be able to specify the required values explicitly at call-time, you may examine the relevant logic in https://github.com/laravel/framework/blob/ce57929931f877f03a8428498654591e890f55f8/src/Illuminate/Database/Schema/PostgresBuilder.php#L161-L223 .
  3. Yes. You can set the search_path before executing a query, either by setting the value on the connection to be used before connecting, or you can set it on an established connection using a raw statement (which enables you to execute any arbitrary SQL statement against PostgreSQL, including SET search_path ...).

Ultimately, as of Laravel 9.x, I don't see any deal-breakers, given the use-case you describe.

cbj4074 commented 3 years ago

@mfn Yes, the search_path handling is fixed in 9.x, but only partly via https://github.com/laravel/framework/pull/35530 . There are two additional PRs that fix lingering issues that arose subsequently: https://github.com/laravel/framework/pull/35567 and https://github.com/laravel/framework/pull/35588 .

cbj4074 commented 3 years ago

@themsaid You can feel free to close this, given that it has been implemented/fixed in Laravel 9.x.

Talkless commented 3 years ago

Ultimately, as of Laravel 9.x, I don't see any deal-breakers, given the use-case you describe.

Looks great, thanks @cbj4074 for detailed response!