Closed xzyfer closed 3 years ago
This is also impacting us. Downgrading to 1.9.0 fixes it. We don't use inline SSL certificates (the feature in 1.10).
We also ran into this bug. And indeed v1.9.0 seems to not surface the same issue.
The bug seems to be caused in the ssl
method when upgrading a connection.
For us this happened in the cancel
method in watchCancel
, but the same apparently happens in the open
method on a Connector.
The ssl
method calls the ssl
function, which deletes a key from the passed values
map with delete(o, "sslinline")
.When this happens concurrently, i.e. on multiple connections at the same time, Go panics because of concurrent writes to the map. The same should happen with a concurrent map read and write.
I think (but I might be wrong on this) that this bug already somewhat existed before, for instance with the delete(o, "sslrootcert")
, but that it was just way less likely to cause a panic since it is first checked whether the key exists with if sslrootcert, ok := o["sslrootcert"]; ok {
.
I believe the problem causing the bug is that the values
map is created once in NewConnector
, and then subsequently set as opts
in all created connections. Even though the map is set by value, the underlying data is the same (it only copies the reference to the map). That would not be a problem if the map is only read from, but it is a problem if the map is also written to since in Go maps are not safe for concurrent use (see https://golang.org/doc/faq#atomic_maps).
Some possible solutions for this bug:
values
type a struct that contains the map[string]string
but also a sync.RWMutex
. To read from the map the read lock should be taken, while the write lock should be taken when writing or deleting. This is also the solution given at https://blog.golang.org/maps in the "Concurrency" section. This can have some performance impact.values
map. There appear to be only a few writes/deletes to the map in the ssl
, sslClientCertificates
and sslCertificateAuthority
functions. Removing those would make it so the map is only read from which is safe to do concurrently. This might however cause side effects if the keys are removed from the map for a reason, as is implied by this comment above one of the deletes: // This pseudo-parameter is not recognized by the PostgreSQL server, so let's delete it after use.
On a side note:
We don't currently use it, but would the in v1.10.0 added sslinline functionality even work for connections created after the first one?
It seems to me that since the "sslcert" and "sslkey" key values are set to "", and the "sslinline" key is deleted from the map after first use, the upgrade function that is returned will only contain a valid certificate the first time it is returned. And since that returned upgrade function is only used to upgrade one connection in the ssl
method and is then discarded, the connections created afterwards might have trouble upgrading to ssl.
Bjorn's analysis appears to be spot on. This appears to be a regression in v1.10.0 via #818 by @eirslett.
Sorry for the bug! 😢
Just a thought about this one:
And since that returned upgrade function is only used to upgrade one connection in the ssl method and is then discarded, the connections created afterwards might have trouble upgrading to ssl.
Doesn't that contradict this?
the values map is created once in NewConnector, and then subsequently set as opts in all created connections.
Anyways, times have changed, I'm a full time frontend developer now. I haven't really touched PostgreSQL in months. My original PR was written 3 years ago - it was merged two months ago, but all I really did was to rebase my patch on top of the master branch.
@bjornouderoelink seems like you have quite a deep understanding of this bug, would it be possible for you to make a pull request with a bug fix? Unfortunately, I'm afraid that I don't have time to get back into PostgreSQL land, write a bugfix and then make sure I don't introduce a new bug. (Maybe, in the future, will I once again work with databases, but nowadays it's all about React for me.)
If it's too hard to fix the bug, I suppose another alternative would be to revert my patch. But then again, it looks like other people (like @umran and @binlab) are already relying on it, so that would be unfortunate.
It indeed looks like it contradicts, but I do believe it is correct.
The sql.Open
function returns a sql.DB
, which actually has no real connection to the DB (yet), instead it maintains a connection pool which is empty at first. To actually open new connections, sql.DB
contains a Connector
.
For pq this is the pq.Connector
returned from NewConnector
, which contains the opts
values map. When a new connection is needed (e.g. when the connection pool is empty), one is created using the Connect
method on the Connector
. That method calls the open
method which puts the opts
values map into the opts
field of the returned conn
, which is the actual connection that will be used.
In Go "maps hold references to an underlying data structure" (see https://golang.org/doc/effective_go#maps), so when data is written or deleted in the opts
of a conn
, the opts
for any other conn
and the Connector
will change as well. So when one connection upgrades itself to SSL, which writes/deletes data from the values map, those map fields will also change in all other (future) connections.
That is what seems to cause the concurrent map writes when multiple connections do that at the same time. And that is also why I think that only the first connection would be able to upgrade to SSL when using the added inline SSL functionality, since connections afterwards would not be able to read the "sslinline" from opts
anymore.
As for a solution:
While I think the two solutions in my previous comment would fix the concurrent map writes panic, they would not fix the inline SSL functionality as well.
I thought the easiest solution to solve both would be to create a new map with a copy of all data for each connection, so the maps do not reference the same underlying data. The values map in the Connector
is then the source, but changes made to the map in each conn
won't affect other connections. That would make it so that concurrent writes from different connections won't happen, and each connection can upgrade to SSL since they will all be able to read "sslinline" before it is deleted from the map in that connection.
I was able to reproduce the panic quite consistently before. I could not reproduce it anymore after applying that fix, which would suggests it indeed solves the bug. Though with concurrency this is always tricky because the bug could only occur with the right timings. Also, I can't really test whether the inline SSL functionality works now as we don't use it ourselves and I don't know how I would go about testing that. I can create a PR for this solution.
There is still one problem with the inline SSL functionality though that I think is not fixed entirely with my current solution.
This is the case when cancel
is called on a connection that has upgraded to SSL. The upgrade of that connection will have deleted the "sslinline" key from the map in that conn, so if then in cancel
a new values map copy is created to set as opts
in the newly created connection, it does not contain the "sslinline" key.
If my assessment is correct then this would mean that the connection which is only used in that method cannot upgrade itself to SSL. And I currently can't think of a good solution for that.
Is it possible to solve this using two different maps? One with the sslinline still intact, and another one where it's deleted? (And otherwise, the two maps would be identical)
Well, yes. That might be a solution.
I think it would be possible to add an originalOpts
field to a conn
, which contains the same data as the original unchanged opts field.
This could then even be a map that references the same underlying data as the map in the Connector
to save space. Then that data should be copied before using it for the new connection in cancel
.
I am wondering though, and maybe you know/remember this @eirslett, the comment above the delete
mentions that the "sslinline" key is a pseudo-parameter and is therefore deleted.
However, I have performed a quick search for the usage of opts
, but I don't really see any code in pq where the it is used to send something to Postgres.
If that is indeed the case, would the delete(o, "sslinline")
and writes to "sslcert" and "sslkey" with "" even be necessary?
That still leaves the delete(o, "sslrootcert")
requiring the change above, but then the inline SSL functionality should work without adding the extra map.
I think the reason I added the delete, was to prevent the private key from being passed around to wherever it's supposed to not go - whether that was to other parts of the program itself (maybe inadvertently in debug logs), or over TCP to the actual PostgreSQL server. Maybe I realized it was sent over the wire to the PostgreSQL server, and the connection was dropped because of the unrecognized connection parameter.
The code I wrote says
// This pseudo-parameter is not recognized by the PostgreSQL server, so let's delete it after use.
delete(o, "sslinline")
which I believe was an actual problem. (PostgreSQL dropping the connection)
And
// Clear out these params, in case they were to be sent to the PostgreSQL server by mistake
o["sslcert"] = ""
o["sslkey"] = ""
which was probably my paranoia. (Prevent secrets from leaking)
I agree that it would not be great if that data is sent to places it should not go. Especially so if that drops the connection to the database.
For now though I'm a bit hesitant to add the originalOpts
solution as it would solve the issue by adding more code, while it might be possible to solve the issue by removing the delete
if that turns out to not really do anything.
~And because I can't seem to find any code that sends those keys from the opts
~, that might be a more elegant solution performance and maintainability wise.
Maybe @mjibson or any other maintainer can weigh in on this? Or someone that actively uses the inline SSL functionality?
Otherwise I will probably open a PR later this week with the originalOpts
solution you proposed, and then see what happens from there.
Edit: nevermind, I found something.
In the startup
method on a connection, the isDriverSetting
is called, which returns true if the key is only meant for the driver, which will skip sending it to Postgres. The "sslcert" and "sslkey" keys are in here and return true.
The "sslinline" key however is not in there, and would by default return false, thus it gets sent to Posgres.
In that case, I think we could add the "sslinline" key there and make it return true, so it is not sent to Postgres.
That would then allow the delete
to be removed.
I don't care at all. If there's a PR that passes tests and ya'll think it's good I'll hit merge. That's pretty much what I do nowadays. Mention me in a PR when it's ready.
Oh, just straight up reverting the commit is also fine. I have no preference.
I opened a PR for this issue. With the fix, I was not able to reproduce the concurrent map writes anymore. As for the sslinline functionality, adding the sslinline option to the connection string does not cause the connection to drop. Maybe someone who uses the sslinline functionality can test whether that works as expected.
The CI tests show a failure, however in the details all tests succeed except for the goimports one. It seems however that that one fails for all PRs that have been opened lately, and only in the CI tests that use go:master. So I think the CI tests showing a failure has to do with something between go:master and the goimports tool, since the fix did not touch any imports and there are also no unused imports in the changed files.
I don't think that copying the values is necessary. Perhaps values
could be extended with two extra fields which would allow the delete
to happen much earlier and thereby avoiding the concurrent modification. That approach would also solve a suboptimal situation that is currently present in #1033.
I agree that copying the values
might not be an optimal solution because of the (slightly) increased memory usage for each new connection.
The reason I left the values
copy in the PR, even though the writes for sslinline are removed now, is because there seem to be more writes that could maybe happen concurrently such as the delete(o, "sslrootcert")
and also o["password"] = split[4]
in handlePgPass
which I felt I did not know enough about to remove. At the same time the copy also prevents potential future concurrency bugs where the map would be read/written when another connection writes to it.
I don't really know which two fields you would want add to the values
or what exactly you mean with performing the delete
earlier, since the delete(o, "sslinline")
is removed now. But if you know a way to prevent the concurrency issues while not requiring the copy, that would be nice. Maybe you can create a PR for that?
After upgrading from 1.9.0 to 1.10.0 we started seeing the following panic.
There have been issues about this which have been related to SSL options. Since 1.10 touched the SSL space it's possible there has been a regression.