lib / pq

Pure Go Postgres driver for database/sql
https://pkg.go.dev/github.com/lib/pq
MIT License
8.89k stars 906 forks source link

Concurrent map writes #1031

Closed xzyfer closed 3 years ago

xzyfer commented 3 years ago

After upgrading from 1.9.0 to 1.10.0 we started seeing the following panic.

fatal error: concurrent map writes
goroutine 52636306 [running]:
runtime.throw(0xb54c60, 0x15
    /usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc0015aa8c8 sp=0xc0015aa898 pc=0x4382f2
runtime.mapdelete_faststr(0xaa6d60, 0xc0004014a0, 0xb4c4a1, 0x9
    /usr/local/go/src/runtime/map_faststr.go:306 +0x3de fp=0xc0015aa930 sp=0xc0015aa8c8 pc=0x414d1e
github.com/lib/pq.ssl(0xc0004014a0, 0x1000000000011, 0x1, 0x3
    /go/pkg/mod/github.com/lib/pq@v1.10.0/ssl.go:63 +0x178 fp=0xc0015aa9a0 sp=0xc0015aa930 pc=0x9c3c18
github.com/lib/pq.(*conn).ssl(0xc001f922c0, 0xc0004014a0, 0xc15678, 0xc0003346c0
    /go/pkg/mod/github.com/lib/pq@v1.10.0/conn.go:1064 +0x45 fp=0xc0015aaa40 sp=0xc0015aa9a0 pc=0x9b06c5
github.com/lib/pq.(*Connector).open(0xc00009c108, 0xc1a6a8, 0xc0024ebb60, 0xc001f922c0, 0x0, 0x0
    /go/pkg/mod/github.com/lib/pq@v1.10.0/conn.go:321 +0x245 fp=0xc0015aab48 sp=0xc0015aaa40 pc=0x9aad45
time.Sleep(0x12a05f200
    /usr/local/go/src/runtime/time.go:193 +0xd2

There have been issues about this which have been related to SSL options. Since 1.10 touched the SSL space it's possible there has been a regression.

jakedt commented 3 years ago

This is also impacting us. Downgrading to 1.9.0 fixes it. We don't use inline SSL certificates (the feature in 1.10).

bjornouderoelink commented 3 years ago

We also ran into this bug. And indeed v1.9.0 seems to not surface the same issue.

The bug seems to be caused in the ssl method when upgrading a connection. For us this happened in the cancel method in watchCancel, but the same apparently happens in the open method on a Connector. The ssl method calls the ssl function, which deletes a key from the passed values map with delete(o, "sslinline").When this happens concurrently, i.e. on multiple connections at the same time, Go panics because of concurrent writes to the map. The same should happen with a concurrent map read and write.

I think (but I might be wrong on this) that this bug already somewhat existed before, for instance with the delete(o, "sslrootcert"), but that it was just way less likely to cause a panic since it is first checked whether the key exists with if sslrootcert, ok := o["sslrootcert"]; ok {.

I believe the problem causing the bug is that the values map is created once in NewConnector, and then subsequently set as opts in all created connections. Even though the map is set by value, the underlying data is the same (it only copies the reference to the map). That would not be a problem if the map is only read from, but it is a problem if the map is also written to since in Go maps are not safe for concurrent use (see https://golang.org/doc/faq#atomic_maps).

Some possible solutions for this bug:

On a side note: We don't currently use it, but would the in v1.10.0 added sslinline functionality even work for connections created after the first one? It seems to me that since the "sslcert" and "sslkey" key values are set to "", and the "sslinline" key is deleted from the map after first use, the upgrade function that is returned will only contain a valid certificate the first time it is returned. And since that returned upgrade function is only used to upgrade one connection in the ssl method and is then discarded, the connections created afterwards might have trouble upgrading to ssl.

Lekensteyn commented 3 years ago

Bjorn's analysis appears to be spot on. This appears to be a regression in v1.10.0 via #818 by @eirslett.

eirslett commented 3 years ago

Sorry for the bug! 😢

Just a thought about this one:

And since that returned upgrade function is only used to upgrade one connection in the ssl method and is then discarded, the connections created afterwards might have trouble upgrading to ssl.

Doesn't that contradict this?

the values map is created once in NewConnector, and then subsequently set as opts in all created connections.

Anyways, times have changed, I'm a full time frontend developer now. I haven't really touched PostgreSQL in months. My original PR was written 3 years ago - it was merged two months ago, but all I really did was to rebase my patch on top of the master branch.

@bjornouderoelink seems like you have quite a deep understanding of this bug, would it be possible for you to make a pull request with a bug fix? Unfortunately, I'm afraid that I don't have time to get back into PostgreSQL land, write a bugfix and then make sure I don't introduce a new bug. (Maybe, in the future, will I once again work with databases, but nowadays it's all about React for me.)

If it's too hard to fix the bug, I suppose another alternative would be to revert my patch. But then again, it looks like other people (like @umran and @binlab) are already relying on it, so that would be unfortunate.

bjornouderoelink commented 3 years ago

It indeed looks like it contradicts, but I do believe it is correct. The sql.Open function returns a sql.DB, which actually has no real connection to the DB (yet), instead it maintains a connection pool which is empty at first. To actually open new connections, sql.DB contains a Connector. For pq this is the pq.Connector returned from NewConnector, which contains the opts values map. When a new connection is needed (e.g. when the connection pool is empty), one is created using the Connect method on the Connector. That method calls the open method which puts the opts values map into the opts field of the returned conn, which is the actual connection that will be used. In Go "maps hold references to an underlying data structure" (see https://golang.org/doc/effective_go#maps), so when data is written or deleted in the opts of a conn, the opts for any other conn and the Connector will change as well. So when one connection upgrades itself to SSL, which writes/deletes data from the values map, those map fields will also change in all other (future) connections. That is what seems to cause the concurrent map writes when multiple connections do that at the same time. And that is also why I think that only the first connection would be able to upgrade to SSL when using the added inline SSL functionality, since connections afterwards would not be able to read the "sslinline" from opts anymore.

As for a solution: While I think the two solutions in my previous comment would fix the concurrent map writes panic, they would not fix the inline SSL functionality as well. I thought the easiest solution to solve both would be to create a new map with a copy of all data for each connection, so the maps do not reference the same underlying data. The values map in the Connector is then the source, but changes made to the map in each conn won't affect other connections. That would make it so that concurrent writes from different connections won't happen, and each connection can upgrade to SSL since they will all be able to read "sslinline" before it is deleted from the map in that connection.

I was able to reproduce the panic quite consistently before. I could not reproduce it anymore after applying that fix, which would suggests it indeed solves the bug. Though with concurrency this is always tricky because the bug could only occur with the right timings. Also, I can't really test whether the inline SSL functionality works now as we don't use it ourselves and I don't know how I would go about testing that. I can create a PR for this solution.

There is still one problem with the inline SSL functionality though that I think is not fixed entirely with my current solution. This is the case when cancel is called on a connection that has upgraded to SSL. The upgrade of that connection will have deleted the "sslinline" key from the map in that conn, so if then in cancel a new values map copy is created to set as opts in the newly created connection, it does not contain the "sslinline" key. If my assessment is correct then this would mean that the connection which is only used in that method cannot upgrade itself to SSL. And I currently can't think of a good solution for that.

eirslett commented 3 years ago

Is it possible to solve this using two different maps? One with the sslinline still intact, and another one where it's deleted? (And otherwise, the two maps would be identical)

bjornouderoelink commented 3 years ago

Well, yes. That might be a solution. I think it would be possible to add an originalOpts field to a conn, which contains the same data as the original unchanged opts field. This could then even be a map that references the same underlying data as the map in the Connector to save space. Then that data should be copied before using it for the new connection in cancel.

I am wondering though, and maybe you know/remember this @eirslett, the comment above the delete mentions that the "sslinline" key is a pseudo-parameter and is therefore deleted. However, I have performed a quick search for the usage of opts, but I don't really see any code in pq where the it is used to send something to Postgres. If that is indeed the case, would the delete(o, "sslinline") and writes to "sslcert" and "sslkey" with "" even be necessary? That still leaves the delete(o, "sslrootcert") requiring the change above, but then the inline SSL functionality should work without adding the extra map.

eirslett commented 3 years ago

I think the reason I added the delete, was to prevent the private key from being passed around to wherever it's supposed to not go - whether that was to other parts of the program itself (maybe inadvertently in debug logs), or over TCP to the actual PostgreSQL server. Maybe I realized it was sent over the wire to the PostgreSQL server, and the connection was dropped because of the unrecognized connection parameter.

The code I wrote says

// This pseudo-parameter is not recognized by the PostgreSQL server, so let's delete it after use.
delete(o, "sslinline")

which I believe was an actual problem. (PostgreSQL dropping the connection)

And

// Clear out these params, in case they were to be sent to the PostgreSQL server by mistake
o["sslcert"] = ""
o["sslkey"] = ""

which was probably my paranoia. (Prevent secrets from leaking)

bjornouderoelink commented 3 years ago

I agree that it would not be great if that data is sent to places it should not go. Especially so if that drops the connection to the database.

For now though I'm a bit hesitant to add the originalOpts solution as it would solve the issue by adding more code, while it might be possible to solve the issue by removing the delete if that turns out to not really do anything. ~And because I can't seem to find any code that sends those keys from the opts~, that might be a more elegant solution performance and maintainability wise.

Maybe @mjibson or any other maintainer can weigh in on this? Or someone that actively uses the inline SSL functionality? Otherwise I will probably open a PR later this week with the originalOpts solution you proposed, and then see what happens from there.

Edit: nevermind, I found something. In the startup method on a connection, the isDriverSetting is called, which returns true if the key is only meant for the driver, which will skip sending it to Postgres. The "sslcert" and "sslkey" keys are in here and return true. The "sslinline" key however is not in there, and would by default return false, thus it gets sent to Posgres.

In that case, I think we could add the "sslinline" key there and make it return true, so it is not sent to Postgres. That would then allow the delete to be removed.

maddyblue commented 3 years ago

I don't care at all. If there's a PR that passes tests and ya'll think it's good I'll hit merge. That's pretty much what I do nowadays. Mention me in a PR when it's ready.

maddyblue commented 3 years ago

Oh, just straight up reverting the commit is also fine. I have no preference.

bjornouderoelink commented 3 years ago

I opened a PR for this issue. With the fix, I was not able to reproduce the concurrent map writes anymore. As for the sslinline functionality, adding the sslinline option to the connection string does not cause the connection to drop. Maybe someone who uses the sslinline functionality can test whether that works as expected.

The CI tests show a failure, however in the details all tests succeed except for the goimports one. It seems however that that one fails for all PRs that have been opened lately, and only in the CI tests that use go:master. So I think the CI tests showing a failure has to do with something between go:master and the goimports tool, since the fix did not touch any imports and there are also no unused imports in the changed files.

Lekensteyn commented 3 years ago

I don't think that copying the values is necessary. Perhaps values could be extended with two extra fields which would allow the delete to happen much earlier and thereby avoiding the concurrent modification. That approach would also solve a suboptimal situation that is currently present in #1033.

bjornouderoelink commented 3 years ago

I agree that copying the values might not be an optimal solution because of the (slightly) increased memory usage for each new connection. The reason I left the values copy in the PR, even though the writes for sslinline are removed now, is because there seem to be more writes that could maybe happen concurrently such as the delete(o, "sslrootcert") and also o["password"] = split[4] in handlePgPass which I felt I did not know enough about to remove. At the same time the copy also prevents potential future concurrency bugs where the map would be read/written when another connection writes to it.

I don't really know which two fields you would want add to the values or what exactly you mean with performing the delete earlier, since the delete(o, "sslinline") is removed now. But if you know a way to prevent the concurrency issues while not requiring the copy, that would be nice. Maybe you can create a PR for that?