go-spatial / tegola

Tegola is a Mapbox Vector Tile server written in Go
http://tegola.io/
MIT License
1.28k stars 196 forks source link

chore: update pgx to v4 #820

Closed iwpnd closed 2 years ago

iwpnd commented 2 years ago

Hey, as discussed I took a look at the pgx upgrade you wanted to do. It involved some refactoring as you expected.

Some things that took me a while were:

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 2ad6bcf46-PR-820


Changes Missing Coverage Covered Lines Changed/Added Lines %
atlas/map.go 0 1 0.0%
provider/postgis/util.go 14 20 70.0%
server/handle_map_layer_zxy.go 1 7 14.29%
provider/postgis/postgis.go 79 114 69.3%
<!-- Total: 94 142 66.2% -->
Files with Coverage Reduction New Missed Lines %
provider/postgis/postgis.go 4 56.11%
<!-- Total: 4 -->
Totals Coverage Status
Change from base Build b55bca432: 0.06%
Covered Lines: 5484
Relevant Lines: 12131

💛 - Coveralls
ARolek commented 2 years ago

Well, this is incredible! I will get a review done this week. Thank you for the contribution!!!

ARolek commented 2 years ago

@iwpnd can you rebase the v0.15.x branch onto this one? I updated our CI so it would build some test binaries on PRs. That way if someone wants to help with QA they don't need to build from source.

iwpnd commented 2 years ago

How do you do local QA @ARolek? Maybe something I can help you with?

ARolek commented 2 years ago

@iwpnd apologies for the slow response here. I had to vacate my family from the Marshall Fire. Our home survived but over 1000 homes are gone. Truly a sad situation.

For local QA I'm planning on spinning up Postgres 13 and Postgres 14 with some datasets to make sure everything works as expected. Did you do any of this type of testing locally?

iwpnd commented 2 years ago

@iwpnd apologies for the slow response here. I had to vacate my family from the Marshall Fire. Our home survived but over 1000 homes are gone. Truly a sad situation.

Don't apologize for that. Glad to hear you and your family are okay.

For local QA I'm planning on spinning up Postgres 13 and Postgres 14 with some datasets to make sure everything works as expected. Did you do any of this type of testing locally?

Spun up a pg13/PostGIS3.1 instance with some of our data and it worked. Wouldn't call it testing yet though. No TLS connection or anything involved. When I'm back from vacation I thought about forking an image with the change and run it on our staging cluster with some more live data.

ARolek commented 2 years ago

Sounds good! I will report my findings as well once I get back into the swing of things. I appreciate your contributions very much. Enjoy yoru trip!

ARolek commented 2 years ago

I have successfully tested this PR against Postgres 14 using the Natural Earth dataset. Looking good! I still need to test the TLS support, but no issues encountered thus far.

iwpnd commented 2 years ago

I'll be forking the image today and will try to connect to our pg13.4/3.1 cluster using TLS. I'll keep you posted here.

update: no issues on the TLS connection itself, but I'm getting quite some amount of these:

2022-01-07 16:11:03 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled
ARolek commented 2 years ago

@iwpnd I just saw your update and doing some additional testing I'm hitting the same issue. I found the fix. We need to change our error statements from:

return fmt.Errorf("error replacing layer tokens for layer (%v) SQL (%v): %v", layer, sql, err)

To

return fmt.Errorf("error replacing layer tokens for layer (%v) SQL (%v): %w", layer, sql, err)

Note the change of %v -> %w. This wraps the error check in atlas/map.go fails to identify the context.Canceled error:

if err != nil {
    switch {
    case errors.Is(err, context.Canceled):
        // Do nothing if we were cancelled.

    default:
        z, x, y := tile.ZXY()
        // TODO (arolek): should we return an error to the response or just log the error?
        // we can't just write to the response as the waitgroup is going to write to the response as well
        log.Printf("err fetching tile (z: %v, x: %v, y: %v) features: %v", z, x, y, err)
    }
    return
}

Can you please update the errors in the postgis package to use wrapping?

iwpnd commented 2 years ago

While this is not the error that is raised on my end, I see what you mean and will address it.

My error occurs in atlas/handle_map_layer_zxy.go and following your example this:

    if err != nil {
        switch err {
        case context.Canceled:
            // TODO: add debug logs
            return
        default:
            errMsg := fmt.Sprintf("error marshalling tile: %v", err)
            log.Error(errMsg)
            http.Error(w, errMsg, http.StatusInternalServerError)
            return
        }
    }

should actually be:

    if err != nil {
        switch {
        case errors.Is(err, context.Canceled):
            // TODO: add debug logs
            return
        default:
            errMsg := fmt.Sprintf("error marshalling tile: %v", err)
            log.Error(errMsg)
            http.Error(w, errMsg, http.StatusInternalServerError)
            return
        }
    }

right? This solved the issue.

Update 1:

Now I'm left with occaisonal:

2022-01-08 11:41:34 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=xyz database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled)

:|

Update 2: PostgreSQL logs:

database          | 2022-01-08 11:20:15.496 UTC [676] ERROR:  canceling statement due to user request
database          | 2022-01-08 11:20:15.496 UTC [676] STATEMENT:  SELECT ((SELECT ST_AsMVT(q,'all',4096,'geom','gid') AS data FROM (    SELECT ST_AsMVTGeom(geom, ST_MakeEnvelope(13.18359374816483,52.268157368527206,13.359374998140362,52.37559917141592,4326)) as geom,gid,id,name,description,type,ref_id as refId,rootzone_id as rootZoneId,start_at as startAt,end_at as endAt,created_at as createdAt,updated_at as updatedAt FROM geometries WHERE type in ('business', 'policy') AND geom && ST_MakeEnvelope(13.18359374816483,52.268157368527206,13.359374998140362,52.37559917141592,4326)) AS q)) AS data

for those cases

Update 3:

Executing a query by not passing the ctx gets rid of the error, but I do not yet have the necessary understanding as to why and its implication.

rows, err := p.pool.Query(context.Background(), sql)
ARolek commented 2 years ago

Executing a query by not passing the ctx gets rid of the error, but I do not yet have the necessary understanding as to why and its implication.

Can you log the error type so we can see what pgx is returning?

log.Printf("error is of type %T, err: %v", err, err)

We might just need to add an additional check if pgx is not wrapping the context.Canceled error correctly.

iwpnd commented 2 years ago

Hey @ARolek, sorry for the late reply, I'm back on the grind after vacation.

2022-01-11 09:22:13 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled) (*pgconn.connectError)
gdey commented 2 years ago

Hey @ARolek, sorry for the late reply, I'm back on the grind after vacation.

2022-01-11 09:22:13 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled) (*pgconn.connectError)

Looking at this you may need to unwrap the error a bit more. It looks like the connectError properly has an Unwrap method, so one of the errors returned by the error being cased here is not wrapping the context.Cancelled error correctly. I cannot dig further as DialFunc is dynamic to the configuration.

ARolek commented 2 years ago

I also found this related issue: https://github.com/jackc/pgx/issues/933. I think what @gdey is saying might be correct, we're going to need to unwrap the errors from the pgx client and check if they're context canceled errors to catch this. Seems not ideal. If we don't handle it, I think we're going to see a lot of these context canceled errors in the logs.

ARolek commented 2 years ago

I made a local update to wrap the errors in the messages with %w instead of %v. I'm still seeing a bunch of error logs though:

map.go:330: err fetching tile (z: 7, x: 23, y: 47) features: error running layer (ne_10m_land) SQL (SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_10m_land WHERE wkb_geometry && ST_MakeEnvelope(-1.2841420750122068e+07,5.00448511519043e+06,-1.2518550742690431e+07,5.32735512262207e+06,3857)): failed to connect to `host=<redacted> user=<redacted> database=natural_earth`: dial error (dial tcp 192.168.5.5:5432: operation was canceled)

"Operation was canceled" does not seem to be a context.Canceled error though. I'm also seeing:

2022/01/12 15:20:02 map.go:330: err fetching tile (z: 9, x: 86, y: 189) features: error running layer (ne_10m_admin_1_label_points) SQL (SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, scalerank FROM ne_10m_admin_1_label_points WHERE wkb_geometry && ST_MakeEnvelope(-1.3307380874483643e+07,5.164697126453858e+06,-1.3226663372625733e+07,5.245414628311767e+06,3857)): failed to connect to `host=<redacted> user=<redacted> database=natural_earth`: failed SASL auth (write tcp 192.168.2.92:60731-><redacted>:5432: i/o timeout)

That does not happen on startup, but I believe as the pool is allocating more connections this error is showing up for each new connection being added to the pool. I'm still investigating but want to share my notes.

iwpnd commented 2 years ago

https://github.com/jackc/pgx/issues/1109

Apparently we need to check rows.Err() after rows.Next(), maybe a pointer. Because:

Not all errors can be detected when Query returns. You must read the results of the query and then check rows.Err() after that.

Regarding the TLS connection. There's been a major change in v4 as to how fallback connections and TLS is handled. As I noted in my initial comment, I only skip the verification in most cases.

ARolek commented 2 years ago

Apparently we need to check rows.Err() after rows.Next(), maybe a pointer. Because:

I have reviewed the code and we are doing this already. It's usually the last statement in the method:

return rows.Err()
iwpnd commented 2 years ago

Good morning @ARolek! I took a look at your changes and they were not sufficient to quiet the errors. Both the connection and context canceled was spammed continuously:

2022-01-14 08:24:53 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled
2022-01-14 08:24:53 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled)
2022-01-14 08:24:57 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled

I took your advice and did two things. First I made sure that i handle the error in dbconfig.AfterConnect properly. Next up, I took a look at the switch statement in handle_map_layer_xyz where the error was originating and use the same logic that you did in atlas/map.go which quieted the errors entirely. Can you please take a look at it too?

Why exactly is the context canceled here anyways, and what does it entail?

ARolek commented 2 years ago

Why exactly is the context canceled here anyways, and what does it entail?

Good question. So context Canceleing is really important in this codebase as it accounts for the situation when users pan the map before a tile has been fully fetched, encoded and returned to the client. Without proper context handling, we're processing requests that are no longer needed.