google / go-cloud

The Go Cloud Development Kit (Go CDK): A library and tools for open cloud development in Go.
https://gocloud.dev/
Apache License 2.0
9.53k stars 809 forks source link

docstore/awsdynamodb panics on iter.Next() #3405

Open thisisaaronland opened 6 months ago

thisisaaronland commented 6 months ago

Describe the bug

Calling iter.Next on a docstore/awsdynamodb derived Query instance panics with an "index out of range" error along the lines of:

                panic: runtime error: index out of range [0] with length 0

                goroutine 98 [running]:
                gocloud.dev/docstore/awsdynamodb.(*documentIterator).Next(0x14000056050, {0x101ba36b0?, 0x10258bb40?}, {{0x101b386e0, 0x1400019bea0}, 0x0, {0x101b454e0, 0x1400019bea0, 0x199},\
 {0x140004fc808, ...}})
                        /usr/local/sfomuseum/go-activitypub/vendor/gocloud.dev/docstore/awsdynamodb/query.go:492 +0x260

The Query itself has two "Where" statements bracketing a date range:

https://github.com/sfomuseum/go-activitypub/blob/main/accounts_database_docstore.go#L45-L69

Curiously, this problem only seems to affect one table (accounts). The same code (essentially) run against a different table (deliveries) does not panic:

https://github.com/sfomuseum/go-activitypub/blob/main/deliveries_database_docstore.go#L46-L58

To Reproduce

The code in question is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/counts-for-date/main.go

Specifically:

https://github.com/sfomuseum/go-activitypub/blob/main/stats/counts.go#L44

Which is run like this:

go run cmd/counts-for-date/main.go \
   -accounts-database-uri 'awsdynamodb://accounts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -blocks-database-uri 'awsdynamodb://blocks?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -boosts-database-uri 'awsdynamodb://boosts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -deliveries-database-uri 'awsdynamodb://deliveries?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -followers-database-uri 'awsdynamodb://followers?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -following-database-uri 'awsdynamodb://following?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -likes-database-uri 'awsdynamodb://likes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -messages-database-uri 'awsdynamodb://messages?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -notes-database-uri 'awsdynamodb://notes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -posts-database-uri 'awsdynamodb://ap_posts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -date {YYYY-MM-DD} \
   -verbose

If you need/want to set up all those tables, the code to do so is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/create-dynamodb-tables/main.go

And:

https://github.com/sfomuseum/go-activitypub/tree/main/app/dynamodb/tables/create

For example:

go run -mod vendor cmd/create-dynamodb-tables/main.go \
    -dynamodb-client-uri 'awsdynamodb://?region={REGION}&credentials={CREDENTIALS}'

Note that all of these URIs are using a custom awsdynamodb:// scheme. Under the hood all of this code is using the aaronland/go-aws-dynamodb and aaronland/go-aws-session packages in to support a wider range of declarative credentialing options than those offered by the default AWS libraries:

There are also Makefile targets for running everything locally against a containerizer DynamoDB instance:

Expected behavior

An iterator whose Next method triggers an io.EOF error.

Version

0.37.0

Additional context

Documentation (albeit incomplete) for the application's database design and specifics about the use of gocloud.dev/docstore are here:

The really long think-y think-y version is here:

vangent commented 6 months ago

This is hard to debug from here, can you use gohack and add some Printfs in the relevant code to see what's happening?

In here: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L472

Specifically it would be good to know what it.offset, it.limit, it.curr, it.last, it.count, it.limit, and len(it.items) is when this happens.

I think that this line: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L480 is supposed to handle this, but it's not.

It's possible that this is a regression in https://github.com/google/go-cloud/pull/3385.

@bartventer FYI

bartventer commented 6 months ago

I've been digging into this issue and although I was unable to reproduce the panic, I added a preventive check for awsdynamodb here. This check returns io.EOF if there are no more items, which should prevent an index out of range panic when Next is called after it has already returned EOF once.

I also added a test case to verify this behavior. The test repeatedly calls Next until it gets io.EOF, then calls Next again to ensure it still returns io.EOF.

I've pushed these changes in a PR. Please have a look when you get a chance and let me know if there's anything that needs adjusting.

thisisaaronland commented 6 months ago

Thanks for the fix. I can confirm that I am no longer seeing panics calling iter.Next – 

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L5

Question: In the process of investigating the panics I often noticed that the docstore.Query iterators would return/report no results. For example if I call a simple shell script wrapper around the "counts" code like this:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:16:56.081668-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:16:57.021555-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":0,"followers":0,"following":0,"likes":0,"messages":0,"notes":0,"posts":0}

All the query iterators return nothing. But then if I call the same code immediate afterwards the queries return results:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:17:03.270628-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:17:04.100218-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":9045,"followers":1,"following":0,"likes":0,"messages":0,"notes":0,"posts":13}

The code in question is:

I am uncertain whether this is related to the iter.Next problem or something entirely different. If it's more helpful to create a separate issue I am happy to do that.

vangent commented 6 months ago

Are you using Offset at all?

vangent commented 6 months ago

and/or Limit?

vangent commented 6 months ago

@bartventer I think there's a bug in Offset that may be causing both of these issues.

Example:

thisisaaronland commented 6 months ago

I am not currently using either Offet or Limit.

vangent commented 6 months ago

Hrm that damages my theory, but I still think there's a bug in those.

@thisisaaronland it would be really helpful if you could provide some debugging info as requested above, as it seems like you can reproduce this easily and we can't. Maybe even comment out the new lines that fixed the panic so we can understand what's going on there too.

thisisaaronland commented 6 months ago

Sure thing. I may not be able to do that until tomorrow but I will get to it as soon as I can. Thanks for all the help so far.

vangent commented 6 months ago

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at https://github.com/google/go-cloud/pull/3408 ?

@thisisaaronland can you try it out and see if that looks better?

thisisaaronland commented 6 months ago

I couldn't figure out what the replace directive for the changes in #3408 should be so I just brute-force cloned the query.go file in my vendor folder. With only limited testing it doesn't seem to have helped and may, in fact, have made things worse. Specifically:

I will carve out some time in the morning to add the debugging hooks you suggested (above) to try and get more concrete details to work with.

If it's at all helpful what I have observed is that the Query iterator seems to return no results the first time it is invoked but then subsequently it starts to return results, even though the queries themselves are happening in entirely new processes. So maybe there is something happening at the AWS/Dynamo layer where the iterator returns prematurely on the first invocation and that, by extension, Dynamo queries (in AWS) exist in some kind of "window of time" between invocations ?

vangent commented 6 months ago

Can you try again, I made some changes. If it doesn't work let me know, printfs...

On Thu, Mar 21, 2024, 6:16 PM Aaron Straup Cope @.***> wrote:

I couldn't figure out what the replace directive for the changes in #3408 https://github.com/google/go-cloud/pull/3408 should be so I just brute-force cloned the query.go file in my vendor folder. With only limited testing it doesn't seem to have helped and may, in fact, have made things worse. Specifically:

  • Running the code (mentioned above) for the first time the Query iterator continues to return no results.
  • Running the code (mentioned above) a second time simply hung and never returned any results.

I will carve out some time in the morning to add the debugging hooks you suggested (above) to try and get more concrete details to work with.

If it's at all helpful what I have observed is that the Query iterator seems to return no results the first time it is invoked but then subsequently it starts to return results, even though the queries themselves are happening in entirely new processes. So maybe there is something happening at the AWS/Dynamo layer where the iterator returns prematurely on the first invocation and that, by extension, Dynamo queries exist in some kind of "window of time" between invocations ?

— Reply to this email directly, view it on GitHub https://github.com/google/go-cloud/issues/3405#issuecomment-2014139958, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFMCKTSPUUQCJOERCSE74TLYZOA55AVCNFSM6AAAAABE4FMXTGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJUGEZTSOJVHA . You are receiving this because you modified the open/close state.Message ID: @.***>

bartventer commented 6 months ago

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at #3408 ?

@thisisaaronland can you try it out and see if that looks better?

I've had a look at PR #3408 and I agree with the logic there. I've created a new PR (#3409) that builds on that work. @vangent, could you take a look when you get a chance?

thisisaaronland commented 6 months ago

Just a quick note that I applied github.com/bartventer/go-cloud v0.0.0-20240322082227-9d139c92796d and I am not seeing time-outs but I am still seeing inconsistent behaviour:

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L7

Specifically: The first time the query is run it will sometimes return 0 results but then if the query is run again inside a short window of time (but in an entirely new process) the query will return results. That's all I've been able to do so far but will try to carve out time to get more concrete debugging information soon.

vangent commented 6 months ago

Ping @thisisaaronland

vangent commented 3 weeks ago

Any updates here @thisisaaronland ?