vertica / vertica-sql-go

Official native Go client for the Vertica Analytics Database.
http://www.vertica.com
Apache License 2.0

response batching / cursor / lazy queries #43

Closed: fbernier closed this issue 4 years ago

fbernier commented 4 years ago

Hey,

It does not seem possible right now, but it would be nice if we could fetch rows in batches for very large responses. I don't know the exact Vertica facilities that could be leveraged to achieve this, but I guess there is a way.

Thanks for your work!

huebnerr commented 4 years ago

I agree that reading the entire result set into memory isn't necessarily as scalable as it should be. :) The Vertica protocol doesn't support a "pull the next x rows from the cursor/portal and stop" call, so there are a few options to consider.

I'll sync up with @tomwall and @sitingren and we can make a decision on this.

Thank you, @fbernier, for your support!

huebnerr commented 4 years ago

@fbernier I've begun prototyping the feature. With it, you create a VerticaContext (introduced recently) that can set a maximum in-memory row set size. Beyond that limit, everything gets placed in a temporary file and is paged in on read, with the page size equal to the limit. Once the rows object is Close()'ed, the backing file is removed.

By setting the limit to 0, file caching is disabled altogether (the default).

    // Wrap a standard context and cap the in-memory result set at one row;
    // anything beyond the limit gets spooled to a temp file.
    vCtx := NewVerticaContext(context.Background())
    vCtx.SetInMemoryResultRowLimit(1)

    rows, _ := connDB.QueryContext(vCtx, "SELECT * FROM v_monitor.cpu_usage")
    defer rows.Close()
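
Iteration then works like any other database/sql result set. Here's a rough, schema-agnostic sketch of consuming the spooled rows (just counting them, with error handling added for clarity):

    // Rows beyond the in-memory limit are read back from the temp file
    // transparently; Close() removes the backing file when you're done.
    rowCount := 0
    for rows.Next() {
        // Scan() works as usual; skipped here to stay schema-agnostic.
        rowCount++
    }
    if err := rows.Err(); err != nil {
        log.Fatalf("iteration failed: %v", err)
    }
    log.Printf("read %d rows", rowCount)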

How does this sound to you? Would you be willing to help try out and refine the feature once it's ready?

P.S. I've verified that the temp file created by Go is automatically created with read/write permission for the owning user only, and in the system default temp directory. As long as the data can only be read by the user that created it, we should be good.
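
For reference, here's a tiny standalone check of that behavior (nothing driver-specific; the file prefix is arbitrary). ioutil.TempFile creates the file in the system default temp directory with mode 0600:

    package main

    import (
        "fmt"
        "io/ioutil"
        "os"
    )

    func main() {
        // An empty dir argument means "use the system default temp directory";
        // the file is created with read/write permission for the owner only (0600).
        f, err := ioutil.TempFile("", "vertica-spool-")
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            panic(err)
        }
        // On Unix-like systems this prints something like: -rw------- in /tmp
        fmt.Printf("created %s with mode %v in %s\n", f.Name(), info.Mode().Perm(), os.TempDir())
    }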

fbernier commented 4 years ago

@huebnerr I'm definitely checking this all out and testing it today. Thanks!

huebnerr commented 4 years ago

No need to test yet, as it's not fully implemented. I'll let you know when it's ready. Does the concept meet your needs?


fbernier commented 4 years ago

@huebnerr I think it definitely sounds good, yes!

fbernier commented 4 years ago

So I've been thinking about this a bit more, and I feel like your first proposition might be the best one. If I understand correctly what you meant, we would basically recvMessage as we iterate using rows.Next, which seems to be what this postgres driver is doing. We would have to lock, and then release, the connection when rows.Close() is called, but that way the driver would never build the whole result set in memory. I guess that requires heavier changes to the current code than your other proposition. Let me know what you think.
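
Very roughly, the shape I'm picturing is the sketch below. All of the type and method names are made up for illustration; this isn't the driver's actual internals:

    // Hypothetical sketch only. Next() pulls one protocol message off the wire
    // per call instead of spooling everything up front; Close() drains whatever
    // is left so the connection is clean for the next statement.
    type streamingRows struct {
        conn    *connection // hypothetical wrapper around the locked wire connection
        columns []string
        done    bool
    }

    func (r *streamingRows) Next(dest []driver.Value) error {
        if r.done {
            return io.EOF
        }
        msg, err := r.conn.recvMessage() // read the next message lazily
        if err != nil {
            return err
        }
        switch m := msg.(type) {
        case *dataRowMessage: // hypothetical row message
            copy(dest, m.values)
            return nil
        case *commandCompleteMessage: // hypothetical end-of-results marker
            r.done = true
            return io.EOF
        default:
            return fmt.Errorf("unexpected message %T mid-result", msg)
        }
    }

    func (r *streamingRows) Close() error {
        // Drain anything still on the wire, then the connection can be released.
        buf := make([]driver.Value, len(r.columns))
        for !r.done {
            if err := r.Next(buf); err == io.EOF {
                break
            } else if err != nil {
                return err
            }
        }
        return nil
    }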

huebnerr commented 4 years ago

@fbernier Sadly Vertica's current wire protocol doesn't behave the same way as Postgres' and blocking on the reads would actually cause problems from the Go usage semantics perspective.

In the case where rows are left on the wire, iteration hasn't been completed, and another non-rows call is made (e.g., another query/execute), the system would either hang (in the case of mutexes) or return an error (bad state).

After discussing this with @tomwall, it seems the best route is to continue with the spooling. This will keep the operation atomic, be transparent to the user and keep everything copacetic.
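
To make the failure mode concrete, here's an ordinary usage pattern that a lazy-read implementation would break (table names are made up, and both statements are assumed to share the same underlying connection via sql.Conn):

    conn, err := connDB.Conn(ctx)
    if err != nil {
        return err
    }
    defer conn.Close()

    rows, err := conn.QueryContext(ctx, "SELECT id FROM big_table")
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        var id int64
        if err := rows.Scan(&id); err != nil {
            return err
        }
        // Legal as far as database/sql is concerned, but with only one active
        // portal per connection this statement would arrive while the first
        // result set is still on the wire: a hang or a bad-state error unless
        // the driver has already pulled (spooled) everything.
        if _, err := conn.ExecContext(ctx, "TRUNCATE TABLE scratch_table"); err != nil {
            return err
        }
    }
    return rows.Err()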

fbernier commented 4 years ago

Hey that makes sense. I'm ready to test as soon as you have something. Thanks!

huebnerr commented 4 years ago

Haven't forgotten. Will return after Thanksgiving!

huebnerr commented 4 years ago

@fbernier, I've committed and pushed the first implementation to the following branch:

https://github.com/vertica/vertica-sql-go/tree/rhuebner-issue-43-result-batching

If you wish, please help me try it out and see if it meets your usage requirements. It should theoretically be able to handle arbitrarily large data sets. Please see the usage in driver_test.go (the TestEnableResultCache() function) or the description above. If this meets your requirements, I'll harden what's there and finalize the testing.

Thanks

fbernier commented 4 years ago

Hey, I'm sorry, I had missed that reply. I just tried it, and while I still have more testing to do, it seems to work! I haven't read the code yet, but I have a few concerns about the implementation:

1- When looping over rows.Next(), does it wait until the whole query result has been saved to the temp file before reading from it, or does it start reading right away until it reaches an EOF marker? I have a feeling it's the former, and I think we could do the latter. The goal of this feature is mainly memory-related, but performance is next; if we can start working on the data as it arrives, for example for streaming purposes, it would be a nice improvement.

I will report on anything else as I test.

Thanks!

huebnerr commented 4 years ago

Thanks, @fbernier. The case is definitely the former. Again, this is a quirk of the Vertica protocol. When you have a query running, the messages come back in this format:

  1. Metadata
  2. Results
  3. Finished

The current driver waits for all three steps to finish before returning from the Query() call. The issue, as in the paging discussion above, is that the current Vertica wire protocol supports only one active portal/cursor at a time. So if we were to return control early (say, while step 2 is still in progress) and another Query()/Execute() were run, Vertica would return an error. We have to retrieve all of the data before returning control to the user to "cleanse the palate".
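
In pseudocode, the tail end of Query() is effectively a drain loop like this (illustrative names only, not the actual driver code):

    // Query() cannot hand control back until the "finished" message has been
    // consumed, because the connection supports only one active portal at a time.
    func (c *connection) drainResults(spool *resultSpool) error {
        for {
            msg, err := c.recvMessage()
            if err != nil {
                return err
            }
            switch m := msg.(type) {
            case *rowDescriptionMessage: // 1. metadata: column names and types
                spool.setColumns(m.columns)
            case *dataRowMessage: // 2. results: kept in memory or spooled to the temp file
                if err := spool.addRow(m.values); err != nil {
                    return err
                }
            case *commandCompleteMessage: // 3. finished: now it's safe to return
                return nil
            }
        }
    }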

The other option, as I understand from Vertica maestro @tomwall, is to leave the query results on the line, return early, and watch for other disruptive RPC calls. If one of those arrives, spool all of the remaining data at that point, then proceed. This is somewhat tricky, complicated, and error-prone.

On the other hand, Vertica's roadmap includes adding improved portal semantics. Once that's done, we have many more options. But sadly, for now, it has to be pretty one-shot.

fbernier commented 4 years ago

ah yeah I understand... oh well.

As a comparison point, for a query returning 22M rows, it takes ~3.3 minutes without file paging and ~6 minutes with file paging. I'm going to profile it, and I'm sure we can get the two pretty much on par. Paging obviously does help with memory, though, which is great!
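
For the profiling pass I'm planning something like the standard runtime/pprof harness below (the connect string and table name are placeholders to adjust):

    package main

    import (
        "database/sql"
        "log"
        "os"
        "runtime/pprof"

        _ "github.com/vertica/vertica-sql-go"
    )

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        db, err := sql.Open("vertica", "vertica://user:password@host:5433/mydb")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query("SELECT * FROM big_table")
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()
        for rows.Next() {
            // Discard the rows; we only care about driver overhead here.
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
        // Inspect afterwards with: go tool pprof cpu.prof
    }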

huebnerr commented 4 years ago

Please do take a look at it. It's a first-pass implementation and certainly can be improved performance-wise! Check the sections marked "TODO: really inefficient!" and "TODO: optimize". :)
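
One obvious direction, sketched loosely below (the names are placeholders, and this isn't what's in the branch today), is buffering the spool-file writes so each cached row doesn't become its own tiny syscall:

    // Illustrative sketch only, not the current implementation. overflowRows and
    // encodeRow are hypothetical stand-ins for "rows past the in-memory limit"
    // and "however a row gets serialized to the spool file".
    spoolFile, err := ioutil.TempFile("", "vertica-result-")
    if err != nil {
        return err
    }
    w := bufio.NewWriterSize(spoolFile, 1<<20) // 1 MiB buffer

    for _, row := range overflowRows {
        if _, err := w.Write(encodeRow(row)); err != nil {
            return err
        }
    }
    if err := w.Flush(); err != nil { // flush before switching over to reads
        return err
    }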

huebnerr commented 4 years ago

If you think it's okay, @fbernier, I'm going to go ahead and close this issue as a 'feature', merge it in, and open a new issue as an enhancement for optimization.

fbernier commented 4 years ago

I'm totally fine with this, yes. Go ahead and thanks again!

huebnerr commented 4 years ago

Opened as PR: https://github.com/vertica/vertica-sql-go/pull/49

huebnerr commented 4 years ago

This actually seems to have already been addressed and merged, but I neglected to close the issue.