neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: Upload/download snapshot files between S3 and page server #340

Closed: hlinnaka closed this issue 3 years ago

hlinnaka commented 3 years ago

Goal is that if a meteor strikes a page server, it can be fully reconstructed from the files stored in S3. S3 is authoritative source of truth; in a sense, the page server is only a cache of what's stored in S3.

It's not clear how e.g. branch creation should work in this model. Should the control plane create the branch in S3, with the page server automatically noticing that the new branch appeared? Or is there an API between the control plane and the page server to communicate such things? This needs some design work. See also https://github.com/zenithdb/rfcs/blob/7810fcadbcd140241cd595304df1eeb7a7dd6718/snapshot-first-storage.md#cloud-snapshot-manager-operation
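To make the recovery model concrete, here is a minimal runnable sketch, assuming a hypothetical `RemoteStorage` trait and a `timelines/` key prefix (neither is the actual pageserver layout): if the page server is only a cache, rebuilding it after the meteor strike is just re-downloading every snapshot file from the authoritative remote.

```rust
use std::collections::BTreeMap;

// Hypothetical remote-storage abstraction; the real S3 layout and API differ.
trait RemoteStorage {
    fn list(&self, prefix: &str) -> Vec<String>;
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

// In-memory stand-in for S3 so the sketch runs offline.
struct MemStorage {
    objects: BTreeMap<String, Vec<u8>>,
}

impl RemoteStorage for MemStorage {
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
}

// "Meteor recovery": rebuild local state purely from the remote,
// which is the authoritative source of truth.
fn restore_from_remote(remote: &dyn RemoteStorage) -> BTreeMap<String, Vec<u8>> {
    let mut local = BTreeMap::new();
    for key in remote.list("timelines/") {
        if let Some(bytes) = remote.get(&key) {
            local.insert(key, bytes);
        }
    }
    local
}

fn main() {
    let mut objects = BTreeMap::new();
    objects.insert("timelines/main/snap_0010".to_string(), vec![1u8, 2, 3]);
    objects.insert("timelines/branch1/snap_0020".to_string(), vec![4u8, 5]);
    let remote = MemStorage { objects };

    let local = restore_from_remote(&remote);
    assert_eq!(local.len(), 2);
    println!("restored {} snapshot files from remote", local.len());
}
```

Because the snapshot files are immutable, the restore step is idempotent: re-running it after a partial download simply fills in the missing files.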

hlinnaka commented 3 years ago

There's a similar issue at https://github.com/zenithdb/zenith/issues/230. That was underspecified, and ISTM it evolved into something more like import/export. I'm opening this new issue to track specifically the idea that the uploading/downloading to S3 is based on the immutable snapshot files, as Eric alluded to in this comment: https://github.com/zenithdb/zenith/issues/230#issuecomment-865111139.

kelvich commented 3 years ago

> Should the control plane create the branch in S3, and the page server automatically notices that the new branch appeared? Or is there an API between the control plane and page server to communicate such things?

There were some recent discussions about branch/tenant creation and since it is mentioned in this issue I'll respond here.

Right now it is done by direct communication between console and page server ('branch_create' command in page_service.rs).
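For illustration, the direct command path could look roughly like the following dispatcher. The command name echoes `branch_create`, but the argument shape and handler here are hypothetical, not the actual page_service.rs protocol.

```rust
// Hypothetical handler in the spirit of the 'branch_create' command the
// console sends directly to the page server; the arguments are illustrative.
fn handle_command(cmd: &str) -> Result<String, String> {
    let parts: Vec<&str> = cmd.split_whitespace().collect();
    match parts.as_slice() {
        ["branch_create", branch, start_point] => {
            // A real handler would create the branch metadata and timeline
            // files here, then acknowledge to the console.
            Ok(format!("created branch '{}' at '{}'", branch, start_point))
        }
        _ => Err(format!("unrecognized command: {}", cmd)),
    }
}

fn main() {
    let reply = handle_command("branch_create feature-x main").unwrap();
    assert!(reply.contains("feature-x"));
    assert!(handle_command("drop_everything").is_err());
    println!("{}", reply);
}
```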

The whole concept of passing information between a closely located console and page server not directly, but through remote fs-like storage, seems weird to me, but let's imagine we are doing it this way.

Let's also keep in mind that we are trying to create compute nodes in a matter of seconds. Right now nothing in principle prevents us from doing this within a 1-2 second timeframe.

So to summarize: S3 is cool and cheap storage, but trying to use it as a substitute for a database or a network connection only creates problems.

petergeoghegan commented 3 years ago

> Goal is that if a meteor strikes a page server, it can be fully reconstructed from the files stored in S3. S3 is authoritative source of truth; in a sense, the page server is only a cache of what's stored in S3.

In what sense is that true, and in what sense is that not true? This seems important to me.

ericseppanen commented 3 years ago

A few responses:

> page server automatically notices that the new branch appeared

I'm not sure why we'd expect it to. If a new branch was created, I would think that was either done by the control plane, or commanded by the control plane; the control plane can then notify the right pageserver.

> Rails exposure to our storage format.
>
> We would need to understand directory structure and file formats in the rails app. That is some amount of code that needs to be in sync with the same code in the page server.

I agree, that seems tricky. My first thought would be to build an FFI rust crate that has access to the functions that understand directory structure & file formats. A less powerful option would be a standalone rust binary that can run the necessary operations as a command-line tool.

> S3 latency

I agree that creating a new timeline in S3 would add some delay, though connecting to an existing timeline would not. There might be a few workarounds for this:

If new database creation time is a critical metric, then maybe it's OK to allow the pageserver to run only that operation in a non-durable manner. I don't think it's a good idea to give ourselves that kind of freedom (to create non-durable state) in general, though.
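That tradeoff can be sketched as follows (names and structure are assumptions, not pageserver code): durable creation pays an S3 round trip before acknowledging, while the non-durable fast path acknowledges with state that so far exists only locally.

```rust
use std::collections::BTreeMap;

// Stand-in for remote (S3) state; illustrative only, not the real pageserver.
struct Remote {
    objects: BTreeMap<String, Vec<u8>>,
}

// Create a timeline either durably (metadata uploaded before we acknowledge)
// or non-durably (acknowledge first, upload later). Returns whether the
// metadata was already durable at acknowledgment time.
fn create_timeline(remote: &mut Remote, id: &str, durable: bool) -> bool {
    let key = format!("timelines/{}/metadata", id);
    if durable {
        // Synchronous upload: costs an S3 round trip, but a crash after the
        // ack can always be recovered from remote storage.
        remote.objects.insert(key, b"metadata".to_vec());
        true
    } else {
        // Fast path: the timeline exists only locally for a while, and a
        // crash in that window loses it. An upload would be scheduled in the
        // background here.
        let _ = key;
        false
    }
}

fn main() {
    let mut remote = Remote { objects: BTreeMap::new() };
    assert!(create_timeline(&mut remote, "t1", true));
    assert!(remote.objects.contains_key("timelines/t1/metadata"));
    assert!(!create_timeline(&mut remote, "t2", false));
    println!("remote holds {} objects", remote.objects.len());
}
```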

kelvich commented 3 years ago

> I'm not sure why we'd expect it to. If a new branch was created, I would think that was either done by the control plane, or commanded by the control plane; the control plane can then notify the right pageserver.

Right. But that creates a logical problem for me: why would the console talk to S3 in the first place if it needs to talk to the page server API afterwards anyway? Why don't we put the corresponding code in the page server, behind the same API call? That way the page server would create the corresponding metadata in S3 itself. And the call could be synchronous with respect to the S3 operation, so we don't introduce any non-S3-backed state in the page server.
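A minimal sketch of that ordering, under hypothetical names: the page server's handler persists metadata to remote storage first and mirrors it locally second, so the call never returns with local state that isn't S3-backed.

```rust
use std::collections::BTreeMap;

// Illustrative page server holding both remote (S3 stand-in) and local state;
// not the real pageserver data structures.
struct PageServer {
    remote: BTreeMap<String, Vec<u8>>,
    local: BTreeMap<String, Vec<u8>>,
}

impl PageServer {
    // Single console-facing call: write the branch metadata to remote storage
    // synchronously, then mirror it locally. A crash between the two steps
    // leaves the remote (authoritative) copy ahead, never the local one.
    fn create_branch(&mut self, name: &str) -> String {
        let key = format!("branches/{}/metadata", name);
        self.remote.insert(key.clone(), b"branch metadata".to_vec());
        self.local.insert(key.clone(), b"branch metadata".to_vec());
        key
    }
}

fn main() {
    let mut ps = PageServer {
        remote: BTreeMap::new(),
        local: BTreeMap::new(),
    };
    let key = ps.create_branch("feature-x");
    // Local state never exists without a matching remote object.
    assert_eq!(ps.remote.get(&key), ps.local.get(&key));
    println!("created {}", key);
}
```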

> I agree, that seems tricky. My first thought would be to build an FFI rust crate that has access to the functions that understand directory structure & file formats. A less powerful option would be a standalone rust binary that can run the necessary operations as a command-line tool.

And what I'm suggesting is to put all that logic in the page server (which needs to know those formats anyway) and use HTTP API calls to run these operations from the console.

> Create the timeline concurrently with pageserver start.

I think you meant concurrently with postgres start. Yeah, that may work: if we do it in parallel, it may hide the S3 latency. If that's the case, then we should definitely do metadata operations synchronously, and so avoid dealing with cases where the pageserver crashed before creating metadata in S3.
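The overlap can be sketched with plain threads; the durations below are made-up stand-ins for S3 round trips and compute provisioning.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for creating timeline metadata in S3 (network-bound).
fn create_timeline_in_s3() -> &'static str {
    thread::sleep(Duration::from_millis(50));
    "timeline ready"
}

// Stand-in for starting the compute (postgres) node.
fn start_compute_node() -> &'static str {
    thread::sleep(Duration::from_millis(50));
    "compute ready"
}

fn main() {
    let started = Instant::now();
    let t = thread::spawn(create_timeline_in_s3);
    let c = thread::spawn(start_compute_node);
    assert_eq!(t.join().unwrap(), "timeline ready");
    assert_eq!(c.join().unwrap(), "compute ready");
    // Run in parallel, the two waits overlap: total wall time is close to
    // the longer of the two, not their sum.
    println!("both ready after {:?}", started.elapsed());
}
```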

mklgallegos commented 3 years ago

Not sure who this should be assigned to; assigning to @hlinnaka. @hlinnaka, can you delegate if you're not the right owner for this issue?

hlinnaka commented 3 years ago

@lubennikovaav presented some early thoughts on the design last week.

kelvich commented 3 years ago

@SomeoneToIgnore As far as I can see, there are separate issues for the next parts. I assume we can close this one.