neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: Upload/download snapshot files between S3 and page server #340

Closed: hlinnaka closed this issue 3 years ago

hlinnaka commented 3 years ago

Goal is that if a meteor strikes a page server, it can be fully reconstructed from the files stored in S3. S3 is authoritative source of truth; in a sense, the page server is only a cache of what's stored in S3.

It's not clear how e.g. branch creation should work in this model. Should the control plane create the branch in S3, with the page server automatically noticing that the new branch appeared? Or is there an API between the control plane and the page server to communicate such things? This needs some design work. See also https://github.com/zenithdb/rfcs/blob/7810fcadbcd140241cd595304df1eeb7a7dd6718/snapshot-first-storage.md#cloud-snapshot-manager-operation
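To make the recovery model concrete, here is a minimal runnable sketch, assuming a hypothetical `RemoteStorage` trait and a `timelines/` key prefix (neither is the actual pageserver layout): if the page server is only a cache, rebuilding it after the meteor strike is just re-downloading every snapshot file from the authoritative remote.

```rust
use std::collections::BTreeMap;

// Hypothetical remote-storage abstraction; the real S3 layout and API differ.
trait RemoteStorage {
    fn list(&self, prefix: &str) -> Vec<String>;
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

// In-memory stand-in for S3 so the sketch runs offline.
struct MemStorage {
    objects: BTreeMap<String, Vec<u8>>,
}

impl RemoteStorage for MemStorage {
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
}

// "Meteor recovery": rebuild local state purely from the remote,
// which is the authoritative source of truth.
fn restore_from_remote(remote: &dyn RemoteStorage) -> BTreeMap<String, Vec<u8>> {
    let mut local = BTreeMap::new();
    for key in remote.list("timelines/") {
        if let Some(bytes) = remote.get(&key) {
            local.insert(key, bytes);
        }
    }
    local
}

fn main() {
    let mut objects = BTreeMap::new();
    objects.insert("timelines/main/snap_0010".to_string(), vec![1u8, 2, 3]);
    objects.insert("timelines/branch1/snap_0020".to_string(), vec![4u8, 5]);
    let remote = MemStorage { objects };

    let local = restore_from_remote(&remote);
    assert_eq!(local.len(), 2);
    println!("restored {} snapshot files from remote", local.len());
}
```

Because the snapshot files are immutable, the restore step is idempotent: re-running it after a partial download simply fills in the missing files.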

hlinnaka commented 3 years ago

There's a similar issue at https://github.com/zenithdb/zenith/issues/230. That was underspecified, and ISTM it evolved into something more like import/export. I'm opening this new issue to track specifically the idea that the uploading/downloading to S3 is based on the immutable snapshot files, as Eric alluded to in this comment: https://github.com/zenithdb/zenith/issues/230#issuecomment-865111139.

kelvich commented 3 years ago

> Should the control plane create the branch in S3, and the page server automatically notices that the new branch appeared? Or is there an API between the control plane and page server to communicate such things?

There were some recent discussions about branch/tenant creation and since it is mentioned in this issue I'll respond here.

Right now it is done by direct communication between console and page server ('branch_create' command in page_service.rs).
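For illustration, the direct command path could look roughly like the following dispatcher. The command name echoes `branch_create`, but the argument shape and handler here are hypothetical, not the actual page_service.rs protocol.

```rust
// Hypothetical handler in the spirit of the 'branch_create' command the
// console sends directly to the page server; the arguments are illustrative.
fn handle_command(cmd: &str) -> Result<String, String> {
    let parts: Vec<&str> = cmd.split_whitespace().collect();
    match parts.as_slice() {
        ["branch_create", branch, start_point] => {
            // A real handler would create the branch metadata and timeline
            // files here, then acknowledge to the console.
            Ok(format!("created branch '{}' at '{}'", branch, start_point))
        }
        _ => Err(format!("unrecognized command: {}", cmd)),
    }
}

fn main() {
    let reply = handle_command("branch_create feature-x main").unwrap();
    assert!(reply.contains("feature-x"));
    assert!(handle_command("drop_everything").is_err());
    println!("{}", reply);
}
```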

The whole concept of passing information between a closely located console and page server not directly, but through remote fs-like storage, seems weird to me, but let's imagine we are doing it this way.

Let's also keep in mind that we are trying to create compute nodes in a matter of seconds. Right now nothing in principle prevents us from doing this within a 1-2 second timeframe.

So to summarize: S3 is cool and cheap storage, but trying to use it as a substitute for a database or a network connection only creates problems.

petergeoghegan commented 3 years ago

> Goal is that if a meteor strikes a page server, it can be fully reconstructed from the files stored in S3. S3 is authoritative source of truth; in a sense, the page server is only a cache of what's stored in S3.

In what sense is that true, and in what sense is that not true? This seems important to me.

ericseppanen commented 3 years ago

A few responses:

> page server automatically notices that the new branch appeared

I'm not sure why we'd expect it to. If a new branch was created, I would think that was either done by the control plane, or commanded by the control plane; the control plane can then notify the right pageserver.

> Rails exposure to our storage format.
>
> We would need to understand directory structure and file formats in the rails app. That is some amount of code that needs to be in sync with the same code in the page server.

I agree, that seems tricky. My first thought would be to build an FFI rust crate that has access to the functions that understand directory structure & file formats. A less powerful option would be a standalone rust binary that can run the necessary operations as a command-line tool.

> S3 latency

I agree that creating a new timeline in S3 would add some delay, though connecting to an existing timeline would not. There might be a few workarounds for this:

If new database creation time is a critical metric, then maybe it's OK to allow the pageserver to run only that operation in a non-durable manner. I don't think it's a good idea to give ourselves that kind of freedom (to create non-durable state) in general, though.
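That tradeoff can be sketched as follows (names and structure are assumptions, not pageserver code): durable creation pays an S3 round trip before acknowledging, while the non-durable fast path acknowledges with state that so far exists only locally.

```rust
use std::collections::BTreeMap;

// Stand-in for remote (S3) state; illustrative only, not the real pageserver.
struct Remote {
    objects: BTreeMap<String, Vec<u8>>,
}

// Create a timeline either durably (metadata uploaded before we acknowledge)
// or non-durably (acknowledge first, upload later). Returns whether the
// metadata was already durable at acknowledgment time.
fn create_timeline(remote: &mut Remote, id: &str, durable: bool) -> bool {
    let key = format!("timelines/{}/metadata", id);
    if durable {
        // Synchronous upload: costs an S3 round trip, but a crash after the
        // ack can always be recovered from remote storage.
        remote.objects.insert(key, b"metadata".to_vec());
        true
    } else {
        // Fast path: the timeline exists only locally for a while, and a
        // crash in that window loses it. An upload would be scheduled in the
        // background here.
        let _ = key;
        false
    }
}

fn main() {
    let mut remote = Remote { objects: BTreeMap::new() };
    assert!(create_timeline(&mut remote, "t1", true));
    assert!(remote.objects.contains_key("timelines/t1/metadata"));
    assert!(!create_timeline(&mut remote, "t2", false));
    println!("remote holds {} objects", remote.objects.len());
}
```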

kelvich commented 3 years ago

> I'm not sure why we'd expect it to. If a new branch was created, I would think that was either done by the control plane, or commanded by the control plane; the control plane can then notify the right pageserver.

Right. But that creates a logical problem for me: why would the console talk to S3 in the first place if it needs to talk to the page server API afterwards anyway? Why don't we put the corresponding code in the page server, behind the same API call? That way the page server would create the corresponding metadata in S3 itself. And the call could be synchronous with respect to the S3 operation, so we don't introduce any non-S3-backed state in the page server.
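A minimal sketch of that ordering, under hypothetical names: the page server's handler persists metadata to remote storage first and mirrors it locally second, so the call never returns with local state that isn't S3-backed.

```rust
use std::collections::BTreeMap;

// Illustrative page server holding both remote (S3 stand-in) and local state;
// not the real pageserver data structures.
struct PageServer {
    remote: BTreeMap<String, Vec<u8>>,
    local: BTreeMap<String, Vec<u8>>,
}

impl PageServer {
    // Single console-facing call: write the branch metadata to remote storage
    // synchronously, then mirror it locally. A crash between the two steps
    // leaves the remote (authoritative) copy ahead, never the local one.
    fn create_branch(&mut self, name: &str) -> String {
        let key = format!("branches/{}/metadata", name);
        self.remote.insert(key.clone(), b"branch metadata".to_vec());
        self.local.insert(key.clone(), b"branch metadata".to_vec());
        key
    }
}

fn main() {
    let mut ps = PageServer {
        remote: BTreeMap::new(),
        local: BTreeMap::new(),
    };
    let key = ps.create_branch("feature-x");
    // Local state never exists without a matching remote object.
    assert_eq!(ps.remote.get(&key), ps.local.get(&key));
    println!("created {}", key);
}
```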

> I agree, that seems tricky. My first thought would be to build an FFI rust crate that has access to the functions that understand directory structure & file formats. A less powerful option would be a standalone rust binary that can run the necessary operations as a command-line tool.

And what I'm suggesting is to put all that logic in the page server (which needs to know those formats anyway) and use HTTP API calls to run these operations from the console.

> Create the timeline concurrently with pageserver start.

I think you meant concurrently with postgres start. Yeah, that may work: if we do it in parallel, it may hide the S3 latency. If that's the case, then we should definitely do metadata operations synchronously, and so avoid dealing with cases where the pageserver crashed before creating metadata in S3.
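The overlap can be sketched with plain threads; the durations below are made-up stand-ins for S3 round trips and compute provisioning.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for creating timeline metadata in S3 (network-bound).
fn create_timeline_in_s3() -> &'static str {
    thread::sleep(Duration::from_millis(50));
    "timeline ready"
}

// Stand-in for starting the compute (postgres) node.
fn start_compute_node() -> &'static str {
    thread::sleep(Duration::from_millis(50));
    "compute ready"
}

fn main() {
    let started = Instant::now();
    let t = thread::spawn(create_timeline_in_s3);
    let c = thread::spawn(start_compute_node);
    assert_eq!(t.join().unwrap(), "timeline ready");
    assert_eq!(c.join().unwrap(), "compute ready");
    // Run in parallel, the two waits overlap: total wall time is close to
    // the longer of the two, not their sum.
    println!("both ready after {:?}", started.elapsed());
}
```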

mklgallegos commented 3 years ago

Not sure who this should be assigned to; assigning to @hlinnaka. @hlinnaka, can you delegate if you're not the right owner for this issue?

hlinnaka commented 3 years ago

@lubennikovaav presented some early thoughts on the design last week.

kelvich commented 3 years ago

@SomeoneToIgnore As far as I can see, there are separate issues for the next parts. I assume we can close this one.