onelson / estuary


Add a (simple) Database #24

Open onelson opened 3 years ago

onelson commented 3 years ago

Right now, cargo sends a lot of data with each publish that Estuary discards, since the git-tracked index format has nowhere to store it.

For example, one big usability win for Estuary's frontend would be to capture the readme content, description, keywords, links to git repos and documentation sites, and timestamps for when each publish occurred.
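To make that concrete, here's a rough sketch of the kind of table this could land in. The table and column names are hypothetical (nothing here reflects actual Estuary code), shown with Python's stdlib `sqlite3` just for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for metadata cargo already sends on publish.
conn.executescript("""
CREATE TABLE crate_versions (
    name          TEXT NOT NULL,
    version       TEXT NOT NULL,
    description   TEXT,
    keywords      TEXT,          -- e.g. a JSON array, as sent by cargo
    repository    TEXT,
    documentation TEXT,
    readme        TEXT,
    published_at  TEXT NOT NULL, -- ISO-8601 timestamp of the publish
    PRIMARY KEY (name, version)
);
""")
conn.execute(
    "INSERT INTO crate_versions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("demo", "0.1.0", "A demo crate", '["demo"]',
     "https://example.com/demo.git", None, "# demo", "2021-01-01T00:00:00Z"),
)
row = conn.execute(
    "SELECT description, published_at FROM crate_versions WHERE name = ?",
    ("demo",),
).fetchone()
print(row)  # ('A demo crate', '2021-01-01T00:00:00Z')
```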

In addition to storing this content, many databases offer extensions that can be leveraged for full-text search. This could be used to improve the features, storage, and overall performance of the search endpoint (#23).
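For instance, SQLite ships the FTS5 extension (enabled in most builds), which would cover the basic search case. A quick sketch with Python's stdlib `sqlite3` and made-up table/column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table over the fields we'd want searchable.
# Requires SQLite built with FTS5, which most distributions enable.
conn.executescript("""
CREATE VIRTUAL TABLE crate_search USING fts5(name, description, keywords);
INSERT INTO crate_search VALUES
  ('serde', 'A generic serialization/deserialization framework', 'serde no_std'),
  ('estuary', 'An alternative private crate registry', 'registry cargo private');
""")
# MATCH does tokenized full-text search; rank orders by relevance.
hits = conn.execute(
    "SELECT name FROM crate_search WHERE crate_search MATCH ? ORDER BY rank",
    ("registry",),
).fetchall()
print(hits)  # [('estuary',)]
```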


A loose design goal should be to offer a simple deployment option that doesn't require an external database (if possible), while planning for external-database support as a secondary option. The rationale: storage options like SQLite are not ideal for deployments involving shared disk, which can constrain how you'd deploy Estuary (ex: taking steps to ensure more than one running instance isn't touching the database concurrently).
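For context, SQLite's own mitigations for local concurrency (WAL journal mode plus a busy timeout) help with multiple readers alongside a single writer on one machine, but SQLite's docs warn they don't make network/shared filesystems safe, which is the concern above. A quick sketch in Python's stdlib `sqlite3`:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "estuary.db")

# timeout= makes a writer wait on a held lock instead of failing immediately.
conn = sqlite3.connect(path, timeout=5.0)

# WAL allows concurrent readers alongside one writer -- on a local disk only.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal
```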

At this point, I'd say we should pursue something along the lines of SQLite, with postgres as a follow-up. We should briefly do a search to see if there's a "better than SQLite" option. I'm not sure whether sled offers disk persistence or is purely in-memory, as I'd believed it to be, for example. In JVM space, H2 is a thing - maybe rust has a driver available. There might be others.

There might be other options to investigate for remote databases, such as redis, but we'll look at those later still.

With this in mind, the modules in Estuary's source should be arranged with feature selection in mind.
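Sketching what that could look like in Cargo.toml - the feature names and version numbers here are illustrative, not Estuary's actual manifest:

```toml
[features]
default = ["sqlite"]
sqlite = ["rusqlite"]
postgres = ["tokio-postgres"]

[dependencies]
rusqlite = { version = "0.26", optional = true }
tokio-postgres = { version = "0.7", optional = true }
```

Modules in the source tree could then be gated with `#[cfg(feature = "sqlite")]` / `#[cfg(feature = "postgres")]` so a build only carries the backend it was asked for.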

Private registries for other languages, such as verdaccio (npm) and devpi (Python), appear to use "files on disk" database implementations with deployment concerns similar to SQLite's, so I imagine this is a decent place to start.

onelson commented 3 years ago

I took a look at some of the docs for sled, and at this stage I'm not feeling comfortable adopting it for Estuary. The plan, for now, is to rely on SQLite via the rusqlite crate. Since the API offered by that crate aims to approximate rust-postgres, it may help set us up for our second pass adding external db support.

rickwebiii commented 2 years ago

For what it's worth, we chose Estuary because it did not require a database to use. We wanted something dead simple with nearly zero administration overhead. Our use case is to deliver a known set of packages under our control for private betas and pre-releases. We arrived at Estuary running in a docker container on AWS behind an nginx HTTPS reverse proxy.

A few characteristics of our use case:

- If the database runs in-proc or as a subprocess as an implementation detail, that's probably okay, but I don't want to configure or administer it, and I certainly don't want additional attack surface beyond the registry's REST API.

onelson commented 2 years ago

@rickwebiii good notes. Thanks for that!

Today, Estuary drops a lot of the information sent by cargo during a publish since we simply don't have a good way to record it. Ideally, we'd capture more of it, retain it, and present it where possible.

Keeping things simple with little config required is certainly an aim, but I expect many of the features I'd originally hoped to deliver (like passive crate caching, stats, ownership tracking, or even capturing readme info) will require something to persist state. That will likely add complexity somewhere; the question is how much, and where.

A certain amount of the data could be built on the fly if we're willing to peek inside the package artifacts, so I'll be sure to consider that option for your specific use case. Nothing is decided here yet.
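For reference, a `.crate` file is just a gzipped tarball whose entries live under a `name-version/` prefix, so pulling something like a readme out on the fly is cheap. A sketch with Python's stdlib `tarfile` (Estuary itself would do this in Rust, and the file names here are made up):

```python
import io
import tarfile

# Build a minimal stand-in for a .crate file (a gzipped tarball rooted
# at "name-version/"), since we have no real artifact here.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    body = b"# demo\nA readme embedded in the package artifact."
    info = tarfile.TarInfo("demo-0.1.0/README.md")
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))
buf.seek(0)

# "Peek inside" the artifact: extract just the readme, no database needed.
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    readme = tar.extractfile("demo-0.1.0/README.md").read().decode()
print(readme.splitlines()[0])  # # demo
```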