sio2project / oioioi


Support for S3 storage backend #276

Open DietPawel opened 11 months ago

DietPawel commented 11 months ago

Context:

Currently OIOIOI uses filetracker as a storage backend. It is a custom implementation of standard Object Storage and is more or less compatible with S3. More than half of recent Szkopuł outages were due to filetracker2 instability. They are probably related to the inefficient Python implementation and its resource usage, which triggers OOM kills.

In the past, filetracker (1) used a lighttpd server for serving files and a Python endpoint for write operations. The storage needs turned out to be huge, and the solution that compressed the data several times over was file-level deduplication. Unfortunately, serving files from Python just does not hold up in our production environment.

Idea:

Add support for an S3-like storage backend. This way we can stop maintaining the filetracker2 server. Again, it does not support any disaster recovery, it is simply unreliable, and we have too few resources to engineer another Object Storage solution.

Issues:

  1. S3 typically does not support deduplication, which means we have to either: a) implement it client-side (see the sketch after this list), b) add deduplication as an S3 middleware, or c) get rid of deduplication (and delegate it to the S3 server's filesystem).

  2. Some migration strategy (and tooling) would be great to have.

  3. Filetracker is used not only by OIOIOI but also by sioworkers. The change must be made in both projects at the same time, or some temporary compatibility layer is needed.
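For option a) from the list above, here is a minimal sketch of what client-side deduplication against a plain S3 API could look like with boto3. The bucket name and the SHA-256 key scheme are assumptions for illustration, not filetracker's actual layout:

```python
# Sketch only: content-addressed writes to S3 via boto3.
# Assumptions: a hypothetical bucket name and plain SHA-256 digests as keys
# (not the real filetracker key scheme).
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "oioioi-files"  # hypothetical bucket name


def put_deduplicated(data: bytes) -> str:
    """Store data under its SHA-256 digest; skip the upload if it is already there."""
    key = hashlib.sha256(data).hexdigest()
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return key  # identical content already stored, nothing to upload
    except ClientError as e:
        # head_object reports a missing key as a 404; anything else is a real error
        if e.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key
```

The awkward part of any client-side scheme is deletion: once several logical files can point at the same blob, removing one of them needs some form of reference counting before the blob itself may be deleted.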

Please share your thoughts about this below!

A-dead-pixel commented 11 months ago

More than half of recent Szkopuł outages were due to filetracker2 instability.

Did that instability manifest, by any chance, in random DB corruption(?) requiring a restart of the filetracker server? Like HTTP/500: (-30973, 'BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery -- BDB0060 PANIC: fatal region error detected; run recovery').

At least for my deployment, simply updating filetracker's dependencies seemingly fixed this, though checking the memory usage didn't occur to me. I should still have at least one problem package that allows for somewhat reproducing this error locally.

DietPawel commented 11 months ago

I parsed through the last year of logs and found no signs of corruption like that. I found it easy to crash filetracker2 on purpose with a rejudge when the number of available cores is about 200 and concurrency is high enough to utilize that. I was unable to do that with filetracker1.

A-dead-pixel commented 9 months ago

Please share your thoughts about this below!

For me and Stowarzyszenie Talent, as long as decent options for self-hosting exist, switching to an S3-like backend shouldn't be an issue apart from the migration. Based on my short research, MinIO and SeaweedFS seem OK. LocalStack looks more tailored to local development than production deployments, but should be enough for a development environment. Additionally, it's written in Python and implements a multitude of services apart from S3, so resource usage might not be ideal.
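To make the self-hosting point concrete, here is a minimal sketch with a placeholder endpoint, credentials, and bucket name: from the application's point of view, a self-hosted MinIO or SeaweedFS instance is just an S3 endpoint URL handed to a standard client.

```python
# Sketch: a standard boto3 S3 client pointed at a self-hosted, S3-compatible
# server. The endpoint URL, credentials and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.local:9000",  # self-hosted S3-compatible server
    aws_access_key_id="minioadmin",          # placeholder credentials
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="oioioi-files")      # run once; hypothetical bucket name
s3.put_object(Bucket="oioioi-files", Key="test/hello.txt", Body=b"hello")
print(s3.get_object(Bucket="oioioi-files", Key="test/hello.txt")["Body"].read())
```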

Is the current plan for SZKOpuł and sio2.mimuw to use a cloud s3 or a self-hosted solution?

DietPawel commented 9 months ago

Is the current plan for SZKOpuł and sio2.mimuw to use a cloud s3 or a self-hosted solution?

This really depends on how it will be implemented. If no deduplication takes place, it would be too expensive to put it in the cloud, and latency might also be an extra issue. To the options for self-hosting I would also add Ceph.

The main goal is to get something more mature as the storage backend. As for the migration, I think a simple translation layer from Filetracker to S3 should allow for an upgrade without downtime.

metenn commented 7 months ago

Did that instability manifest, by any chance, in random DB corruption(?) requiring a restart of the filetracker server? Like HTTP/500: (-30973, 'BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery -- BDB0060 PANIC: fatal region error detected; run recovery').

As far as I'm aware, this is intended filetracker behavior: when a single process gets terminated, the entirety of filetracker needs to restart, for reasons that are hidden behind a page that now returns 403... And that causes those error messages in particular.

At least for my deployment, simply updating filetracker's dependencies seemingly fixed this, though checking the memory usage didn't occur to me. I should still have at least one problem package that allows for somewhat reproducing this error locally.

That's interesting! Could you please send that problem package my way if you still have it by any chance?

A-dead-pixel commented 7 months ago

That's interesting! Could you please send that problem package my way if you still have it by any chance?

Unfortunately, I didn't manage to trigger that bug now with the package I used before, on both new and older (April 2023) oioioi. It could be that it was somehow specific to my deployments/forks, or just that some specific outside condition disappeared. It wasn't deterministic at all, as only about 1/2 to 1/3 of package uploads were failing on my system. If I find a package that allows for reproducing this issue, I will contact you.

Regardless, there probably wasn't much to be found, as I already wasted many hours trying to investigate it and arrived at "Fatal Python error: GC object already tracked", after which I gave up.

MasloMaslane commented 5 months ago

Currently we are planning to create a proxy between filetracker clients and S3 (the repo). The proxy will manage deduplication, and no changes will be required in current filetracker clients. Online migration will also be possible: if a file is not found on S3, the proxy asks the old filetracker server (similar to the migration from filetracker 1 to filetracker 2).
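To illustrate the fallback-read part of that plan, here is a rough sketch under some assumptions (boto3 plus requests, a hypothetical bucket and legacy filetracker URL layout; the real proxy lives in its own repo and its key scheme and API may differ):

```python
# Sketch of the "serve from S3, fall back to the old filetracker" read path.
# Assumptions: hypothetical bucket name, legacy server URL and /files/ layout.
import boto3
import requests
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "oioioi-files"                      # hypothetical bucket name
OLD_FILETRACKER = "http://filetracker:9999"  # hypothetical legacy server URL


def get_file(key: str) -> bytes:
    """Serve from S3; on a miss, fetch from the old filetracker and backfill."""
    try:
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] != "NoSuchKey":
            raise
    resp = requests.get(f"{OLD_FILETRACKER}/files/{key}")
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.content)  # lazy, online migration
    return resp.content
```

This keeps the migration online: files get copied over on first access, and anything never requested can be bulk-copied later.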