linagora / james-project

Mirror of Apache James Project
Apache License 2.0

[Feature idea] S3 large object compression #5098

Open chibenwa opened 6 months ago

chibenwa commented 6 months ago

What?

Apache James stores large objects in S3, and the amount of data can be significant.

One way to reduce the cloud bill is deduplication (2-4x cost savings), but even deduplicated, a typical workload amounts to 6 TB of data for 3,000 users.

At OVH this costs €1440 per month when replicating to 2 regions, not including marginal transfer costs.

While this is not that much, we could reduce the bill further by a factor of 2-3x by compressing large objects before putting them in S3.

How?

Add a "compression threshold" property to blob.properties, disabled by default.

When uploading an object larger than this threshold, James would compress it before uploading it and record the compression in the blob's metadata.

The read path would use that metadata to know whether decompression is needed.
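
A minimal sketch of what the write/read paths could look like, assuming the compressed/uncompressed flag is stored as a marker byte in front of the payload (it could equally live in S3 object metadata); the class and threshold handling are illustrative, not the actual James API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: compress payloads above a configurable threshold before
// they reach the S3 blob store, and prepend a marker byte so the read path
// knows whether decompression is needed.
public class CompressingBlobCodec {
    private static final byte UNCOMPRESSED = 0;
    private static final byte GZIP = 1;

    private final int compressionThresholdBytes; // would come from blob.properties

    public CompressingBlobCodec(int compressionThresholdBytes) {
        this.compressionThresholdBytes = compressionThresholdBytes;
    }

    // Write path: compress only payloads at or above the threshold.
    public byte[] encode(byte[] payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (payload.length < compressionThresholdBytes) {
            out.write(UNCOMPRESSED);
            out.write(payload);
            return out.toByteArray();
        }
        out.write(GZIP);
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        }
        return out.toByteArray();
    }

    // Read path: inspect the marker and decompress when needed.
    public InputStream decode(byte[] stored) throws IOException {
        InputStream raw = new ByteArrayInputStream(stored, 1, stored.length - 1);
        if (stored[0] == GZIP) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }
}
```

The threshold itself would be the new blob.properties entry, disabled by default.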

Current thought

quantranhong1999 commented 6 months ago

additional idea: move long-lived objects to cheap storage like S3 Glacier?

AWS supports automatically moving long-lived objects to S3 Glacier via lifecycle rules. I am not sure if OVH supports that; it seems OVH has "Cold Archive", but I am not sure how mature it is.
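
For reference, on AWS itself this is a single bucket lifecycle rule. A hedged sketch with the AWS SDK for Java v2 follows; the bucket name, prefix and 180-day cutoff are made up, and an S3-compatible provider that does not implement lifecycle policies would simply reject the call:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;
import software.amazon.awssdk.services.s3.model.Transition;
import software.amazon.awssdk.services.s3.model.TransitionStorageClass;

public class GlacierLifecycleExample {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Move every object under the given prefix to Glacier 180 days after creation.
            LifecycleRule rule = LifecycleRule.builder()
                .id("archive-old-blobs")
                .status(ExpirationStatus.ENABLED)
                .filter(LifecycleRuleFilter.builder().prefix("blobs/").build())
                .transitions(Transition.builder()
                    .days(180)
                    .storageClass(TransitionStorageClass.GLACIER)
                    .build())
                .build();

            s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                .bucket("james-blob-bucket")
                .lifecycleConfiguration(BucketLifecycleConfiguration.builder()
                    .rules(rule)
                    .build())
                .build());
        }
    }
}
```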

chibenwa commented 6 months ago

> additional idea: move long-lived objects to cheap storage like S3 Glacier?

Yes, this could be an idea. However, what would an implementation plan for this look like?

vttranlina commented 6 months ago

Can we check the distribution of file types across the 6 TB? (e.g. 30% txt files, 20% mp4, ...) Then we could try compressing the file types that occupy the largest share of storage locally (simply with a tool/command) and see whether it is effective.
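
A throwaway sketch of that local measurement, gzipping every file in a sample directory and aggregating the ratio per extension (the sample path is a placeholder):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPOutputStream;

public class CompressionSurvey {
    // Counts bytes written without keeping the compressed data in memory.
    static class CountingOutputStream extends OutputStream {
        long count;
        @Override public void write(int b) { count++; }
        @Override public void write(byte[] b, int off, int len) { count += len; }
    }

    public static void main(String[] args) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Path.of("/tmp/blob-sample"))) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        Map<String, long[]> perExtension = new HashMap<>(); // extension -> {raw bytes, gzipped bytes}
        for (Path file : files) {
            String name = file.getFileName().toString();
            String ext = name.contains(".") ? name.substring(name.lastIndexOf('.') + 1) : "(none)";

            CountingOutputStream counter = new CountingOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(counter)) {
                Files.copy(file, gzip);
            }
            long[] totals = perExtension.computeIfAbsent(ext, k -> new long[2]);
            totals[0] += Files.size(file);
            totals[1] += counter.count;
        }

        perExtension.forEach((ext, totals) -> System.out.printf(
            "%s: %d raw, %d gzipped (%.0f%% of original)%n",
            ext, totals[0], totals[1], 100.0 * totals[1] / totals[0]));
    }
}
```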

quantranhong1999 commented 6 months ago

> Yes, this could be an idea. However, what would an implementation plan for this look like?

It seems OVH does not support lifecycle policies for S3 storage classes yet, cf. https://github.com/ovh/public-cloud-roadmap/issues/210

So it is not available on OVH yet and we would need to do it on the application side. Maybe a cron job to move objects to a cheaper storage class, plus a middleware service resolving the mapping from the original S3 object id to the new cheap object id, would do the job (sketched below).

But I am not sure the implementation would be worth it, TBH; it would depend on how much of the OVH bill it actually saves.
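
For the record, a rough sketch of what that application-side approach could look like, with a hypothetical hot/cold store abstraction and an in-memory mapping standing in for whatever would really persist it (e.g. Cassandra); none of these names exist in James today:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a periodic job moves blobs older than a cutoff to a cheaper
// tier and records the new location; the read path consults the mapping first.
public class ColdStorageArchiver {
    interface ObjectStore {
        byte[] read(String blobId);
        void write(String blobId, byte[] payload);
        void delete(String blobId);
        Map<String, Instant> listWithCreationDate();
    }

    private final ObjectStore hotStore;
    private final ObjectStore coldStore;
    // Stand-in for a persisted mapping table (original blob id -> archived blob id).
    private final Map<String, String> hotToColdMapping = new ConcurrentHashMap<>();
    private final Duration archiveAfter;

    public ColdStorageArchiver(ObjectStore hotStore, ObjectStore coldStore, Duration archiveAfter) {
        this.hotStore = hotStore;
        this.coldStore = coldStore;
        this.archiveAfter = archiveAfter;
    }

    // Cron entry point: move every blob older than the cutoff to the cold store.
    public void archiveOldBlobs() {
        Instant cutoff = Instant.now().minus(archiveAfter);
        hotStore.listWithCreationDate().forEach((blobId, created) -> {
            if (created.isBefore(cutoff) && !hotToColdMapping.containsKey(blobId)) {
                String coldId = "cold-" + blobId;
                coldStore.write(coldId, hotStore.read(blobId));
                hotToColdMapping.put(blobId, coldId);
                hotStore.delete(blobId);
            }
        });
    }

    // Read path: transparently resolve archived blobs through the mapping.
    public byte[] read(String blobId) {
        String coldId = hotToColdMapping.get(blobId);
        return coldId == null ? hotStore.read(blobId) : coldStore.read(coldId);
    }
}
```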

chibenwa commented 6 months ago

> Can we check the distribution of file types across the 6 TB?

Even already-compressed files are base64 encoded in the messages, which means compression would recover at least the ~33% overhead of the encoding...
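
The arithmetic: base64 turns 3 raw bytes into 4 ASCII characters, so the stored form is about 1.33x the original, and gzip claws most of that back even when the underlying content is incompressible, because each base64 character only carries 6 bits of information. A quick self-contained check (exact sizes will vary slightly from run to run):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class Base64OverheadDemo {
    public static void main(String[] args) throws IOException {
        // 1 MiB of random data stands in for an already-compressed attachment.
        byte[] raw = new byte[1024 * 1024];
        new SecureRandom().nextBytes(raw);

        // MIME transports it base64 encoded: ~33% larger than the raw bytes.
        byte[] encoded = Base64.getEncoder().encode(raw);

        // Gzipping the base64 text recovers most of that overhead.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(encoded);
        }

        System.out.printf("raw: %d, base64: %d, gzip(base64): %d%n",
            raw.length, encoded.length, compressed.size());
    }
}
```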