thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0

medusa list-backups works very slow #522

Open tabbi opened 2 years ago

tabbi commented 2 years ago


Hello! I have quite a big folder of Cassandra backups that uses 1.9 TB of disk space; it holds one year of backups. medusa list-backups takes about 10-20 minutes to show the list of backups. Any ideas how to fix that?

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: MED-40

rzvoncek commented 1 year ago

I spent some time looking into this. I managed to get the listing of backups in our GCS bucket from ~1:30m to ~8s.

The commit/branch is here: https://github.com/thelastpickle/cassandra-medusa/commit/e451049a2596f5ef7fbee37ee0d67a9cafea3593

The solution, in short, is to make Medusa use asyncio (~ async def) throughout the code, and not just in abstract_storage and its children. Then we can neatly list all the blobs (which takes about 5s) and then scatter/gather the cluster backup statuses (each of which reads a tokenmap blob to work out the node count).

It took a lot of changes to do just this, and I currently don't have the room to do the same for other Medusa commands. Unless someone else picks this up, it'll have to wait for a refactoring week.
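
For readers unfamiliar with the pattern, here is a minimal sketch of the list-then-scatter/gather idea described above. It is not the code from the linked commit: the in-memory FAKE_BUCKET, the blob paths, and the helpers list_blobs, read_blob and backup_node_count are all hypothetical stand-ins, and the sleeps only simulate storage latency.

```python
import asyncio
import json

# Hypothetical stand-in for a storage bucket: blob path -> blob content.
# Real Medusa paths and tokenmap contents differ; this is illustration only.
FAKE_BUCKET = {
    "backup-1/tokenmap.json": json.dumps({"node-a": [], "node-b": []}),
    "backup-1/data.db": "...",
    "backup-2/tokenmap.json": json.dumps({"node-a": [], "node-b": [], "node-c": []}),
    "backup-2/data.db": "...",
}


async def list_blobs():
    """List every blob path once (simulated listing call)."""
    await asyncio.sleep(0.01)
    return sorted(FAKE_BUCKET)


async def read_blob(path):
    """Read one blob (simulated network latency)."""
    await asyncio.sleep(0.05)
    return FAKE_BUCKET[path]


async def backup_node_count(tokenmap_path):
    """Work out the node count of one backup from its tokenmap blob."""
    tokenmap = json.loads(await read_blob(tokenmap_path))
    backup_name = tokenmap_path.split("/")[0]
    return backup_name, len(tokenmap)


async def list_backups():
    blobs = await list_blobs()
    tokenmaps = [b for b in blobs if b.endswith("tokenmap.json")]
    # Scatter one coroutine per backup, then gather the results; the
    # tokenmap reads overlap instead of running one after another.
    results = await asyncio.gather(*(backup_node_count(p) for p in tokenmaps))
    return dict(results)


if __name__ == "__main__":
    print(asyncio.run(list_backups()))
```

With sequential reads, the total time grows with the number of backups times the per-read latency; with asyncio.gather the reads wait concurrently, so the total stays close to the latency of the slowest read.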

tabbi commented 2 months ago

Hello! This issue is still open, right? Updating to the latest version, 0.22.2, didn't solve it :( The medusa list-backups command takes more than 1 hour to complete for us.

rzvoncek commented 1 month ago

Hi, yes, this is still an issue 🥲