scylladb / scylla-tools-java

Apache Cassandra, supplying tools for Scylla
Apache License 2.0

sstablemetadata is extremely slow #174

Open dyasny opened 4 years ago

dyasny commented 4 years ago

I am scanning a moderate number of sstable files from a backup (~3500 files, sampling every 9th file) using two methods: a Python script that reads the binary data directly, and sstablemetadata.

Scanning the entire set takes ~45 minutes with sstablemetadata and ~5 seconds with the Python script.

Of course, the script only looks for and returns the token ranges, but this is still a huge difference.
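For anyone who wants to reproduce the comparison, here is a minimal sketch of how the sampling and timing could be set up. It is not the script used above. The helper names, the `*-Data.db` glob, and the batch size are illustrative assumptions, and it assumes that `sstablemetadata` is on the PATH and accepts several sstable filenames per invocation. Timing one file per call against many files per call would show how much of the wall-clock time is per-invocation overhead rather than per-file work.

```python
import subprocess
import time
from pathlib import Path


def sample_every_nth(paths, n=9):
    """Return every n-th path from a sorted listing, starting with the first."""
    return sorted(paths)[::n]


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def time_tool(files, batch_size=1, tool="sstablemetadata"):
    """Run `tool` over `files` in batches; return elapsed wall-clock seconds.

    Assumption: the tool accepts multiple sstable filenames per call.
    """
    start = time.perf_counter()
    for batch in chunked(list(files), batch_size):
        subprocess.run(
            [tool, *(str(f) for f in batch)],
            stdout=subprocess.DEVNULL,  # only the timing matters here
            check=True,
        )
    return time.perf_counter() - start


def compare_batching(backup_dir, n=9, batch_size=100):
    """Time one-file-per-call against batched calls on the sampled files."""
    # Assumption: data components are named like "*-Data.db".
    sampled = sample_every_nth(Path(backup_dir).rglob("*-Data.db"), n)
    return {
        "files": len(sampled),
        "one_per_call_s": time_tool(sampled, batch_size=1),
        "batched_s": time_tool(sampled, batch_size=batch_size),
    }
```

Calling `compare_batching("/path/to/backup")` (a placeholder path) returns the two timings side by side. If the batched run is much faster, startup cost per invocation is the likely culprit; if not, the time is going into the per-file work itself.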

dorlaor commented 4 years ago

Doesn't sstablemetadata do tons of work, like tombstone calculation? See https://docs.datastax.com/en/dse/6.0/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSSTableMetadata.html

You can suggest an enhancement to add a flag that makes it emit just the ranges.
