Open codefromthecrypt opened 8 years ago
We added 20G of off-heap memtable allocation, which reduced the number of flushes and resulted in less compaction.
I guess I was thinking that since there is some useful code around the schema loading, like in zipkin.storage.cassandra.Schema, it might be good if it were somehow extensible, e.g. if you could provide your own schema or 'upgrade schema' file, and/or modify some of the parameters in the default schema, like replication factor.
The only way to do this is somewhat basic: put said file in front of the classpath!
In the case of Docker, you'd overwrite the file at /zipkin/cassandra-schema-cql3.txt
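To illustrate why putting a file "in front of the classpath" works, here is a minimal sketch. It is not Zipkin's loading code: the class and method names are hypothetical, and only the resource name comes from the comment above. It assumes Java 11+ and simply shows that a class loader returns the first copy of a resource it finds along its search path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SchemaOverrideSketch {
  // Resource name from the comment above, without the /zipkin prefix used in Docker.
  static final String SCHEMA_RESOURCE = "cassandra-schema-cql3.txt";

  /** Reads the first copy of the schema resource visible to the given class loader. */
  static String readSchema(ClassLoader loader) throws IOException {
    try (InputStream in = loader.getResourceAsStream(SCHEMA_RESOURCE)) {
      if (in == null) throw new IOException(SCHEMA_RESOURCE + " not found");
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  /** Builds a loader that searches the override directory before the default one. */
  static String loadWithOverride(Path overrideDir, Path defaultDir) throws IOException {
    URL[] searchPath = {overrideDir.toUri().toURL(), defaultDir.toUri().toURL()};
    // A null parent keeps the application classpath out of this demonstration.
    try (URLClassLoader loader = new URLClassLoader(searchPath, null)) {
      return readSchema(loader);
    }
  }

  public static void main(String[] args) throws IOException {
    // Two directories stand in for "bundled jar" and "site override" classpath entries.
    Path defaultDir = Files.createTempDirectory("default");
    Path overrideDir = Files.createTempDirectory("override");
    Files.writeString(defaultDir.resolve(SCHEMA_RESOURCE), "-- bundled schema");
    Files.writeString(overrideDir.resolve(SCHEMA_RESOURCE), "-- site schema");
    // The override directory is listed first, so its copy is the one that gets read.
    System.out.println(loadWithOverride(overrideDir, defaultDir));
  }
}
```

Swapping the two arguments to `loadWithOverride` makes the bundled copy win, which is why the ordering of classpath entries matters for this kind of override.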
Doing arbitrary upgrades could be dodgy. There's careful logic about the upgrade, and it checks for very specific things because CQL can't do everything. A log message might be misleading if we used this check but did something else.
ex. "/cassandra-schema-cql3-upgrade-1.txt" has this check
    static boolean hasUpgrade1_defaultTtl(KeyspaceMetadata keyspaceMetadata) {
      // TODO: we need some approach to forward-check compatibility as well.
      // backward: this code knows the current schema is too old.
      // forward: this code knows the current schema is too new.
      return keyspaceMetadata.getTable("traces").getOptions().getDefaultTimeToLive() > 0;
    }
We have tests that show the effects of this work, but we can't promise that arbitrary changes will work, so we're unlikely to be able to support them.
I'd recommend only replacing the semantic contents of the existing schema files for this reason. Also, there's a lot of folks who use cassandra.. maybe there are other tools available to keep schema up to date which don't require zipkin's ENSURE_SCHEMA feature?
Increasing RF to 3+ is important in production.
But I don't know what's the best way to do that without breaking dev environments. Currently a warning is printed, ref https://github.com/openzipkin/zipkin/blob/master/zipkin-storage/cassandra/src/main/java/zipkin/storage/cassandra/Schema.java#L43
Other important things to do to a problem environment are
@adriancole is this still relevant? If yes, I can search around the issues and put up some 'hints' in the documentation like the above, as well as a warning that sites should really not rely on the 'demo' Cassandra schema configuration we provide. We should not become Cassandra tweaking experts though, merely hint that sites are responsible for squeezing the most out of their storage, and we'll just tell them what is important in terms of zipkin storage and indexing needs.
If not and all this is hopelessly outdated, feel free to close :-)
we could probably handle replication factor as an ENV variable as we do in elasticsearch, and leave it at that for now.
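A rough sketch of what the ENV variable idea could look like follows. Everything here is an assumption for illustration: the variable name `CASSANDRA_REPLICATION_FACTOR`, the class and method names, and the `zipkin` keyspace name are hypothetical and not existing Zipkin settings. It also uses `SimpleStrategy` only to keep the example short; a multi-datacenter cluster would more likely want `NetworkTopologyStrategy`.

```java
import java.util.Map;

public class ReplicationFactorSketch {
  // Hypothetical variable name, chosen for this example only.
  static final String ENV_NAME = "CASSANDRA_REPLICATION_FACTOR";
  // Keeps single-node dev environments working when the variable is unset.
  static final int DEV_DEFAULT = 1;

  /** Parses the replication factor from the environment, falling back to the dev default. */
  static int replicationFactor(Map<String, String> env) {
    String raw = env.get(ENV_NAME);
    if (raw == null || raw.isBlank()) return DEV_DEFAULT;
    int rf = Integer.parseInt(raw.trim());
    if (rf < 1) throw new IllegalArgumentException(ENV_NAME + " must be >= 1, was " + rf);
    return rf;
  }

  /** Renders a CREATE KEYSPACE statement with the chosen replication factor. */
  static String createKeyspaceCql(String keyspace, int rf) {
    return "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': " + rf + "}";
  }

  public static void main(String[] args) {
    int rf = replicationFactor(System.getenv());
    // Printing instead of executing keeps the sketch free of driver dependencies.
    System.out.println(createKeyspaceCql("zipkin", rf));
  }
}
```

Defaulting to 1 when the variable is absent leaves dev setups unchanged, while a production deployment could export the variable as 3 or more before the schema is first installed.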
All right, I can give that a go after you've landed the DataStax Driver 4.0 mothership: https://github.com/openzipkin/zipkin/pull/3246
from @mikewrighton
It's currently tribal knowledge that the built-in schema isn't ideal for all production environments. We take some steps that make it easier for tests to pass, etc.
It would be nice for users and also for benchmarkers to use more realistic schema settings.
@openzipkin/cassandra do you know of a list of things about the schema that would certainly need to change in a multi-node cassandra cluster in production? If you can enumerate them, I can help document and maybe we can brainstorm a "dev mode" flag or some such that makes test-level options not the default.