teragrep / pth_06

Teragrep Datasource for Apache Spark
GNU Affero General Public License v3.0

Add support for Cassandra querying #13

Closed Tiihott closed 11 months ago

Tiihott commented 1 year ago

First, create a proof of technology for Cassandra querying in pth_06. If it meets all the requirements of the pth_06 functions that will use Cassandra querying, continue with the full implementation of Cassandra querying in pth_06.

Tiihott commented 1 year ago

pth_06 uses wildcards in query strings. Cassandra does not support wildcards natively. Using wildcards properly in Cassandra requires a full-text search engine such as Solr, which DataStax DSE uses alongside Cassandra.

Tiihott commented 1 year ago

Support for LIKE queries (and therefore wildcards) in Cassandra can be achieved with SASI (SSTable Attached Secondary Index):
https://docs.datastax.com/en/developer/java-driver/4.17/manual/query_builder/schema/index/
https://docs.datastax.com/en/developer/java-driver/4.3/manual/query_builder/relation/
https://cassandra.apache.org/doc/latest/cassandra/cql/SASI.html
http://www.doanduyhai.com/blog/?p=2058#sasi_perf_benchmarks
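As a rough illustration of the idea, the sketch below builds a SASI index statement and translates a wildcard pattern into a CQL relation. It is only a sketch under assumptions: the class and method names, the table and column names, and the use of `*` as the pth_06 wildcard character are hypothetical and not taken from the pth_06 code. It assumes SASI accepts only prefix, suffix and contains patterns, so a wildcard in the middle of a value is rejected.

```java
final class SasiLikeSketch {

    // Hypothetical helper: CQL for creating a SASI index on one column.
    // The mode is PREFIX or CONTAINS; suffix and substring LIKE need CONTAINS.
    static String createSasiIndex(String table, String column, String mode) {
        return "CREATE CUSTOM INDEX ON " + table + " (" + column + ") "
                + "USING 'org.apache.cassandra.index.sasi.SASIIndex' "
                + "WITH OPTIONS = {'mode': '" + mode + "'}";
    }

    // Translates a pattern that uses '*' (assumed wildcard) into a CQL relation.
    // Only a leading and/or trailing wildcard is accepted.
    static String toRelation(String column, String pattern) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.length() > 1 && pattern.endsWith("*");
        String core = pattern.substring(leading ? 1 : 0,
                pattern.length() - (trailing ? 1 : 0));
        if (core.isEmpty() || core.contains("*")) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // Naive quoting for illustration only; real code should use bind markers.
        // A literal '%' in the value is not handled here.
        String value = core.replace("'", "''");
        if (!leading && !trailing) {
            return column + " = '" + value + "'";
        }
        return column + " LIKE '" + (leading ? "%" : "") + value
                + (trailing ? "%" : "") + "'";
    }

    public static void main(String[] args) {
        System.out.println(createSasiIndex("ks.logfile", "host", "CONTAINS"));
        System.out.println(toRelation("host", "web*"));
    }
}
```

The strings are built by hand only to keep the example free of dependencies; in pth_06 the equivalent would presumably go through the Java driver's query builder linked above.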

Tiihott commented 1 year ago

After testing SASI in the pth_06 Cassandra branch, SASI looks like a viable option for implementing wildcard search in the Cassandra queries, in almost the same way as it is implemented in the pth_06 MariaDB and S3 queries. Because of the SASI and Cassandra limitations described in the sasi_perf_benchmarks link above, SASI indexing in CONTAINS mode is not recommended on columns with long strings, such as the payload column, due to performance and disk space usage issues. It is also recommended to avoid substring search in Cassandra queries in general.

If SASI is not an option, for example because of the Cassandra cluster configuration, the Cassandra condition walker has to exclude wildcard usage when constructing the query condition (a list of CQL relations that can be appended to the WHERE clause of the CQL query).
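One possible reading of this fallback is sketched below: the walker emits equality relations as usual and simply leaves out any condition that contains a wildcard when SASI is unavailable. The class name, the `sasiAvailable` flag, the `*` wildcard character and the column names are assumptions made for illustration and do not reflect the actual pth_06 condition walker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConditionWalkerSketch {

    // Builds the list of CQL relations from column -> value conditions.
    // Wildcard conditions become LIKE relations only when SASI is available;
    // otherwise they are left out of the list entirely.
    static List<String> relations(Map<String, String> conditions, boolean sasiAvailable) {
        List<String> relations = new ArrayList<>();
        for (Map.Entry<String, String> condition : conditions.entrySet()) {
            String column = condition.getKey();
            String value = condition.getValue();
            // Naive quoting for illustration only; real code should use bind markers.
            if (!value.contains("*")) {
                relations.add(column + " = '" + value + "'");
            } else if (sasiAvailable) {
                relations.add(column + " LIKE '" + value.replace('*', '%') + "'");
            }
        }
        return relations;
    }

    // Joins the relations into a WHERE clause, or returns an empty string.
    static String whereClause(List<String> relations) {
        return relations.isEmpty() ? "" : " WHERE " + String.join(" AND ", relations);
    }

    public static void main(String[] args) {
        Map<String, String> conditions = new LinkedHashMap<>();
        conditions.put("stream", "main");
        conditions.put("host", "web*");
        System.out.println(whereClause(relations(conditions, false)));
        System.out.println(whereClause(relations(conditions, true)));
    }
}
```

Dropping a relation widens the result set, so under this reading the caller would still have to filter the returned rows against the wildcard condition afterwards.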

Tiihott commented 1 year ago

Missing Cassandra features that need to be implemented on the software side:

This may change in future releases of Cassandra, but for now OR and '!=' are not supported: https://cassandra.apache.org/doc/stable/cassandra/cql/SASI.html#limitations-and-caveats

"Not Equals and OR support have been removed in this release while changes are made to Cassandra itself to support them."
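A minimal sketch of how the software side could cover these two gaps: run one query per OR branch and merge the results, and apply '!=' as a filter on the returned rows. This is an assumption about one possible approach, not the pth_06 design; the class, the method names and the `runQuery` function (a stand-in for whatever executes a CQL WHERE clause) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

final class OrNotEqualsSketch {

    // Emulates "a OR b" by running each branch as its own query and
    // merging the rows; duplicates are dropped, first-seen order is kept.
    static <R> List<R> unionOfBranches(List<String> branchWheres,
                                       Function<String, List<R>> runQuery) {
        Set<R> merged = new LinkedHashSet<>();
        for (String where : branchWheres) {
            merged.addAll(runQuery.apply(where));
        }
        return new ArrayList<>(merged);
    }

    // Emulates "column != excluded" by filtering rows after they are fetched.
    static <R> List<R> notEquals(List<R> rows, Function<R, String> column, String excluded) {
        return rows.stream()
                .filter(row -> !excluded.equals(column.apply(row)))
                .collect(Collectors.toList());
    }
}
```

Both emulations fetch more rows than a native OR or '!=' would, so they are only practical when each branch is already reasonably selective.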

Tiihott commented 12 months ago

Looking into Apache Druid as an alternative to Apache Cassandra because of the Cassandra limitations stated above. Druid should offer much better filtering and search functionality than Cassandra.