Apache Solr for distributed indexing

maestre3d commented 4 years ago

Is your feature request related to a problem? Please describe. Yes, right now distributed data indexation is impossible and we still need to implement complex query building patterns in order to satisfy our indexation standards.

Describe the solution you'd like Currently, we use Redis as cache and PostgreSQL as main RDBMS, yet for full-text searching we need something more powerful and easy to implement. Using Apache Solr as distributed indexing database for the job is one of the best choices we have, even more if we are already using Apache Zookeeper (required by Solr) and Apache Kafka.

Describe alternatives you've considered Since we might use EKS (e.g. Elasticsearch) for other inquiries, we should consider using the ElasticStack.

Additional context No

maestre3d commented 4 years ago

Consider using Apache Solr (+ Apache Lucene + Apache Zookeeper), it's widely use and its easier to do full-text search even in docs like xml and PDFs.

Since we're already using Apache Kafka and most likely Apache Hadoop and Hive, we still must use Apache Zookeeper for sharded clusters management anyway.

maestre3d commented 4 years ago

Now reconsidering using Elasticsearch.

Here is the full explanation: Currently, we have planned to deploy almost everything on AWS. While there is an Open Source service of Elasticsearch in AWS, there's no managed service for Apache Solr available in AWS. Even though our platform is using Kafka, we would be using MKS (Managed Kafka Service) in AWS. Thus, we wouldn't need to manage our Zookeeper cluster by ourselves since AWS takes charge. The same goes for Apache Spark and Hadoop+MapReduce.

In addition, if we chose Solr anyways, we would have to manage all EC2 instances along with AutoScaling groups, Service Discovery registry and Load Balancing.

Hence, we know this will eventually come with a price and we still consider an Apache Solr migration in a future for certain services that could take an advantage of the trade-offs Apache Solr has to offer (like fine-grained search on media/blob service's files).

maestre3d commented 4 years ago

We should use the CQRS pattern whenever robust querying is needed from now on.

Why? By segregating queries and writes from code, we gain the option to use different databases for each operation. For example, write commands could be writing to a PSQL database with more structured/normalized data, or maybe we would want to store our aggregate as JSON, so we could take advantage of Apache Cassandra or AWS DynamoDB.

In the other hand, for the query side, we could use a fine-grained indexation/querying database system like the ones we've been discussing before and get the benefits of full text search in NRT (Near Real Time).

Thus, CQRS gives us a way to model complex business logic, avoiding the use of simple CRUD operations.

Here is a diagram that explains this pattern too with a single data source. This could be used whenever a complex domain modeling is needed (avoiding simple CRUD operations). Simple CQRS

Now, this diagram shows us the way we could be using this pattern in order to segregate our data sources into particular data sets that can be scaled independently. This could be used whenever complex querying is needed like Media and Author services. Read-Write separated data with CQRS

Finally, there is another way to handle our data using CQRS. That way is using Event Sourcing (ES), which aggregates data instead of replacing it like conventional ways. It's events are immutable as in real life events are. This is highly used when data is constantly changing and you must keep track of it for different reasons. Moreover, ES brings us the possibility of easy rollback because we can trace the previous state of our aggregate using the event store, so we could rebuild aggregates. Here is a diagram that shows how CQRS + ES can be combined. CQRS with Event Sourcing

By the time, we are still turning down the need of full Event Sourcing (ES) since we already have a way to successfully handle distributed transactions (SAGA). However, we know we still must tune our transaction system even more to be completely resilient and simple (using Orchestration instead Choreography, tuning the Event Server from Alexandria Core to use distributed tracing, metrics, circuit breakers, retry policies and logging by default using the chain of responsibility pattern for example).

Although we take advantage of an Event-Driven ecosystem to communicate to other services asynchronously, we are still evaluating the need of a dedicated event store, mostly for rollbacks/compensating operations.

More information can be found here.

maestre3d / alexandria

Apache Solr for distributed indexing #27