[EDU, PUB] Switch to 1 index: Language analyzers should work per-field not per-index

fako commented 1 year ago

Currently we create OpenSearch index configurations per language. This was once a recommendation from the SURF Sara team. We can perhaps ask why, and what kind of problems may arise from creating a single configuration for all languages and then handling language on a field-by-field basis. However, it's quite certain that our current approach has some problems, as described in these issues:

fako commented 5 months ago

This is a response from a GPT that has been fine-tuned to help with Elasticsearch-related questions. I think the recommendations make sense. @peterdubbeldsurf

The strategy of segregating documents into language-specific indexes versus consolidating them into a single index with language analyzers specified per field hinges on several factors, each with its own set of benefits and potential drawbacks. Understanding these can help you make informed decisions about the best architecture for your project's needs. Let's explore the advantages of both approaches:

Language-Specific Indexes (e.g., `index-nl`, `index-en`, `index-de`)

Optimized Search Performance: Queries can be faster because each index is smaller and more focused, containing only documents of a specific language. This can lead to quicker search and indexing operations, especially when the query is language-specific.
Customized Analysis and Optimization: Each index can have its own set of analyzers, tokenizers, and filters tailored specifically to the language's nuances. This allows for more precise control over the search and analysis process, potentially improving relevancy and accuracy.
Simplified Management for Language-Specific Data: If there's a need to update mappings, settings, or analyzers for a particular language, having separate indexes makes this process more straightforward and reduces the risk of inadvertently impacting documents in other languages.
Scalability and Resource Allocation: It's easier to allocate resources and scale the infrastructure for individual languages based on demand. For instance, a more frequently searched language can be allocated more resources.
Security and Access Control: Easier to implement language-based access control policies. If there are requirements to restrict access based on language, separate indexes can facilitate this.
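
For reference, the per-language layout could be sketched like this. It is a minimal sketch, not our actual harvester configuration: the `title` and `description` fields and the `build_language_index` helper are hypothetical, and only the built-in `dutch`, `english` and `german` analyzers are assumed to exist on the cluster.

```python
from typing import Dict, Tuple

# Built-in language analyzers; the language codes mirror the index names above.
LANGUAGE_ANALYZERS: Dict[str, str] = {"nl": "dutch", "en": "english", "de": "german"}


def build_language_index(language: str) -> Tuple[str, dict]:
    """Return (index name, index body) for one language-specific index.

    Every text field in the index shares the analyzer of that language.
    Field names are hypothetical placeholders.
    """
    analyzer = LANGUAGE_ANALYZERS[language]
    body = {
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": analyzer},
                "description": {"type": "text", "analyzer": analyzer},
            }
        }
    }
    return f"index-{language}", body


# One index body per language, e.g. to send with a create-index request.
indices = dict(build_language_index(lang) for lang in LANGUAGE_ANALYZERS)
```

With this layout every new language means one more index to create, alias and keep in sync.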

Single Index with Language-Specific Analyzers

Unified View: A single index provides a unified view of the data, making it simpler to perform cross-language searches and aggregations. This can be particularly beneficial for multinational or multilingual applications.
Reduced Overhead: Managing one index instead of multiple reduces the complexity and overhead associated with index management, such as creation, deletion, and maintenance operations.
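Flexible Language Support: Adding a language means adding new fields (and, if needed, analyzers) to the single index rather than creating and managing a new index. Note that analyzers are static index settings, so defining a new custom analyzer requires closing and reopening the index, and the analyzer of an existing field cannot be changed without reindexing.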
Resource Efficiency: A single index might use resources more efficiently by avoiding the duplication of overhead associated with each index, such as metadata storage and management processes.
Consistent Data Handling: Ensures consistent application of cross-cutting concerns like security policies, backup strategies, and other administrative tasks, as there's only one index to manage.
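
The single-index alternative could look roughly like the sketch below. The assumptions are mine: each text field becomes an object with one sub-field per language (`title.nl`, `title.en`, ...), a `language` keyword field records the document language, and the helper names are hypothetical. Other per-field layouts are possible.

```python
from typing import Dict

# Built-in language analyzers, keyed by the language code used in field paths.
LANGUAGE_ANALYZERS: Dict[str, str] = {"nl": "dutch", "en": "english", "de": "german"}
# Hypothetical text fields that need language-aware analysis.
TEXT_FIELDS = ("title", "description")


def build_multilingual_index() -> dict:
    """Return one index body where the analyzer is chosen per field, not per index."""
    properties: Dict[str, dict] = {"language": {"type": "keyword"}}
    for field in TEXT_FIELDS:
        # Object field with one analyzed sub-field per language, e.g. "title.nl".
        properties[field] = {
            "properties": {
                lang: {"type": "text", "analyzer": analyzer}
                for lang, analyzer in LANGUAGE_ANALYZERS.items()
            }
        }
    return {"mappings": {"properties": properties}}


def to_document(language: str, title: str, description: str) -> dict:
    """Place the text under the sub-field that matches the document language."""
    return {
        "language": language,
        "title": {language: title},
        "description": {language: description},
    }
```

A Dutch document then only fills `title.nl` and `description.nl`, so it is analyzed once, with the Dutch analyzer.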

Decision Considerations

Query Performance vs. Management Simplicity: If query performance and language-specific optimizations are critical, language-specific indexes may be preferable. If simplicity of management and the ability to easily scale to new languages are more important, a single index might be better.
Resource Allocation: Consider whether your infrastructure can more efficiently handle many smaller indexes or a larger, more complex index.
Application Requirements: The nature of your application's search requirements—such as whether searches are typically performed within a single language or across multiple languages—can influence the best choice.
Future Expansion: If you anticipate needing to frequently add support for new languages, the ease of doing so with a single index might outweigh the benefits of separate indexes.
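
To make the "within one language or across languages" consideration concrete, here is a hedged sketch of how a search body could be built against a single index. It assumes per-language sub-fields such as `title.nl` (a hypothetical layout) and uses a standard `multi_match` query; `build_search_body` is an illustrative helper, not existing harvester code.

```python
from typing import Iterable, List, Optional

# Hypothetical language codes and text fields; field paths look like "title.nl".
LANGUAGES = ("nl", "en", "de")
TEXT_FIELDS = ("title", "description")


def build_search_body(text: str, languages: Optional[Iterable[str]] = None) -> dict:
    """Build a multi_match query over the sub-fields of the chosen languages.

    Passing no languages searches across all of them in one request.
    """
    selected: List[str] = list(languages) if languages else list(LANGUAGES)
    fields = [f"{field}.{lang}" for lang in selected for field in TEXT_FIELDS]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",
            }
        }
    }
```

A language-specific search only passes one language code, while a cross-language search omits the argument. With separate indexes the same choice is made by selecting which index names to query.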

Ultimately, the choice between language-specific indexes and a single index with language-specific analyzers depends on your specific use case, including performance requirements, management capabilities, and future expansion plans.

fako commented 1 week ago

Done

surfedushare / harvester