surfedushare / harvester

ETL pipeline and search engine for Edusources and Publinova
MIT License

[EDU, PUB] Switch to 1 index: Language analyzers should work per-field not per-index #32

Closed: fako closed this issue 1 week ago

fako commented 1 year ago

Currently we create OpenSearch index configurations per language. This was once a recommendation from the SURF Sara team. We could ask them why, and what kind of problems may arise from creating a single configuration for all languages and then handling language on a field-by-field basis. However, it's quite certain that our current approach has some problems, as described in these issues:

fako commented 5 months ago

This is a response from a GPT that has been fine-tuned to help with Elasticsearch-related questions. I think the recommendations make sense. @peterdubbeldsurf

The strategy of segregating documents into language-specific indexes versus consolidating them into a single index with language analyzers specified per field hinges on several factors, each with its own set of benefits and potential drawbacks. Understanding these can help you make informed decisions about the best architecture for your project's needs. Let's explore the advantages of both approaches:

Language-Specific Indexes (e.g., index-nl, index-en, index-de)

  1. Optimized Search Performance: Queries can be faster because each index is smaller and more focused, containing only documents of a specific language. This can lead to quicker search and indexing operations, especially when the query is language-specific.

  2. Customized Analysis and Optimization: Each index can have its own set of analyzers, tokenizers, and filters tailored specifically to the language's nuances. This allows for more precise control over the search and analysis process, potentially improving relevancy and accuracy.

  3. Simplified Management for Language-Specific Data: If there's a need to update mappings, settings, or analyzers for a particular language, having separate indexes makes this process more straightforward and reduces the risk of inadvertently impacting documents in other languages.

  4. Scalability and Resource Allocation: It's easier to allocate resources and scale the infrastructure for individual languages based on demand. For instance, a more frequently searched language can be allocated more resources.

  5. Security and Access Control: Language-based access control policies are easier to implement. If there are requirements to restrict access based on language, separate indexes can facilitate this.
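To make the per-language approach concrete, here is a minimal sketch of how one index body per language could be built. It only constructs plain dictionaries and does not talk to a cluster. The index names follow the index-nl / index-en / index-de example above; the field names `title` and `text` and the helper `build_language_index` are hypothetical and are not taken from the harvester code. The analyzer names `dutch`, `english` and `german` are the built-in language analyzers.

```python
from __future__ import annotations

# Language code -> built-in language analyzer name.
LANGUAGE_ANALYZERS = {"nl": "dutch", "en": "english", "de": "german"}


def build_language_index(language: str) -> tuple[str, dict]:
    """Return (index name, index body) for one language-specific index.

    Every text field in the index uses the same analyzer, because the
    whole index is assumed to hold documents of a single language.
    """
    analyzer = LANGUAGE_ANALYZERS[language]
    body = {
        "mappings": {
            "properties": {
                # Hypothetical field names, chosen for illustration only.
                "title": {"type": "text", "analyzer": analyzer},
                "text": {"type": "text", "analyzer": analyzer},
            }
        }
    }
    return f"index-{language}", body


# One configuration per language: three indexes to create and maintain.
indices = dict(build_language_index(lang) for lang in LANGUAGE_ANALYZERS)
```

Each entry in `indices` would be passed to a separate index creation call, which is where the per-index management overhead discussed in the next list comes from.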

Single Index with Language-Specific Analyzers

  1. Unified View: A single index provides a unified view of the data, making it simpler to perform cross-language searches and aggregations. This can be particularly beneficial for multinational or multilingual applications.

  2. Reduced Overhead: Managing one index instead of multiple reduces the complexity and overhead associated with index management, such as creation, deletion, and maintenance operations.

  3. Flexible Language Support: It's easier to add support for new languages by updating the single index's settings to include new analyzers, rather than creating and managing a new index.

  4. Resource Efficiency: A single index might use resources more efficiently by avoiding the duplication of overhead associated with each index, such as metadata storage and management processes.

  5. Consistent Data Handling: Ensures consistent application of cross-cutting concerns like security policies, backup strategies, and other administrative tasks, as there's only one index to manage.
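For comparison, a minimal sketch of the single-index approach, where the analyzer is chosen per field instead of per index. It again only builds dictionaries. The layout (one object field per text field, with one sub-field per language such as `title.nl`) is one possible design and an assumption on my part; the field names and both helper functions are hypothetical. The `multi_match` query illustrates the cross-language search mentioned under "Unified View".

```python
# Language code -> built-in language analyzer name.
LANGUAGE_ANALYZERS = {"nl": "dutch", "en": "english", "de": "german"}

# Hypothetical text fields, chosen for illustration only.
TEXT_FIELDS = ("title", "text")


def build_single_index_body() -> dict:
    """Return one index body where each language sub-field has its own analyzer."""
    properties = {}
    for field in TEXT_FIELDS:
        # An object field with one text sub-field per language,
        # e.g. title.nl, title.en and title.de.
        properties[field] = {
            "properties": {
                lang: {"type": "text", "analyzer": analyzer}
                for lang, analyzer in LANGUAGE_ANALYZERS.items()
            }
        }
    return {"mappings": {"properties": properties}}


def build_cross_language_query(search_text: str) -> dict:
    """Return a query body that searches every language sub-field at once."""
    fields = [
        f"{field}.{lang}" for field in TEXT_FIELDS for lang in LANGUAGE_ANALYZERS
    ]
    return {"query": {"multi_match": {"query": search_text, "fields": fields}}}


index_body = build_single_index_body()
query_body = build_cross_language_query("wiskunde")
```

Adding a language in this layout means adding one entry to `LANGUAGE_ANALYZERS` and updating the mapping of the existing index, rather than creating another index.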

Decision Considerations

Ultimately, the choice between language-specific indexes and a single index with language-specific analyzers depends on your specific use case, including performance requirements, management capabilities, and future expansion plans.

fako commented 1 week ago

Done