tinode / chat

Instant messaging platform. Backend in Go. Clients: Swift iOS, Java Android, JS webapp, scriptable command line; chatbots
GNU General Public License v3.0
12.06k stars 1.88k forks

Search messages in conversation by keyword #675

Open gabriel-vasile opened 3 years ago

gabriel-vasile commented 3 years ago

I guess there is no ETA for this feature and no plans to ever implement it, but I'm willing to contribute the server code for it. Please tell me how you think this should be done with some references to the code and I'll open a pr.

or-else commented 3 years ago

Can you write up a proposal? Something along the lines of:

  1. Features, such as support/no support for:
     a. Grammar, such as stemming.
     b. Chinese and other languages which require word segmentation.
     c. Language detection.
     d. What to do when the language is not supported.
  2. Location of query rewrite if any of the features from 1 require it.
  3. Changes to the external API
  4. DB organization, index structure.
  5. Cluster mode changes

gabriel-vasile commented 3 years ago

I was thinking about integrating something like Elasticsearch. Writing a search engine from scratch is not a PR, it's a full time job for a team.

or-else commented 3 years ago

In case of an external search provider I think the following is needed:

  1. Client-side API for sending search queries to the server and getting search results.
  2. Server-side API (plugin) for sending messages to the provider for indexing, sending queries and getting responses.
  3. At least one client with UI for creating queries and showing results.
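As a rough illustration of item 2, the server-side plugin could be a small Go interface that the server calls both to index messages and to run queries. This is only a sketch: `SearchProvider`, `Message`, and `memoryProvider` are hypothetical names that do not exist in Tinode, and the in-memory provider does plain substring matching as a stand-in for a real search engine.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical, minimal view of a stored chat message.
type Message struct {
	Topic   string
	SeqID   int
	Content string
}

// SearchProvider is a hypothetical plugin interface (not part of Tinode).
type SearchProvider interface {
	// Index submits one message to the provider for indexing.
	Index(msg Message) error
	// Search returns the sequence IDs of messages in topic that match query.
	Search(topic, query string) ([]int, error)
}

// memoryProvider is a toy provider that keeps messages in memory and
// matches by case-insensitive substring.
type memoryProvider struct {
	msgs []Message
}

func (p *memoryProvider) Index(msg Message) error {
	p.msgs = append(p.msgs, msg)
	return nil
}

func (p *memoryProvider) Search(topic, query string) ([]int, error) {
	var ids []int
	q := strings.ToLower(query)
	for _, m := range p.msgs {
		if m.Topic == topic && strings.Contains(strings.ToLower(m.Content), q) {
			ids = append(ids, m.SeqID)
		}
	}
	return ids, nil
}

func main() {
	var sp SearchProvider = &memoryProvider{}
	_ = sp.Index(Message{Topic: "grpA", SeqID: 1, Content: "Lunch at noon?"})
	_ = sp.Index(Message{Topic: "grpA", SeqID: 2, Content: "Running late"})
	ids, _ := sp.Search("grpA", "lunch")
	fmt.Println(ids)
}
```

A provider backed by Elastic or another engine would implement the same two methods with network calls in place of the slice.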

gabriel-vasile commented 3 years ago

About the interaction between tinode and the search provider, there are two approaches to indexing:

  1. using a plugin, like you said
  2. let the search engine do it

For 1., there is the issue of existing messages: it must be possible to index all existing messages when the index is lost, has just been initialized, or for any other reason.

For 2., with Elastic at least, indexing is easily solved with a pipeline. Elastic supports MySQL, Mongo, and RethinkDB as data sources. Users provide a pipeline, and Elastic periodically queries the database for new messages and indexes them. I'm not sure other search providers have this feature.
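For the plugin approach, re-indexing existing messages could be a batched walk over the message store with a cursor. The sketch below assumes a store sorted by ascending sequence ID; `Reindex`, `fetchBatch`, and the `index` callback are hypothetical names, and the slice stands in for a real database query.

```go
package main

import "fmt"

// Message is a hypothetical, minimal view of a stored chat message.
type Message struct {
	SeqID   int
	Content string
}

// fetchBatch stands in for a database query: it returns up to limit
// messages with SeqID greater than after. store must be sorted by SeqID.
func fetchBatch(store []Message, after, limit int) []Message {
	var out []Message
	for _, m := range store {
		if m.SeqID > after {
			out = append(out, m)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

// Reindex walks the whole store in batches and hands each batch to index.
// It returns the number of messages submitted before finishing or failing.
func Reindex(store []Message, batchSize int, index func([]Message) error) (int, error) {
	total, cursor := 0, 0
	for {
		batch := fetchBatch(store, cursor, batchSize)
		if len(batch) == 0 {
			return total, nil
		}
		if err := index(batch); err != nil {
			return total, err
		}
		total += len(batch)
		cursor = batch[len(batch)-1].SeqID // resume after the last indexed message
	}
}

func main() {
	store := []Message{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	n, err := Reindex(store, 2, func(batch []Message) error {
		fmt.Println("indexing batch of", len(batch))
		return nil
	})
	fmt.Println(n, err)
}
```

Persisting the cursor between runs would let an interrupted rebuild resume instead of starting over.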

I think we should first decide if we are going to support more than one search engine and which one/s in particular. In my use case, supporting just Elastic is fine and it would make the implementation of this feature so much easier.

or-else commented 3 years ago

I would separate the concerns of starting a new service from scratch vs upgrading an existing service with message search.

I do see value in having Elastic or any other provider go to the DB directly. It also has drawbacks. For example, if we implement any sort of encryption at rest (a feature some people want), then the direct intake from the DB won't work.

> I think we should first decide if we are going to support more than one search engine

I think there should be a choice. It does not need to be implemented immediately, a single provider is a good start. But there is value in an abstraction layer. Tinode is frequently used in organizations with an established infrastructure. If they use Solr or Algolia then it would be a harder decision if Tinode supports Elastic only.
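One possible shape for such an abstraction layer is a registry of named providers, with the active one chosen by configuration. This is a sketch only: `Searcher`, `Register`, `Open`, and the two stub providers are hypothetical names, and the stubs are placeholders rather than real Elastic or Solr clients.

```go
package main

import (
	"fmt"
	"sort"
)

// Searcher is a hypothetical provider interface, trimmed to one method.
type Searcher interface {
	Name() string
}

// factories maps a provider name from the config to its constructor.
var factories = map[string]func() Searcher{}

// Register makes a provider available under name.
func Register(name string, f func() Searcher) {
	if _, dup := factories[name]; dup {
		panic("search: provider registered twice: " + name)
	}
	factories[name] = f
}

// Available lists registered provider names in sorted order.
func Available() []string {
	names := make([]string, 0, len(factories))
	for n := range factories {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Open instantiates the provider selected in the config.
func Open(name string) (Searcher, error) {
	f, ok := factories[name]
	if !ok {
		return nil, fmt.Errorf("search: unknown provider %q (available: %v)", name, Available())
	}
	return f(), nil
}

// Placeholder providers; real ones would wrap an actual client.
type elasticStub struct{}

func (elasticStub) Name() string { return "elastic" }

type solrStub struct{}

func (solrStub) Name() string { return "solr" }

func init() {
	Register("elastic", func() Searcher { return elasticStub{} })
	Register("solr", func() Searcher { return solrStub{} })
}

func main() {
	s, err := Open("elastic")
	fmt.Println(s.Name(), err)
	_, err = Open("algolia")
	fmt.Println(err)
}
```

Starting with a single registered provider keeps the first implementation small while leaving room for others later.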

gabriel-vasile commented 3 years ago

I guess by 'encryption at rest' you mean end-to-end encryption and not just server-side encryption. If that's the case, then there is no other choice but to let the clients do the search. Sorry, but I think I'll have to drop working on this, as I'm not really familiar with any of the client SDKs or their languages.

or-else commented 3 years ago

This is a useful feature. No need to close even if you don't want to work on it.

I meant what I said: encryption at rest.

gabriel-vasile commented 3 years ago

> I meant what I said: encryption at rest.

What you said is not clear enough. You can have end-to-end encryption (clients have the encrypt/decrypt keys) or server-side encryption (the server has the encrypt/decrypt key). In both cases the data is encrypted "at rest". But one has access to the plain, unencrypted data on the server and allows you to search through it, the other doesn't.
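To make the server-side case concrete, here is a minimal sketch using AES-GCM from the Go standard library. The key lives on the server, so stored content is ciphertext, yet the server can still recover the plaintext and hand it to an indexer. `seal` and `open` are hypothetical helper names, and key management is left out entirely.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16, 24, or 32 byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext to its first argument, here the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal; it fails if the key is wrong or the data was altered.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // server-held key; all zeros for the demo only
	stored, err := seal(key, []byte("lunch at noon"))
	if err != nil {
		panic(err)
	}
	// What lands in the DB is opaque to anything reading the DB directly.
	fmt.Printf("stored %d bytes\n", len(stored))
	// The server itself can still decrypt before indexing.
	plain, err := open(key, stored)
	fmt.Println(string(plain), err)
}
```

With end-to-end encryption only the clients would hold `key`, so neither the server nor an external engine could perform the `open` step.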

rkgarcia commented 3 years ago

What about using full-text search from the databases? With end-to-end encryption, the search must be done on the client side.

or-else commented 3 years ago

> What about using full-text search from the databases?

RethinkDB does not have it at all. Mongo has no support for CJK - it can't split words. FTS in all three databases is mostly useless for heavily inflected languages.

So, it can be done for English with MySQL and maybe with Mongo but it will suck.

or-else commented 3 years ago

Elastic, Sphinx, or Solr is not a bad idea.

ice-myles commented 1 year ago

Are there any planned release dates for the full-text search and encryption at rest features? They are shown here in the planned section.

or-else commented 1 year ago

No. @ice-myles are you willing to help?