manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

❓ Per-field tokenization (question for the community) ❓ #2006

Open sanikolaev opened 3 months ago

sanikolaev commented 3 months ago

Since the beginning, Sphinx and Manticore have not offered per-field tokenization settings (except for morphology_skip_fields and infix/prefix_fields), and it seems that there hasn't been much concern about this. On the other hand, if Manticore were to introduce this functionality, it would simplify certain use cases that require different tokenization, such as:

It would be interesting to know if the community considers it important to implement per-field tokenization settings in Manticore, similar to how it works in Elasticsearch and SOLR, allowing for the specification of tokenization settings for each field.

Furthermore, I'm curious how those who have been using Manticore for years have addressed this issue. I'm going to personally ask some Manticore users to provide feedback.

nickchomey commented 3 months ago

Could you please elaborate on what sorts of tokenization settings might be available on a per-field basis and some of the use cases/advantages for it?

sanikolaev commented 3 months ago

I think all the available tokenization settings would become per-field in this case:

etc.

some of the use cases/advantages for it?

Inviting @superkelvint as I know he knows a lot about it.

unterninja commented 3 months ago

Just to make sure: the mentioned performance reduction would only apply to tables where this feature is used, not to all tables regardless of tokenization model?

sanikolaev commented 3 months ago

the mentioned performance reduction would only apply to tables where this feature is used

The performance reduction mentioned would likely apply only to tables that utilize this feature. We would do our best to maintain the current level of performance in other aspects.

superkelvint commented 3 months ago

Common fields which require non-fulltext treatment include:

Numeric Codes and Identifiers

IDs and Part numbers

Internet

Legal
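
To illustrate why identifier-like fields need different treatment, here is a minimal Python sketch. It is not Manticore code: the field names (`title`, `part_number`), the two tokenizer functions, and the sample document are all hypothetical, and the "full-text" tokenizer is a deliberately crude stand-in for a real one.

```python
import re


def fulltext_tokens(value: str) -> list[str]:
    """Crude full-text tokenizer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def keyword_tokens(value: str) -> list[str]:
    """Identifier tokenizer: keep the whole value as one case-folded token."""
    return [value.strip().lower()]


# Hypothetical per-field configuration: field name -> tokenizer.
FIELD_TOKENIZERS = {
    "title": fulltext_tokens,
    "part_number": keyword_tokens,
}


def tokenize_document(doc: dict[str, str]) -> dict[str, list[str]]:
    """Apply each field's own tokenizer instead of one table-wide setting."""
    return {field: FIELD_TOKENIZERS[field](text) for field, text in doc.items()}


doc = {"title": "Bracket AB-1234/X, steel", "part_number": "AB-1234/X"}
tokens = tokenize_document(doc)
print(tokens["title"])        # ['bracket', 'ab', '1234', 'x', 'steel']
print(tokens["part_number"])  # ['ab-1234/x']
```

With a single table-wide tokenizer, the part number would be split into `ab`, `1234` and `x`, so a search for the exact identifier could also match unrelated documents containing those fragments.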

superkelvint commented 3 months ago

Perhaps also important to mention that for users planning to migrate from Lucene/Solr/Elasticsearch (like myself), not being able to specify analyzers per-field makes migrating extremely difficult because we are used to having this flexibility in Lucene-based systems and have therefore used this feature extensively.

Granted, Manticore does provide some support for this in the form of numeric, boolean, and date field types. But that is very basic compared to Lucene, and applications would very likely lose functionality when migrating to Manticore, which is a bitter pill to swallow.

ChrisHSandN commented 1 month ago

I came here to open a feature request for this specific feature (but spotted this post).

sanikolaev commented 1 month ago

@ChrisHSandN

It was therefore disappointing to find enabling this option disabled the ability to specify infix_fields option

Do you mean you used infix_fields not just as a resource/performance optimization with dict=crc, but to make queries to some fields not run in infix mode (probably with expand_keywords=1)? If so, it shouldn't be a big deal to add support for it in dict=keywords mode (at least it seems so to me; I'd need to check with the devs).

ChrisHSandN commented 1 month ago

@sanikolaev

Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc

We have a large amount of data indexed and only have the resources (and requirement) to infix certain (short) selected fields.

I tested swapping one of the indexes from dict=crc to dict=keywords and total .sp* file space increased 40% from 3.2GB to 4.5GB (.spa + .spi went from 0.26GB to 0.46GB; as we are memory limited these are the main limitation).
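
For reference, the relative growth implied by the sizes quoted above can be checked with a few lines of Python. Only the four sizes come from this comment; the helper name `percent_increase` is made up for the sketch.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


total = percent_increase(3.2, 4.5)       # all .sp* files, in GB
spa_spi = percent_increase(0.26, 0.46)   # .spa + .spi only, in GB

print(f"all .sp* files: +{total:.1f}%")    # +40.6%
print(f".spa + .spi:    +{spa_spi:.1f}%")  # +76.9%
```

So while the total on-disk size grew by about 40%, the memory-relevant .spa + .spi portion grew by roughly 77%.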

I was presuming this was due to dict=keywords infixing all the fields?
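
As a rough intuition for why infixing more fields could cost space, the toy Python sketch below counts the distinct substrings of a single keyword. It says nothing about how Manticore actually stores infix data for dict=crc or dict=keywords, which is not described in this thread; the function `infixes` is purely illustrative.

```python
def infixes(word: str, min_len: int) -> set[str]:
    """All distinct substrings of `word` that are at least `min_len` long."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }


subs = infixes("abcdef", 2)
print(len(subs))                    # 15
print("abc" in subs, "bcd" in subs) # True True
```

One six-letter keyword already yields 15 searchable substrings at a minimum length of 2, so whatever structure supports infix search has more to keep track of for every additional infixed field.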

sanikolaev commented 1 month ago

@ChrisHSandN

we want only a subset of our fields expanded with infixes. We have always used dict=crc

Please make sure it actually worked for you. Here's an example showing infix_fields doesn't take effect with dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='crc' infix_fields='f'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('@f abc*')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

The same test with dict=keywords and min_infix_len works fine:

mysql> drop table if exists t; create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t where match('@f abc*')
--------------

+------+--------+------+
| id   | f      | f2   |
+------+--------+------+
|    1 | abcdef |      |
+------+--------+------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 1ms ---

The point is that you can't enable min_infix_len for dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' min_infix_len='2';
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text, f2 text) dict='crc' min_infix_len='2'
--------------

ERROR 1064 (42000): error adding table 't': RT tables support prefixes and infixes with only dict=keywords

So could it be that you thought infix_fields was working for you when it actually wasn't, meaning infix search had no effect at all and you didn't notice?