❓ Per-field tokenization (question for the community) ❓

sanikolaev commented 3 months ago

Since the beginning, Sphinx and Manticore have not offered per-field tokenization settings (except for morphology_skip_fields and infix/prefix_fields), and it seems that there hasn't been much concern about this. On the other hand, if Manticore were to introduce this functionality, it would simplify certain use cases that require different tokenization, such as:

Storing titles/descriptions along with SKU numbers (e.g., ABC-12345-S-BL).
Managing titles/descriptions and email/IP addresses in the same table.
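
To make the problem concrete, here is a minimal Python sketch of what a single table-wide tokenizer does to both kinds of data. The `tokenize` function and its word-character rule are assumptions made for illustration; this is not Manticore's tokenizer, only a stand-in that splits on every non-alphanumeric character.

```python
import re

# Assumption: a simplified table-wide tokenizer that treats only ASCII
# letters and digits as word characters. It is NOT Manticore's tokenizer.
WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it on every non-alphanumeric character."""
    return WORD_RE.findall(text.lower())


# The same rules are applied to a title and to a SKU.
title_tokens = tokenize("Slim-fit shirt, blue")
sku_tokens = tokenize("ABC-12345-S-BL")

print(title_tokens)  # the title splits into ordinary words, which is what we want
print(sku_tokens)    # the SKU is broken into fragments such as "s" and "bl"
```

Splitting on the hyphen suits the title but turns the SKU into short fragments that can match unrelated documents, which is the kind of conflict per-field settings would be meant to resolve.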

It would be interesting to know if the community considers it important to implement per-field tokenization settings in Manticore, similar to how it works in Elasticsearch and Solr, which allow tokenization settings to be specified for each field.

Furthermore, I'm curious how those who have been using Manticore for years have addressed this issue. I'm going to personally ask some Manticore users to provide feedback.

nickchomey commented 3 months ago

Could you please elaborate on what sorts of tokenization settings might be available on a per-field basis and some of the use cases/advantages for it?

sanikolaev commented 3 months ago

I think all the available tokenization settings would become per-field in this case:

charset_table
morphology
blend_chars
ignore_chars
stopwords
exceptions
wordforms

etc.
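
As a rough illustration of what "per-field" could mean for these settings, the sketch below resolves the effective settings for a field by starting from table-level defaults and applying overrides. Everything here is hypothetical: `TokenizerSettings`, `FIELD_OVERRIDES`, and `settings_for` are invented names, and nothing in this thread says Manticore would expose the feature this way.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TokenizerSettings:
    # Attribute names mirror the option names listed above; values are simplified.
    charset_table: str = "non_cjk"
    morphology: str = "none"
    blend_chars: str = ""
    ignore_chars: str = ""
    stopwords: frozenset = frozenset()


# Table-level settings, which is the only level available today.
TABLE_DEFAULTS = TokenizerSettings(
    morphology="stem_en",
    stopwords=frozenset({"the", "a"}),
)

# Hypothetical per-field overrides: only the options that differ are listed.
FIELD_OVERRIDES = {
    "sku": {"morphology": "none", "blend_chars": "-", "stopwords": frozenset()},
}


def settings_for(field_name):
    """Return the table defaults with the field's overrides applied, if any."""
    return replace(TABLE_DEFAULTS, **FIELD_OVERRIDES.get(field_name, {}))


print(settings_for("title"))  # unchanged table defaults
print(settings_for("sku"))    # no stemming, no stopwords, "-" blended
```

A fallback design like this would keep existing tables working unchanged, since a field with no overrides simply inherits the table-level values.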

some of the use cases/advantages for it?

Inviting @superkelvint as I know he knows a lot about it.

unterninja commented 3 months ago

Just to make sure: the mentioned performance reduction would only apply to tables where this feature is used, not to all tables regardless of tokenization model?

sanikolaev commented 3 months ago

the mentioned performance reduction would only apply to tables where this feature is used

The performance reduction mentioned would likely apply only to tables that utilize this feature. We would do our best to maintain the current level of performance in other aspects.

superkelvint commented 3 months ago

Common fields which require non-fulltext treatment include:

Numeric Codes and Identifiers

ISBNs: Unique identifiers for books that should be searchable in their entirety.
SSNs (Social Security Numbers): For applications that require identity verification, SSNs need exact match searching without tokenization.
Vehicle Identification Numbers (VINs): Each VIN is unique to a specific vehicle and must be searched precisely.

IDs and Part numbers

Model Numbers: "Model XR-2000" should remain unaltered for exact model searches.
SKUs: e.g. "ELEC-12345-BLU", "SHOE-98765-M-8"
ASINs (Amazon Standard Identification Numbers): Unique blocks of letters and/or numbers for identifying items on Amazon, e.g. B0825K99RP
Part Numbers: "6E5-45371-01"
Electronic Component Identifiers: Unique codes used for electronic components in manufacturing and assembly, like resistors, capacitors, and integrated circuits, e.g. "ATMEGA328P-PU"

Internet

IP addresses
URLs
email addresses
Twitter hashtags and @ mentions: "#ThrowbackThursday" needs to be indexed as a single token for hashtag-based searches, and "@username" should be searchable as a distinct token to find mentions of specific users.
File system paths: c:\Users\MyDocuments or /home/user/documents

Legal

Legal Terms: "Ex post facto" should not be stemmed to preserve its specific legal context.
Case Names: "Roe v. Wade, 410 U.S. 113" must be tokenized as a whole entity for precise legal reference searching.
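
To show how such fields are typically handled when analyzers can be chosen per field, here is a small self-contained Python sketch. It is an assumption-laden toy, not Manticore or Lucene code: the analyzer functions, the `ANALYZERS` mapping, the in-memory index, and the two sample documents (including the second part number) are all made up for illustration.

```python
import re
from collections import defaultdict


def fulltext_analyzer(value):
    """Lowercase and split on non-alphanumerics (simplified full-text analysis)."""
    return re.findall(r"[a-z0-9]+", value.lower())


def keyword_analyzer(value):
    """Emit the whole trimmed, lowercased value as a single token."""
    return [value.strip().lower()]


# Hypothetical schema: which analyzer each field uses.
ANALYZERS = {"title": fulltext_analyzer, "part_number": keyword_analyzer}


def build_index(docs):
    """Map (field, token) to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field_name, value in doc.items():
            for token in ANALYZERS[field_name](value):
                index[(field_name, token)].add(doc_id)
    return index


def search(index, field_name, query):
    """Return ids of documents whose field contains every query token."""
    tokens = ANALYZERS[field_name](query)
    postings = [index.get((field_name, t), set()) for t in tokens]
    return set.intersection(*postings) if postings else set()


# Made-up sample documents.
docs = {
    1: {"title": "Outboard water pump housing", "part_number": "6E5-45371-01"},
    2: {"title": "Outboard impeller", "part_number": "6E5-44352-01"},
}
index = build_index(docs)

print(search(index, "title", "outboard"))             # both documents
print(search(index, "part_number", "6E5-45371-01"))   # only document 1
print(search(index, "part_number", "6E5"))            # nothing: no partial match
```

The query is analyzed with the same analyzer as the field it targets, so the part number only matches in its entirety while the title still behaves like ordinary full text.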

superkelvint commented 3 months ago

Perhaps also important to mention that for users planning to migrate from Lucene/Solr/Elasticsearch (like myself), not being able to specify analyzers per-field makes migrating extremely difficult because we are used to having this flexibility in Lucene-based systems and have therefore used this feature extensively.

Granted, Manticore does provide some support for this in the form of numeric, boolean, and date field types. But that is very basic compared to Lucene, and applications would very likely lose functionality when migrating to Manticore, which is a difficult pill to swallow.

ChrisHSandN commented 1 month ago

I came here to open a feature request for this specific feature (but spotted this post).

Our use case for Manticore means we want only a subset of our fields expanded with infixes.
We have always used dict=crc (since the early days of Sphinx), but reading the Manticore docs recently made dict=keywords sound appealing (extra wildcard characters, smaller indexes, etc.).
It was therefore disappointing to find that enabling this option disabled the ability to specify the infix_fields option.

sanikolaev commented 1 month ago

@ChrisHSandN

It was therefore disappointing to find that enabling this option disabled the ability to specify the infix_fields option

Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc, but to make queries to some fields not run in infix mode (probably with expand_keywords=1)? If so, adding support for it in dict=keywords mode shouldn't be a big deal (at least it seems so to me; I'd need to check with the devs).

ChrisHSandN commented 1 month ago

@sanikolaev

Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc

We have a large amount of data indexed and only have the resources (and requirement) to infix certain (short) selected fields.

I tested swapping one of the indexes from dict=crc to dict=keywords, and total .sp* file space increased 40%, from 3.2GB to 4.5GB (.spa + .spi went from 0.26GB to 0.46GB; as we are memory limited, these are the main limitation).

I was presuming this was due to dict=keywords infixing all the fields?
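
For a feel of why the number of infixed fields matters for size in general, the Python sketch below enumerates the substrings an infix search has to be able to match for one keyword. It is only a counting exercise under a simplifying assumption (every substring of at least `min_infix_len` characters counts as an infix); it does not model how Manticore stores infixes in either dict mode and does not confirm the presumption above.

```python
def infixes(keyword, min_infix_len):
    """Return every distinct substring of keyword with at least min_infix_len characters."""
    n = len(keyword)
    return {
        keyword[start:end]
        for start in range(n)
        for end in range(start + min_infix_len, n + 1)
    }


subs = infixes("abcdef", 2)
print(sorted(subs, key=lambda s: (len(s), s)))
print(len(subs))  # 15 substrings for a single 6-character keyword
```

The count grows roughly with the square of the keyword length, so restricting infixing to a few short fields can plausibly make a noticeable difference when resources are tight.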

sanikolaev commented 1 month ago

@ChrisHSandN

we want only a subset of our fields expanded with infixes. We have always used dict=crc

Please make sure it actually worked for you. Here's an example showing infix_fields doesn't take effect with dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='crc' infix_fields='f'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('@f abc*')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

The same with dict=keywords and min_infix_len works fine:

mysql> drop table if exists t; create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t where match('@f abc*')
--------------

+------+--------+------+
| id   | f      | f2   |
+------+--------+------+
|    1 | abcdef |      |
+------+--------+------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 1ms ---

The point is that you can't enable min_infix_len for dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' min_infix_len='2';
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text, f2 text) dict='crc' min_infix_len='2'
--------------

ERROR 1064 (42000): error adding table 't': RT tables support prefixes and infixes with only dict=keywords

So could it be that you thought infix_fields worked for you, but it actually didn't: infix search wasn't effective at all, and you didn't notice?
