patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation), aimed at producing a set of items that can be inserted into a strus storage. Some functions for analyzing tokens or phrases of a strus query are also provided.
http://www.project-strus.net
Mozilla Public License 2.0

multi-valued attributes are not supported #42

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

I have a field 'subject' with the value:

 Passing (Identity) -- Fiction|Legal stories|Infants switched at birth -- Fiction|Missouri 
--Fiction|Trials (Murder) -- Fiction|Race relations -- Fiction|Impostors and imposture -- Fiction

When splitting it with:

    subject = orig regex("([^\|]+)") subject;   

only the last subject is inserted into the index.

What is the strategy to have multi-valued fields?
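The intent of the configuration above can be checked outside of strus. The following sketch (plain Python, not strus) applies the same pattern to the example value and shows that the regex itself yields one match per pipe-separated subject, i.e. the multi-valued result the configuration seems to expect:

```python
import re

# The example 'subject' value from above, pipe-separated.
subject = ("Passing (Identity) -- Fiction|Legal stories|"
           "Infants switched at birth -- Fiction|Missouri --Fiction|"
           "Trials (Murder) -- Fiction|Race relations -- Fiction|"
           "Impostors and imposture -- Fiction")

# Same pattern as in the analyzer configuration: one match per
# run of characters that are not '|'.
values = [v.strip() for v in re.findall(r"([^|]+)", subject)]
print(len(values))  # 7 separate subject values
```

So the pattern produces all seven values; the question is how to get the analyzer to keep them all instead of only the last one.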

andreasbaumann commented 7 years ago

A possible application is a set of strings. We don't want to work around the problem by inserting a blob like 'A B|C E' into the attribute, because then we cannot use features like matching attributes in a summarizer.

patrickfrey commented 7 years ago

I will think about it.

andreasbaumann commented 7 years ago

Possible alternative:

example:

{"categories":["Hardware","NAS","Linux"]}

create a concat('|') function and generate an attribute like:

"Hardware|NAS|Linux"
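To make the proposal concrete, here is a minimal sketch of the transformation in plain Python. Note that `concat('|')` does not exist in strus at this point; this only illustrates what the proposed function would do, outside the analyzer:

```python
import json

# The example document from above.
doc = json.loads('{"categories":["Hardware","NAS","Linux"]}')

# What a hypothetical concat('|') would produce as the attribute value.
attribute = "|".join(doc["categories"])
print(attribute)  # Hardware|NAS|Linux

# The individual values can be recovered later by splitting on '|',
# as long as '|' never occurs inside a category name.
assert attribute.split("|") == ["Hardware", "NAS", "Linux"]
```

The obvious caveat is the choice of separator: it must be a character that cannot appear in the values themselves.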
patrickfrey commented 7 years ago

This is worth considering, but difficult. The segmenter treats each element of ["Hardware","NAS","Linux"] as a separate segment. Currently, the only mechanisms for joining elements across segment borders are concatenation before tokenization, which is intended for things like language detection and NLP, and pattern matching. Only the latter would be sane for this purpose, but even pattern matching would be a hack. We should consider more possibilities for assembling new elements in the analyzer.

At this point, I do not see a proper solution to the problem.