patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal to get a set of items that can be inserted into a strus storage. Also some functions for analysing tokens or phrases of the strus query are provided.
http://www.project-strus.net
Mozilla Public License 2.0

How to do conditional indexing per language? #55

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

For instance:

<DOC>
  <META>
    <LANGUAGE>en</LANGUAGE>
  </META>
  <TEXT>
    <P>This is a text.</P>
  </TEXT>
</DOC>

I would think of an analyzer configuration like:

[SearchIndex]
  word = lc:convdia(en) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="en"];
  word = lc:convdia(de) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="de"];

  stem = lc:convdia():stem(en) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="en"];
  stem = lc:convdia():stem(de) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="de"];

Of course I can always transform the document beforehand and push the language attribute into the TEXT or the P tag, for instance.
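A minimal sketch of that pre-transformation, using only Python's standard-library `xml.etree.ElementTree` (the element and attribute names follow the example document above; the `lang` attribute name is an assumption, not anything strus prescribes):

```python
import xml.etree.ElementTree as ET

DOC = """<DOC>
  <META>
    <LANGUAGE>en</LANGUAGE>
  </META>
  <TEXT>
    <P>This is a text.</P>
  </TEXT>
</DOC>"""

def push_language_attribute(xml_text):
    """Copy the META language value onto every P element as a
    'lang' attribute, so a streaming segmenter can select on it
    locally without a condition on a distant part of the document."""
    root = ET.fromstring(xml_text)
    lang = root.findtext("META/LANGUAGE")
    for p in root.iter("P"):
        p.set("lang", lang)
    return ET.tostring(root, encoding="unicode")

print(push_language_attribute(DOC))
# The P element now carries the language: <P lang="en">This is a text.</P>
```

After such a transformation, the analyzer expressions can select on the attribute of P itself instead of a condition referring back to /DOC/META/LANGUAGE.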

patrickfrey commented 7 years ago

Conditions on content are not provided by the standard XML segmenter, which is based on the textwolf library. Such conditions can only be handled properly by a segmenter built on an XML library with a DOM model. Implementations that patch some cases with a kind of backtracking cannot cope with pathological expressions, and we do not want to end up with a model that is not complete. Using an attribute for the language, as in the example of this issue, would make things clearer.

In order to deal with documents of this sort, we suggest writing a document segmenter implementation on top of a library with a DOM model; LibXML could be a candidate. Another possibility is to transform the document before analyzing it. Both solutions should be provided.
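To illustrate why a DOM model makes the condition trivial where a streaming segmenter struggles: with the whole document in memory, the predicate on /DOC/META/LANGUAGE can be evaluated before any /DOC/TEXT/P segment is emitted, regardless of where META appears in the document. A sketch (Python stdlib, not the strus segmenter interface; function and variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

def segment_if_language(xml_text, wanted_lang):
    """Emit the text of all /DOC/TEXT/P elements, but only if the
    document-level condition /DOC/META/LANGUAGE() == wanted_lang
    holds. With a DOM, the condition is checked once, up front --
    no backtracking over already-streamed content is needed."""
    root = ET.fromstring(xml_text)
    if root.findtext("META/LANGUAGE") != wanted_lang:
        return []  # condition fails: emit no segments at all
    return [p.text for p in root.iterfind("TEXT/P")]

doc = ("<DOC><META><LANGUAGE>en</LANGUAGE></META>"
       "<TEXT><P>This is a text.</P></TEXT></DOC>")
print(segment_if_language(doc, "en"))  # ['This is a text.']
print(segment_if_language(doc, "de"))  # []
```

A streaming segmenter would have had to either buffer all P segments until META is seen, or backtrack, which is exactly the incompleteness concern raised above.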