patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal of producing a set of items that can be inserted into a strus storage. Some functions for analysing tokens or phrases of the strus query are also provided.
http://www.project-strus.net
Mozilla Public License 2.0

How to split document in tokens in a mixed tagged format #53

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

Using:

[SearchIndex]
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();

[ForwardIndex]
    text = orig split /posts/post/body//para();

I get:

6 text 'Using'
7 text 'a'
8 text 'static'
9 text 'HTML'
10 text 'generator'
11 text 'now'
12 text 'called'
13 text 'Hugo'
14 text '.'
15 text 'Before'
16 text 'I'
17 text 'used'
18 text 'HTML'
19 text 'and'
20 text 'server-side-includes.'
23 text 'Synchronization'
24 text 'is'
25 text 'done'
26 text 'with'
27 text 'rsync'
28 text 'over'
29 text 'ssh.'

The documentation says it's a split on whitespace. Why do I sometimes get '.' and sometimes 'word.'?

Does it depend on the way I'm analyzing for the search index?

andreasbaumann commented 7 years ago

Ah: the original text contains tags:

<para>
  Using a static HTML generator now called
  <ulink url="https://gohugo.io/">Hugo</ulink>. Before I used HTML and
  server-side-includes. Synchronization is done with rsync over ssh. If
  you ask yourselves, why no CMS, well, the two wikis/CMS I had before
  (I don't mention names) were hacked in no time. And don't want to
  spend any time doing security updates all the time.
</para>

So the single '.' is the one I get right after a </ulink>. Otherwise split does indeed separate by whitespace.
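
If I read this correctly (my assumption, inferred from the output above, not checked against the segmenter code), the selector /posts/post/body//para() delivers this paragraph as three separate segments, because the <ulink> element interrupts the character data:

    "Using a static HTML generator now called"     (text before <ulink>)
    "Hugo"                                         (content of <ulink>)
    ". Before I used HTML and server-side-includes. Synchronization is done with rsync over ssh. ..."   (text after </ulink>)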

patrickfrey commented 7 years ago

Tokens crossing segment borders are always split.
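
A minimal sketch of what that means for the example above, assuming the three segments listed in the previous comment (plain Python for illustration, not the strusAnalyzer code): the split tokenizer works on one segment at a time, so a token can never contain text from two segments, and the '.' after </ulink> becomes a token of its own while 'server-side-includes.' keeps its dot.

    # Illustration only: per-segment whitespace split, as described above.
    # The segment strings are taken from the quoted <para> element.
    segments = [
        "Using a static HTML generator now called",  # text before <ulink>
        "Hugo",                                      # content of <ulink>
        ". Before I used HTML and server-side-includes. Synchronization is done with rsync over ssh.",
    ]

    for seg in segments:
        print(seg.split())

    # ['Using', 'a', 'static', 'HTML', 'generator', 'now', 'called']
    # ['Hugo']
    # ['.', 'Before', 'I', 'used', 'HTML', 'and', 'server-side-includes.', 'Synchronization', 'is', 'done', 'with', 'rsync', 'over', 'ssh.']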