patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal of producing a set of items that can be inserted into a strus storage. Some functions for analysing tokens or phrases of the strus query are also provided.
http://www.project-strus.net
Mozilla Public License 2.0

How to split document in tokens in a mixed tagged format #53

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

Using:

[SearchIndex]
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();

[ForwardIndex]
    text = orig split /posts/post/body//para();

I get:

6 text 'Using'
7 text 'a'
8 text 'static'
9 text 'HTML'
10 text 'generator'
11 text 'now'
12 text 'called'
13 text 'Hugo'
14 text '.'
15 text 'Before'
16 text 'I'
17 text 'used'
18 text 'HTML'
19 text 'and'
20 text 'server-side-includes.'
23 text 'Synchronization'
24 text 'is'
25 text 'done'
26 text 'with'
27 text 'rsync'
28 text 'over'
29 text 'ssh.'

The documentation says it's a split on whitespace. Why do I sometimes get '.' and sometimes 'word.'?

Does it depend on the way I'm analyzing for the search index?

andreasbaumann commented 7 years ago

Ah: the original text contains tags:

<para>
  Using a static HTML generator now called
  <ulink url="https://gohugo.io/">Hugo</ulink>. Before I used HTML and
  server-side-includes. Synchronization is done with rsync over ssh. If
  you ask yourselves, why no CMS, well, the two wikis/CMS I had before
  (I don't mention names) were hacked in no time. And don't want to
  spend any time doing security updates all the time.
</para>

So the single '.' is the one I get right after a </ulink>. Otherwise split does indeed separate by whitespace.
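
If I read this correctly (my assumption, inferred from the output above, not checked against the segmenter code), the selector /posts/post/body//para() delivers this paragraph as three separate segments, because the <ulink> element interrupts the character data:

    "Using a static HTML generator now called"     (text before <ulink>)
    "Hugo"                                         (content of <ulink>)
    ". Before I used HTML and server-side-includes. Synchronization is done with rsync over ssh. ..."   (text after </ulink>)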

patrickfrey commented 7 years ago

Tokens crossing segment borders are always split.
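
A minimal sketch of what that means for the example above, assuming the three segments listed in the previous comment (plain Python for illustration, not the strusAnalyzer code): the split tokenizer works on one segment at a time, so a token can never contain text from two segments, and the '.' after </ulink> becomes a token of its own while 'server-side-includes.' keeps its dot.

    # Illustration only: per-segment whitespace split, as described above.
    # The segment strings are taken from the quoted <para> element.
    segments = [
        "Using a static HTML generator now called",  # text before <ulink>
        "Hugo",                                      # content of <ulink>
        ". Before I used HTML and server-side-includes. Synchronization is done with rsync over ssh.",
    ]

    for seg in segments:
        print(seg.split())

    # ['Using', 'a', 'static', 'HTML', 'generator', 'now', 'called']
    # ['Hugo']
    # ['.', 'Before', 'I', 'used', 'HTML', 'and', 'server-side-includes.', 'Synchronization', 'is', 'done', 'with', 'rsync', 'over', 'ssh.']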