patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal to get a set of items that can be inserted into a strus storage. Also some functions for analysing tokens or phrases of the strus query are provided.
http://www.project-strus.net
Mozilla Public License 2.0

using nested segmenters and positions #49

Closed andreasbaumann closed 7 years ago

andreasbaumann commented 7 years ago

I have a JSON subsection inside an XML segmenter. How can I make sure that the indexed JSON fields (which, like title, have no defined order) get a specific order and come before another section of the XML?

example:

<meta>
{"title":"Andreas Baumann's Personal Home Page"}
</meta>
<body>
<para>
  Using a static HTML generator now called

vs.

<meta>
{"categories":["Hardware","NAS","Linux"],"date":"2017-01-21T14:10:11+01:00","thumbnail":"/images/blog/a-nas-tale/a-nas-tale.png","title":"A NAS tale"}
</meta>
<body>
<para>
  In August 2009 I decided it was time to replace my old Pentium II

the analyzer configuration looks as follows:

[SearchIndex]
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/meta()/title();
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();

[ForwardIndex]
    title = orig split /posts/post/meta()/title();
    text = orig split /posts/post/body//para();

I see:

1 title 'Andreas'
2 title 'Baumann's'
3 title 'Personal'
4 title 'Home'
5 title 'Page'
6 text 'Using'
7 text 'a'
8 text 'static'

which is ok and:

1 text 'In'
2 text 'August'
3 text '2009'
4 text 'I'
5 text 'decided'
6 text 'it'
7 text 'was'
8 text 'time'
...
63 text 'the'
64 text 'job.'
65 title 'A'
66 title 'NAS'
67 title 'tale'
68 text 'Almost'
69 text 'exactly'
70 text 'a'
...

What I would like to declare is that the text in para starts at a certain offset after all fields in meta.

patrickfrey commented 7 years ago

You want to have gaps? Why? Is it not possible to work with delimiters?

andreasbaumann commented 7 years ago

What would this look like in the example?

andreasbaumann commented 7 years ago

To me it is quite logical that the JSON content in the meta tag should not overlap position-wise with the body content:

<meta>
{"title":"This is a title"}
</meta>
<body>
<para>
 This is text

Positions like:

1 Title
1 This
2 is
2 is
3 a
3 text
4 title

seem quite illogical to me.

So having a parameter for steering the position in the document analyzer would be handy.

Of course I can add a separator between meta and body, but this would complicate matters IMHO.
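
The requested behaviour can be sketched as a post-processing step over (position, type, value) triples like the ones listed above. This is a hypothetical illustration in Python, not a strus feature: the function name `reorder_positions` and the `gap` parameter are assumptions made for this example only, and the sketch discards any gaps present in the original numbering.

```python
from typing import List, Tuple

Term = Tuple[int, str, str]  # (ordinal position, feature type, value)


def reorder_positions(terms: List[Term], first_type: str = "title",
                      gap: int = 0) -> List[Term]:
    """Renumber terms so that all terms of `first_type` precede the others.

    The relative order inside each group is preserved. `gap` unused
    positions are left between the two groups (hypothetical parameter).
    """
    ordered = sorted(terms, key=lambda t: t[0])
    head = [t for t in ordered if t[1] == first_type]
    tail = [t for t in ordered if t[1] != first_type]

    result: List[Term] = []
    pos = 1
    for _, ftype, value in head:
        result.append((pos, ftype, value))
        pos += 1
    if head:
        pos += gap  # leave room between the meta fields and the body
    for _, ftype, value in tail:
        result.append((pos, ftype, value))
        pos += 1
    return result


terms = [
    (1, "text", "In"), (2, "text", "August"),
    (3, "title", "A"), (4, "title", "NAS"), (5, "title", "tale"),
    (6, "text", "Almost"),
]
for term in reorder_positions(terms, gap=1):
    print(*term)
```

With `gap=1` the title terms take positions 1 to 3 and the text terms start at position 5.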

andreasbaumann commented 7 years ago

Actually, my example is wrong; here is the correct one:

forward index terms:
1 text 'In'
2 text 'August'
3 text '2009'
4 text 'I'
5 text 'decided'
6 text 'it'
...
69 text 'did'
70 text 'the'
71 text 'job.'
73 title 'A'
74 title 'NAS'
75 title 'tale'
76 text 'Almost'
77 text 'exactly'
78 text 'a'
79 text 'year'

The corresponding document looks like:

<meta>
{"categories":["Hardware","NAS","Linux"],"date":"2017-01-21T14:10:11+01:00","thumbnail":"/images/blog/a-nas-tale/a-nas-tale.png","title":"A NAS tale"}
</meta>
<body>
<para>
  In August 2009 I decided it was time to replace my old Pentium II
...

So the positions between segmenters get intermixed.

I checked again with the newest sources and the effect is the same.

The configuration is:

[SearchIndex]
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/meta()/title();
    word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();
    sentence = empty punctuation("en") /posts/post/body//para();

[ForwardIndex]
    title = orig split /posts/post/meta()/title();
    text = orig split /posts/post/body//para();

patrickfrey commented 7 years ago

Positions do not get intermixed, at least with the current version. Ordinal positions are assigned in ascending order of appearance in the document. There is currently no other way to steer ordinal position assignments except by binding positions to predecessor and successor elements. You can use structural elements like end of sentence or end of tag in expressions to suppress matches crossing structure boundaries. Explicit position assignment does not make sense except as a hack if structure elements are not available. The performance argument applies only to simple retrieval models. When contextual relations come into play, you have no way around structure elements.
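
The idea of relying on structural delimiters rather than position gaps can be illustrated with a small, self-contained Python sketch. It does not use the strus API: `assign_positions`, `within`, and the `<eos>` marker token are hypothetical names, and giving the marker its own ordinal position is a simplifying assumption of the sketch.

```python
from typing import Dict, List, Tuple


def assign_positions(tokens: List[str]) -> Dict[str, List[int]]:
    """Assign ordinal positions 1, 2, 3, ... in order of appearance."""
    index: Dict[str, List[int]] = {}
    for pos, tok in enumerate(tokens, start=1):
        index.setdefault(tok, []).append(pos)
    return index


def within(index: Dict[str, List[int]], first: str, second: str,
           max_dist: int, delimiter: str) -> List[Tuple[int, int]]:
    """Find `first` followed by `second` within `max_dist` positions.

    A candidate pair is rejected if a delimiter lies strictly between
    the two positions, so matches never cross a structure boundary.
    """
    delims = index.get(delimiter, [])
    matches = []
    for p1 in index.get(first, []):
        for p2 in index.get(second, []):
            in_range = p1 < p2 <= p1 + max_dist
            crosses = any(p1 < d < p2 for d in delims)
            if in_range and not crosses:
                matches.append((p1, p2))
    return matches


# Title tokens, a structural marker, then the first body tokens.
tokens = ["a", "nas", "tale", "<eos>", "in", "august", "2009"]
index = assign_positions(tokens)

print(within(index, "nas", "tale", 3, "<eos>"))  # both inside the title
print(within(index, "tale", "in", 3, "<eos>"))   # would cross the marker
```

The first query matches inside the title, while the second is rejected because the marker sits between the two terms, even though they are only two positions apart.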

andreasbaumann commented 7 years ago

After complete recompilation I still get the same behaviour.

andreasbaumann commented 7 years ago

strusAnalyze document.ana posts.xml |& less

skip to second NAS, the one in the forward index:

...
71 text 'job.'
72 title 'A'
73 title 'NAS'
74 title 'tale'
76 text 'Almost'

democase.zip

In most cases, the title features precede the text features. The exception is when title is not the first metadata field in the JSON subsegmenter.
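
To check a larger collection for this effect, the analyzer output can be scanned for title terms that appear after the first text term. The following Python sketch assumes the line format shown in the excerpts above (position, feature type, quoted value); the helper names `parse_terms` and `misplaced` are hypothetical, and lines that do not fit the pattern are skipped.

```python
import re
from typing import List, Tuple

Term = Tuple[int, str, str]  # (ordinal position, feature type, value)

# Matches lines such as: 72 title 'A'
LINE_RE = re.compile(r"^(\d+)\s+(\w+)\s+'(.*)'$")


def parse_terms(output: str) -> List[Term]:
    """Extract (position, type, value) triples from analyzer output."""
    terms = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            terms.append((int(m.group(1)), m.group(2), m.group(3)))
    return terms


def misplaced(terms: List[Term], first_type: str = "title") -> List[Term]:
    """Return terms of `first_type` positioned after the first other term."""
    others = [pos for pos, ftype, _ in terms if ftype != first_type]
    if not others:
        return []
    first_other = min(others)
    return [t for t in terms if t[1] == first_type and t[0] > first_other]


sample = """\
...
71 text 'job.'
72 title 'A'
73 title 'NAS'
74 title 'tale'
76 text 'Almost'
"""

for term in misplaced(parse_terms(sample)):
    print(*term)
```

For the excerpt above this reports the three title terms at positions 72 to 74, since they follow the text term at position 71.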