Closed andreasbaumann closed 7 years ago
You want to have gaps? Why? Is it not possible to work with delimiters?
What would this look like in the example?
For me it's quite logical that the content in JSON format in the meta tag should not overlap position-wise with the content:
<meta>
{"title":"This is a title"}
</meta>
<body>
<para>
This is text
Positions like:
1 Title
1 This
2 is
2 is
3 a
3 text
4 title
seem quite illogical to me.
So having a parameter for steering the position in the document analyzer would be handy.
Of course I can add a separator between meta and body, but this would complicate matters IMHO.
Actually, my example is wrong; here is the correct one:
forward index terms:
1 text 'In'
2 text 'August'
3 text '2009'
4 text 'I'
5 text 'decided'
6 text 'it'
...
69 text 'did'
70 text 'the'
71 text 'job.'
73 title 'A'
74 title 'NAS'
75 title 'tale'
76 text 'Almost'
77 text 'exactly'
78 text 'a'
79 text 'year'
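To make the intermixing easier to spot, a listing like the one above can be collapsed into runs of consecutive feature types. The following is only a minimal sketch: it assumes the line shape shown in this thread (`<position> <type> '<value>'`), and `parse_terms` and `type_runs` are hypothetical helper names, not part of strus.

```python
import re
from itertools import groupby

# Matches lines in the shape shown in this thread: <position> <type> '<value>'
LINE_RE = re.compile(r"^(\d+)\s+(\w+)\s+'(.*)'$")


def parse_terms(listing):
    """Return (position, feature_type, value) tuples; lines that do not match are skipped."""
    terms = []
    for line in listing.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            terms.append((int(m.group(1)), m.group(2), m.group(3)))
    return terms


def type_runs(terms):
    """Collapse position-sorted terms into runs of (feature_type, first_pos, last_pos)."""
    ordered = sorted(terms, key=lambda t: t[0])
    runs = []
    for ftype, group in groupby(ordered, key=lambda t: t[1]):
        group = list(group)
        runs.append((ftype, group[0][0], group[-1][0]))
    return runs


# Excerpt copied from the forward index listing above.
listing = """\
69 text 'did'
70 text 'the'
71 text 'job.'
73 title 'A'
74 title 'NAS'
75 title 'tale'
76 text 'Almost'
77 text 'exactly'
"""

for run in type_runs(parse_terms(listing)):
    print(run)
```

A `title` run sandwiched between two `text` runs is the symptom described here.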
The corresponding document looks like:
<meta>
{"categories":["Hardware","NAS","Linux"],"date":"2017-01-21T14:10:11+01:00","thumbnail":"/images/blog/a-nas-tale/a-nas-tale.png","title":"A NAS tale"}
</meta>
<body>
<para>
In August 2009 I decided it was time to replace my old Pentium II
...
So the positions between segmenters get intermixed.
I checked again with the newest sources and the effect is the same.
The configuration is:
[SearchIndex]
word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/meta()/title();
word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();
sentence = empty punctuation("en") /posts/post/body//para();
[ForwardIndex]
title = orig split /posts/post/meta()/title();
text = orig split /posts/post/body//para();
Positions do not get intermixed, at least with the current version. Ordinal positions are assigned in ascending order of appearance in the document. There is currently no other way to steer ordinal position assignments except binding positions to predecessor and successor elements. You can use structural elements like end of sentence or end of tag in expressions to suppress matches that cross structure boundaries. Explicit position assignment does not make sense except as a hack when structure elements are not available. The performance argument applies only to simple retrieval models. Once contextual relations come into play, there is no way around structure elements.
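The rule "ascending order of appearance" can be illustrated with a toy model. This is a simplified sketch and not the strus implementation: the source offsets are invented, `assign_positions` is a hypothetical name, and the choice to let tokens at the same source offset share one ordinal position is an assumption made for the illustration.

```python
def assign_positions(tokens):
    """tokens: iterable of (source_offset, feature_type, value).

    Returns (ordinal_position, feature_type, value) tuples, counting from 1
    in ascending order of source offset. Tokens that share a source offset
    share one ordinal position (an assumption of this sketch).
    """
    result = []
    pos = 0
    last_offset = None
    for offset, ftype, value in sorted(tokens, key=lambda t: t[0]):
        if offset != last_offset:
            pos += 1
            last_offset = offset
        result.append((pos, ftype, value))
    return result


# Invented offsets: the title sits in <meta>, before the body text.
tokens = [
    (120, "text", "In"),
    (123, "text", "August"),
    (40, "title", "A"),
    (42, "title", "NAS"),
    (46, "title", "tale"),
]

for term in assign_positions(tokens):
    print(term)
```

Under this model the title features get the lowest positions simply because they appear first in the source, regardless of the order in which the analyzer configuration lists them.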
After complete recompilation I still get the same behaviour.
strusAnalyze document.ana posts.xml |& less
Skip to the second NAS, the one in the forward index:
...
71 text 'job.'
72 title 'A'
73 title 'NAS'
74 title 'tale'
76 text 'Almost'
In most cases, the title features precede the text features; the exception is when title is not the first metadata field in the JSON subsegmenter.
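Since the behaviour seems to depend on where title sits among the JSON fields, one possible workaround outside the analyzer is to rewrite the metadata so that title comes first before the document is fed in. This is only a sketch of that idea: `title_first` is a hypothetical helper, and whether the reordering actually avoids the effect has not been verified here.

```python
import json


def title_first(meta_json, first_keys=("title",)):
    """Return the JSON object re-serialized with first_keys moved to the front.

    The remaining keys keep their original relative order.
    """
    meta = json.loads(meta_json)
    ordered = {k: meta[k] for k in first_keys if k in meta}
    ordered.update((k, v) for k, v in meta.items() if k not in ordered)
    return json.dumps(ordered, ensure_ascii=False, separators=(",", ":"))


# Metadata copied from the example document in this thread.
meta = (
    '{"categories":["Hardware","NAS","Linux"],'
    '"date":"2017-01-21T14:10:11+01:00",'
    '"thumbnail":"/images/blog/a-nas-tale/a-nas-tale.png",'
    '"title":"A NAS tale"}'
)

print(title_first(meta))
```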
I have a JSON subsection inside an XML segmenter. How can I make sure that the indexed JSON fields (such as title, which have no defined order) get a specific order and come before another section in the XML? Example:
vs.
The analyzer configuration looks as follows:
I see:
which is ok and:
What I would like to declare is that the text in para starts at a certain offset after all fields in meta.
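The requested numbering can be written down as a post-processing step over analyzer output, which at least pins down what is being asked for. This is a hypothetical sketch, not an existing strus option: `meta_first` and its `gap` parameter are invented names, and the dense renumbering discards any gaps present in the original positions.

```python
def meta_first(terms, meta_types=("title",), gap=1):
    """terms: list of (position, feature_type, value).

    Renumbers so that all features of meta_types come first (positions 1..n),
    and all other features start at n + gap, keeping their relative order.
    """
    ordered = sorted(terms, key=lambda t: t[0])
    meta = [t for t in ordered if t[1] in meta_types]
    body = [t for t in ordered if t[1] not in meta_types]
    out = [(i, ftype, value) for i, (_, ftype, value) in enumerate(meta, start=1)]
    body_start = len(meta) + gap
    out += [(i, ftype, value) for i, (_, ftype, value) in enumerate(body, start=body_start)]
    return out


# Excerpt of the forward index terms shown earlier in this thread.
terms = [
    (70, "text", "the"),
    (71, "text", "job."),
    (73, "title", "A"),
    (74, "title", "NAS"),
    (75, "title", "tale"),
    (76, "text", "Almost"),
]

for term in meta_first(terms):
    print(term)
```

With `gap=1` the body follows the title directly; a larger `gap` leaves unused positions between the two, which is one way to keep proximity matches from spanning meta and body.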