@yawitz opened this issue 9 years ago
This is almost certainly either an RFE or a bug in the Search API for JSON snippeting. Either way, it's on me to get a bug into bugtrack about it, and mark this as external when that happens.
I've got a fix ready for bugtrack number 31700. There is indeed a bug in JSON snippeting in 8.0-1 that needs fixing. When 8.0-2 ships, Samplestack will probably work with it, in which case this issue will simply be fixed.
In light of that, I'm going to take the milestone off of this bug, since fixing this in the middle tier would be work that would only last in a release for a couple of weeks, and there are higher priority bugfixes to accomplish.
I can show you, @yawitz, what the snippets look like on a nightly sometime (once 31700 is checked in); I think you'll be pleased.
I'll pass to @wooldridge so that he's aware of this issue for node. I'll also mark external.
@wooldridge I'm just giving you this so you're aware of it. I don't think it matters much who holds it though.
I just checked the Samplestack develop branch running on the latest nightly, and the snippets looked appropriately short. It appears the issue is fixed via the bug fix on ML that @grechaw describes. I will review with @yawitz and if it looks good we can move to QA.
Thanks, Mike. Let's make sure that everyone keeps in mind that the fix comes in with use of ML 8.0-2, so testing it requires switching to the nightly...
Looked over the site on webex with @yawitz and compared the snippets to what's in the wireframes. We agreed that the snippets are currently too short.
I tried upping the "max-snippet-chars" property in questions.json to rectify this, and we do get better results with higher values.
"max-snippet-chars" at 100 (original value): "max-snippet-chars" at 500: "max-snippet-chars" at 1000: "max-snippet-chars" at 2000:
However, even at the high settings, some of the snippets are still a bit short. @grechaw is there any way you know of to improve this?
@yawitz can you look at the above screen shots and pick a setting to go with?
Looks like per-match-tokens might help with the too-short ones...
I would start with a setting of 150, but then figure out why we're still getting too-short ones (which don't seem to be helped by very large settings).
The per-match-tokens does help a bit; here's a screen shot of max-snippet-chars at 150 and per-match-tokens at 18 (up from 12). But there are still some shorter groups of snippets. @yawitz we can fiddle with these settings more via webex if you'd like.
What is the setting for number of snippets? I see 2, 3 and 4 above. As for the too-short snippets (e.g. in the 3rd item above), we should get Charles involved to perhaps explain what's possibly going on before we start futzing with settings that may or may not do what we want.
The result with two snippets above includes all the instances of the search term ("html") within the two snippets. We have max-matches set to 4, so there may appear up to four snippets.
OK, that's helpful. Let's set the max-matches to 3. Can @grechaw comment on why some of the snippets are shorter than specified?
I think what we're seeing is a really big variation in the length of a token, for the purposes of max-snippet-tokens. If the whole snippet is short words, the snippet looks a lot shorter than if one contains a big long path. That's what it looks like to me anyhow. Feel free to adjust the settings however, though -- it's really supposed to be app-level configuration for the search api, and it affects the two tiers identically.
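A quick illustration of the point above (this is not MarkLogic's tokenizer, just a sketch): a fixed token budget produces very different character lengths depending on whether the tokens are short words or one big path.

```javascript
// Illustration only: join the first `budget` tokens and measure the
// resulting character length of the snippet text.
function charLength(tokens, budget) {
  return tokens.slice(0, budget).join(' ').length;
}

const shortWords = 'we can set the max to a low value and it works'.split(' ');
const withPath = 'see /space/projects/samplestack/database/options/questions.json for details'.split(' ');

// Same 8-token budget, very different character counts.
console.log(charLength(shortWords, 8));
console.log(charLength(withPath, 8));
```

With the same budget, the version containing a long path comes out several times longer in characters, which matches the variation we're seeing.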
I'll send to @wooldridge to set the max-matches.
What's the interaction between per-match-tokens and max-snippet-chars? With the latter apparently set to 150 in the most recent screen shots (above), I see (in the 4th item) a snippet way longer than 150 chars. Unless I don't understand what a "snippet" is here. Is it everything following the [Q] or [A] prefix, or the string between the ellipses? If the latter, do we have a way of specifying how many ellipses-ed strings we include, or the max chars for each [Q] or [A] section? Perhaps a realtime chat would help clarify what's happening here.
To reiterate where we're at...
The max-matches controls the number of questions and answers for a snippet in a result. You will see up to that many questions and answers in the snippets of a result. I can set that to 3 as requested by @yawitz; we're good there.
The per-match-tokens controls the size of the ellipses sections and that works consistently in my tests. @yawitz, you mentioned a too-short example in a comment above but I checked that and it was for a result where the search text was "html" and the ellipsis section was a URL at the end of a comment. So there was no trailing content to show. The example shows 9 preceding words, which is consistent with the per-match-tokens setting in that example of 18 (i.e., 8 or 9 words on each side of a term). @yawitz I believe you were OK with 12 (the original setting) for per-match-tokens.
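A rough model of what per-match-tokens appears to do, based on the behavior described above (an illustration, not the actual Search API code): the budget is split across both sides of the match, so 18 tokens yields roughly 9 words on each side, and a match near the end of the text has little or no trailing context to show.

```javascript
// Illustration: take ~perMatchTokens/2 tokens on each side of the match.
function snippetWindow(words, matchIndex, perMatchTokens) {
  const half = Math.floor(perMatchTokens / 2);
  const start = Math.max(0, matchIndex - half);
  const end = Math.min(words.length, matchIndex + half + 1);
  return words.slice(start, end);
}

const words = 'one two three four five html'.split(' ');
// Match on the last word: only preceding context is available,
// which is why that snippet looked "too short".
console.log(snippetWindow(words, 5, 18).join(' '));
```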
It is still unclear to me what the max-snippet-chars corresponds to. In my tests, it doesn't correspond to the total snippet length for a given result, as @yawitz also notes above. Nor does it correspond to the total length of the snippet content for a question or answer, or the total length of the text between the ellipses.
I haven't discovered a way to control the number of ellipsis sections in a question or answer. I suspect we can't control that. @grechaw can you confirm?
Given all this information (and the discussions above), I'm attaching a screen shot for the following settings:
- max-matches: 3
- per-match-tokens: 12
- max-snippet-chars: 150
This seems to give us the best snippet results given the constraints offered by the available settings. @yawitz if you're OK with this I can put it in a PR.
I think this is basically in the ballpark. Charles, I'm assigning this to you for comment; please let us know if we can control the total snippet length per Q or A excerpt. Otherwise, if this is the best we can do, then please pass it back to Mike for closure.
(I remain surprised that we don't have the control we need for overall length. Is this a gap in the snippeting function? Is it worth filing an RFE for it?)
I think these are all of the controls now available -- the 'per-match-tokens' is what's intended to control the individual match length, but its results are variable for things with big paths in them, it seems.
It's very much worth bringing up what works and doesn't for your design principles to @ehennum for his 8.0-3 JSearch planning.
So it looks like just one end to tie up in a PR -- back to you, Mike.
I just spoke to Erik about this, and his explanation for XML snippeting doesn't explain the problem we're having limiting the total length via the available settings. I'll set up a meeting for after MLW with all of us (Erik, Charles, Mike, Daphne) to discuss.
To test, run a Samplestack search with text in the top search box (e.g., "javascript").
Note that, according to Erik, spaces and punctuation are also counted as tokens for this particular case. (More reason for all of us to get together and discuss, to clear up confusions such as these.)
There are tokens that markdown uses which I'm sure we'd prefer the snippeting algorithm to ignore for purposes of counting and making boundary decisions. Is there a way to put characters into an ignore list for snippeting purposes?
Backtick, `*`, `-`, `#`, and `_` come to mind, but if even spaces are counted as tokens and we're dealing with software code, then our counts are going to be way off....
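To see why the counts drift, here's a naive illustration (again, not the actual ML tokenizer) that treats runs of word characters, runs of whitespace, and individual punctuation characters each as a token; markdown markup inflates the count quickly:

```javascript
// Naive tokenizer for illustration: word runs, whitespace runs, and
// single punctuation characters each count as one token.
function naiveTokens(text) {
  return text.match(/\w+|\s+|[^\w\s]/g) || [];
}

const plain = 'limit the snippet length';
const markdown = '**limit** the `snippet` _length_';

// The markdown version carries the same four words but many more tokens.
console.log(naiveTokens(plain).length);
console.log(naiveTokens(markdown).length);
```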
FWIW, the Search API uses cts:tokenize() for tokenization, which is probably faster than a custom regex.
I suspect that getting the token count "just right" is less important than successfully limiting the overall character count in the document snippet set. Something we can discuss in the meeting I just set up (if folks are too busy to continue to drill down on this right now).
I evaluated various Samplestack snippet settings running on MarkLogic version 8.0-2 and later:
https://wiki.marklogic.com/pages/viewpage.action?pageId=38015275
@yawitz decided that the following settings will give us snippets of the appropriate size:
- max-snippet-chars: 500
- max-matches: 3
- per-match-tokens: 12
These settings are defined in: database/options/questions.json
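For reference, the chosen values would sit in the `transform-results` block of that file, roughly like this (a sketch; the rest of the options file is elided and the exact nesting may differ):

```json
"transform-results": {
  "apply": "snippet",
  "max-snippet-chars": 500,
  "max-matches": 3,
  "per-match-tokens": 12
}
```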
I can submit a PR for this to the appropriate branch when the time is right.
Copied over from https://github.com/marklogic/marklogic-samplestack/issues/371#issuecomment-70585614
The snippet shown for the first A is longer than we want. Others are about right. We want to be able to limit each portion (Q or A reference) to a given char or word count.