marklogic-community / marklogic-samplestack

A sample implementation of the MarkLogic Reference Architecture
Apache License 2.0

Snippet length needs to be controlled/shorter #511

Open yawitz opened 9 years ago

yawitz commented 9 years ago

Copied over from https://github.com/marklogic/marklogic-samplestack/issues/371#issuecomment-70585614

Q: Writing reusable codes in xquery and starting xquery / marklogic

[A] ...'t have a module clause). MarkLogic does provide an interesting construct,... ...supported by the latest version of MarkLogic) provides similar provisions like dynamic... ...building typical and complete webapps with MarkLogic, there are quite a number... ..., on which http://developer.marklogic.com is based. There is... ...in building REST api's in MarkLogic. In that case MarkLogic 6 has built-in features,... ...all available on http://developer.marklogic.com HTH! ... [Q] I am a starter in Marklogic and Xquery. - - -... ...interface etc) possible in xquery with marklogic. 2. Where to start from... ...'building my Hello World application in MarkLogic / XQuery'? - - -... ... [Q] ...in xquery and starting xquery / marklo

The snippet shown for the first A is longer than we want. Others are about right. We want to be able to limit each portion (Q or A reference) to a given char or word count.

grechaw commented 9 years ago

This is almost certainly either an RFE or a bug in the Search API for JSON snippeting. Either way, it's on me to get a bug into bugtrack about it and mark this as external when that happens.

grechaw commented 9 years ago

I've got a fix ready for bugtrack number 31700. There is indeed a bug in JSON snippeting in 8.0-1 that needs fixing. When 8.0-2 ships, samplestack will probably work with it, in which case this issue will simply be fixed.

In light of that, I'm going to take the milestone off of this bug, since fixing this in the middle tier would be work that would only last in a release for a couple of weeks, and there are higher priority bugfixes to accomplish.

I can show you, @yawitz, what the snippets look like on a nightly sometime (once 31700 is checked in); I think you'll be pleased.

I'll pass to @wooldridge so that he's aware of this issue for node. I'll also mark external.

grechaw commented 9 years ago

@wooldridge I'm just giving you this so you're aware of it. I don't think it matters much who holds it though.

wooldridge commented 9 years ago

I just checked the Samplestack develop branch running on the latest nightly and the snippets looked appropriately short in length. It appears the issue is fixed via the bug fix on ML that @grechaw describes. I will review with @yawitz and if it looks good we can move to QA.

laurelnaiad commented 9 years ago

Thanks, Mike. Let's make sure that everyone keeps in mind that the fix comes in with use of ML 8.0-2, so testing it requires switching to the nightly...

wooldridge commented 9 years ago

Looked over the site on webex with @yawitz and compared the snippets to what's in the wireframes. We agreed that the snippets are currently too short.

I tried upping the "max-snippet-chars" property in questions.json to rectify this, and we do get better results with higher values.

"max-snippet-chars" at 100 (original value): snippets_100
"max-snippet-chars" at 500: snippets_500
"max-snippet-chars" at 1000: snippets_1000
"max-snippet-chars" at 2000: snippets_2000
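For context, these experiments amount to editing the snippeting options in questions.json. A rough sketch of what the relevant fragment might look like (the option names come from the MarkLogic Search API; the exact surrounding structure is an assumption here, not copied from samplestack):

```json
"transform-results": {
  "apply": "snippet",
  "per-match-tokens": 12,
  "max-matches": 4,
  "max-snippet-chars": 500
}
```

Changing the max-snippet-chars value and re-running the search is what produced the screenshots above.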

However, even at the high settings, some of the snippets are still a bit short. @grechaw, is there any way you know of to improve this?

@yawitz can you look at the above screen shots and pick a setting to go with?

grechaw commented 9 years ago

Looks like per-match-tokens might help with the too-short ones...

yawitz commented 9 years ago

I would start with a setting of 150, but then figure out why we're still getting too-short ones (which don't seem to be helped by very large settings).

wooldridge commented 9 years ago

The per-match-tokens does help a bit; here's a screenshot with max-snippet-chars at 150 and per-match-tokens at 18 (up from 12). But there are still some shorter groups of snippets. @yawitz we can fiddle with these settings more via webex if you'd like. snippets_150_18

yawitz commented 9 years ago

What is the setting for the number of snippets? I see 2, 3, and 4 above. As for the too-short snippets (e.g. in the 3rd item above), we should get Charles involved to perhaps explain what's going on before we start futzing with settings that may or may not do what we want.

wooldridge commented 9 years ago

The result with two snippets above includes all the instances of the search term ("html") within the two snippets. We have max-matches set to 4, so up to four snippets may appear.

yawitz commented 9 years ago

OK, that's helpful. Let's set the max-matches to 3. Can @grechaw comment on why some of the snippets are shorter than specified?

grechaw commented 9 years ago

I think what we're seeing is a really big variation in the length of a token, for the purposes of per-match-tokens. If the whole snippet is short words, the snippet looks a lot shorter than if one contains a big long path. That's what it looks like to me anyhow. Feel free to adjust the settings however, though -- it's really supposed to be app-level configuration for the Search API, and it affects the two tiers identically.

grechaw commented 9 years ago

I'll send to @wooldridge to set the max-matches.

yawitz commented 9 years ago

What's the interaction between per-match-tokens and max-snippet-chars? With the latter apparently set to 150 in the most recent screenshots (above), I see (in the 4th item) a snippet way longer than 150 chars. Unless I don't understand what a "snippet" is here. Is it everything following the [Q] or [A] prefix, or the string between the ellipses? If the latter, do we have a way of specifying how many ellipsis-delimited strings we include, or the max chars for each [Q] or [A] section? Perhaps a realtime chat would help clarify what's happening here.

wooldridge commented 9 years ago

To reiterate where we're at...

The max-matches controls the number of questions and answers for a snippet in a result. You will see up to that many questions and answers in the snippets of a result. I can set that to 3 as requested by @yawitz; we're good there.

The per-match-tokens controls the size of the ellipsis sections, and that works consistently in my tests. @yawitz, you mentioned a too-short example in a comment above, but I checked that and it was for a result where the search text was "html" and the ellipsis section was a URL at the end of a comment. So there was no trailing content to show. The example shows 9 preceding words, which is consistent with the per-match-tokens setting in that example of 18 (i.e., 8 or 9 words on each side of a term). @yawitz I believe you were OK with 12 (the original setting) for per-match-tokens.

It is still unclear to me what the max-snippet-chars corresponds to. In my tests, it doesn't correspond to the total snippet length for a given result, as @yawitz also notes above. Nor does it correspond to the total length of the snippet content for a question or answer, or the total length of the text between the ellipses.

I haven't discovered a way to control the number of ellipsis sections in a question or answer. I suspect we can't control that. @grechaw can you confirm?

Given all this information (and the discussions above), I'm attaching a screen shot for the following settings:

max-matches: 3
per-match-tokens: 12
max-snippet-chars: 150

This seems to give us the best snippet results given the constraints offered by the available settings. @yawitz if you're OK with this I can put it in a PR.

snippets_150_3_12

yawitz commented 9 years ago

I think this is basically in the ballpark. Charles, I'm assigning this to you for comment; please let us know if we can control the total snippet length per Q or A excerpt. Otherwise, if this is the best we can do, then please pass it back to Mike for closure.

(I remain surprised that we don't have the control we need for overall length. Is this a gap in the snippeting function? Is it worth filing an RFE for it?)

grechaw commented 9 years ago

I think these are all of the controls now available -- per-match-tokens is what's intended to control the individual match length, but its results are variable for things with big paths in them, it seems.

It's very much worth bringing up what works and doesn't for your design principles to @ehennum for his 8.0-3 JSearch planning.

grechaw commented 9 years ago

So it looks like there's just one end to tie up in a PR -- back to you, Mike.

yawitz commented 9 years ago

I just spoke to Erik about this, and his explanation for XML snippeting doesn't explain the problem we're having limiting the total length via the available settings. I'll set up a meeting for after MLW with all of us (Erik, Charles, Mike, Daphne) to discuss.

wooldridge commented 9 years ago

To test, run a Samplestack search with text in the top search box (e.g., "javascript"):

  1. In the snippet content below each result title, there should be up to three question and answer sections total.
  2. There should be approximately 12 words between the sets of ellipses ("..."). (This won't always be the case depending on the context of a search term. Just make sure you're not seeing lots of strings of more than 12 words.)

yawitz commented 9 years ago

Note that, according to Erik, spaces and punctuation are also counted as tokens for this particular case. (More reason for all of us to get together and discuss, to clear up confusions such as these.)
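To make that concrete, here is a rough, hypothetical illustration in plain JavaScript. This is not MarkLogic's actual tokenizer (that's cts.tokenize on the server); it just shows why a per-match-tokens budget buys far fewer visible words than expected when spaces and punctuation each count as a token, especially around URLs and code:

```javascript
// Hypothetical sketch -- NOT MarkLogic's cts.tokenize. It splits text into
// word runs, whitespace runs, and single punctuation marks, the way a
// snippeter that counts spaces and punctuation as tokens would see it.
function roughTokens(text) {
  return text.match(/\w+|\s+|[^\w\s]/g) || [];
}

const tokens = roughTokens("see http://developer.marklogic.com for docs");
const wordTokens = tokens.filter(t => /\w/.test(t));

// A URL burns through the token budget: many of the tokens are punctuation
// and whitespace, so comparatively few visible words survive in the snippet.
console.log(tokens.length, wordTokens.length);
```

Under this toy tokenization, the URL-heavy phrase produces roughly twice as many tokens as visible words, which matches the observation above that punctuation-dense snippets come out looking short.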

laurelnaiad commented 9 years ago

There are tokens that markdown uses which I'm sure we'd prefer the snippeting algorithm to ignore for purposes of counting and making boundary decisions. Is there a way to put characters into an ignore list for snippeting purposes?

laurelnaiad commented 9 years ago

The backtick, *, -, #, and _ come to mind, but if even spaces are counted as tokens and we're dealing with software code, then our counts are going to be way off...

ehennum commented 9 years ago

FWIW, the Search API uses cts.token() for tokenization, which is probably faster than a custom regex.

yawitz commented 9 years ago

I suspect that getting the token count "just right" is less important than successfully limiting the overall character count in the document snippet set. Something we can discuss in the meeting I just set up (if folks are too busy to continue to drill down on this right now).

wooldridge commented 9 years ago

I evaluated various Samplestack snippet settings running on MarkLogic version 8.0-2 and later:

https://wiki.marklogic.com/pages/viewpage.action?pageId=38015275

@yawitz decided that the following settings will give us snippets of the appropriate size:

max-snippet-chars: 500
max-matches: 3
per-match-tokens: 12

These settings are defined in: database/options/questions.json
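Putting the decision together, the chosen options would land in the transform-results section of that file, roughly like this (the surrounding structure is an assumption based on the MarkLogic Search API options format; only the three values come from this thread):

```json
{
  "options": {
    "transform-results": {
      "apply": "snippet",
      "max-snippet-chars": 500,
      "max-matches": 3,
      "per-match-tokens": 12
    }
  }
}
```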

I can submit a PR for this to the appropriate branch when the time is right.