projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0
51 stars 22 forks source link

FR: kwicLimit is not configurable yet #30

Closed martindholmes closed 4 years ago

martindholmes commented 4 years ago

We have an integer setting kwicLimit (the maximum number of kwics to be displayed for each document when showing results on the search page. The JS search object reads this from an attribute @data-kwicLimit on the <form> element. But currently there's nothing in the config file or the build process to set this attribute. I would like to make it settable, because if you elect to support phrasal search, you're already generating all the kwics exhaustively; there's no reason not to allow the site to display them if they exist. I noticed that if you don't do this, you can end up with puzzling results if you search for +[common word] +[unusual word], where the current default kwicLimit of 10 may contain no instances of [unusual word] at all, making it look like it wasn't found.

Since this could result in a massive list of kwics for some searches, we could set it up so the first X are shown, but the remainder available as a summary/detail expansion; or maybe we just don't care, and render everything that's there.

martindholmes commented 4 years ago

I think I have a good approach for this:

  1. Rename the config item <maxContexts> to <maxKwics>. This removes the confusion around the two uses of the word "context", one meaning a token and its surrounding text, and the other meaning a document context within which a kwic is constrained. We should use "context" only to refer to the latter, I think, and "kwic" to refer to the former.

  2. Pass the value of <maxKwics> through to the JS as @data-maxKwics on the <form> element, and have the JS read that.

  3. Rename the JS kwicLimit properties to maxKwics for consistency (the main StaticSearch object has one, and so does the ResultSet object, set by the main object).

In this scenario, that setting has two functions. First, if you don't have phrasal searches turned on, then it constrains the number of kwics retrieved for each token in a given document. If you are doing phrasal search, then we ignore it and harvest all kwics. Second, in the JS search page, it limits the number of kwics returned for each document in a result set. Even if you have set a limit to the number of kwics harvested, because you can search for multiple tokens, it's easy to have more kwics in your result for a particular document than the maxKwics value, so that constraint is useful.

The only thing I'm not sure about is what the default value should be. 10?

@joeytakeda What do you think?

joeytakeda commented 4 years ago

That sounds like a good plan to me. Should we add a schematron warning or a diagnostic if someone puts a <maxKwics> and has phrasal search turned on?

10 sounds fine as the default, but I think we shouldn't get rid of them out of the DOM but do this in CSS.

.kwic > li:nth-child($kwicLimit + n){
display:none;
}

We could then think about having a button of some sort that just overrides that CSS rule:

input.showAll:checked ~ .kwic > li{
display:block;
}
martindholmes commented 4 years ago

Even if you have phrasalSearch turned on, <maxKwics> has a purpose; it constrains the number of kwics that are shown in the search page for any given document. So I don't think we should issue a warning.

As far as DOM vs CSS is concerned, I think I agree that it would be better to make them switchable, but that might annoy a user who really wanted to constrain the number shown for some reason. In one sense it's easier from our point of view to handle this with CSS, but then we need to add a control to the page, and make it visible only when it's needed. Imagine a situation where there are 15 hit documents, and all have 10 kwics or less except the last one; that would mean that the Show All control would appear to have no effect, since the only list expanded would be offscreen. Or were you thinking that each list that needs one would have its own button?

joeytakeda commented 4 years ago

I was imagining that each list would have its own button, since that would make the whole solution pure CSS. Each input would control the list that is an immediately following sibling of it.

Is there a reason why the user would want to constraint the number of KWICs and have phrasal searches on? I can understand not wanting to show them all by default for readability, but I think it makes sense to have them all available if they are generated--as far as I understand it, if you have phrasal search turned on, then the JS has to get the whole object anyway, so you aren't downloading less or more if you hide some results. And, if the user really wanted to have phrasal searches turned on AND hide some number of results, they could just hide the button in their CSS.

martindholmes commented 4 years ago

Absolutely the user might want to constrain kwics even though they're using phrasal searches; I was just thinking I might want to do that myself for DVPP. The corollary is, though, that the kwics of MUST_CONTAINs and PHRASALs must come first in the list, otherwise you could end up in a situation where although you've specified phrases and pluses, the kwics you see actually only happen to contain the common word you didn't really care about but just added to the search. That might be a bit tricky, because kwics are added to the result set token by token.

joeytakeda commented 4 years ago

That’s a good point. Though, when I was saying “user” I really meant “implementer,” not the end user who searches. But I agree that a user may want to only show the first few results, but the complexities you mention above need to be thought through.

So I think a secondary FR would be to allow users to redefine the KWIC limit since they may want to see all of them or only one per; of course, this means that the JSONs would always have to include all KWICs.

But for this request, I still think that an end user should have the ability to see all results for a particular document, even if the implementer has decided a maximum number of 5 or whatever. If we do this in CSS, it can be a input with a label with a following sibling; if JS, then it’s just an added class to the KWIC that would force all to be displayed. What do you think?

martindholmes commented 4 years ago

I really wasn't thinking that the user would configure this; I was thinking that the implementer knows that (say) there are hundreds of kwics for common searches, so they decide that although they want all those kwics to be available to the searcher if required, they think it's more helpful to limit them initially. So they just set the maxKwics value in their config.

The implications for the ordering of kwics are complicated, though.

joeytakeda commented 4 years ago

Yes, okay--I think we were saying the same things, but I wasn't communicating it properly!

I can add the "Click to view all" feature tonight, if you'd like. Should be simple enough to implement.

martindholmes commented 4 years ago

There are a few things that need to happen to make it workable, including outputting the maxKwics value into the form element so that JS can read it, having the JS read it, having the StaticSearch object pass it on to the ResultSet object, then having the ResultSet object act on it appropriately by injecting the control when it's needed, and not when it's not. I think this work might be best done in a branch, after we've merged the current work into the master branch.

martindholmes commented 4 years ago

Before we do anything, we should first consider whether to split this config item into two. We could have maxKwicsToHarvest and maxKwicsToShow, both defaulting to 10; the first would control how many kwics are harvested for a given token in a given document (unless phrasalSearch is turned on), and the second would control how many kwics are shown by default for any given hit document retrieved by a search. They are actually two different things.

I can see reasons for having them at very different values; imagine that you have a large site and you don't want phrasal search, so you specify that you only want to harvest three kwics per token, to keep the JSON file sizes under control; but once you've pulled all the JSON anyway, if a given document has ten hits in it (from four different search terms, say), you might think there's no reason not to show them all. Conversely, you might want phrasal search, but want search results to be Google-like in terms of how much info you get from each hit document, so you set maxKwicsToShow to 3.

joeytakeda commented 4 years ago

Yes, I think splitting it into two seems like the right idea; I agree that they are separate things.

We'll still have to think through the sorting. Say we have

maxKwicsToShow: 3 maxKwicsToHarvest: 3

And you search (in WEA): little girl said [the three top terms]

and a document has 3 KWICs harvested for each term.

I expect a user would want these interleaved, i.e.

DISPLAYED KWIC 1: little DISPLAYED KWIC 2: girl DISPLAYED KWIC 3: said

But right now only the results for "little" appear in the current set up of 10 displayed: https://jenkins.hcmc.uvic.ca/job/WEA/lastSuccessfulBuild/artifact/products/site/search.html?q=little%20girl%20said Does that qualify as a bug for the current release?

martindholmes commented 4 years ago

I don't think it's a bug, because there's no default expectation for the user, IMHO. We may decide that we have a preferred order, but it's by no means clear what that should be -- interleave seems like it makes sense, but what if there are phrasal searches or MUST_CONTAINS? Aren't they more "important" somehow? What about the order in which the user typed the search terms -- does that represent the order they'd like to see results in?

I'm tempted to say that the default for kwicsToShow should be -1, meaning all of them, but if someone does constrain that value, then we'll have to figure out what sequence makes sense. Bear in mind too that most search systems are dumb about this; search Google for "japanese" "shodou" and look at the results. I think they're just returning the first two kwics whatever they happen to be, presumably in document order. (And actually, document order is the only thing that really makes sense when you think about it; but implementing that would require that we add an attribute to kwics that enabled us to determine it from the JSON.)

joeytakeda commented 4 years ago

I like the idea of document order-- we could assign a number to every <span> that we create and then put that number in the JSON. When we create the result object, we simply organize the kwics by the "order" property (or whatever we call it). But It would increase JSON size even more though and would probably make XSLT significantly slower if we use preceding-sibling:: (though we could probably use accumulators...).

martindholmes commented 4 years ago

The more I think about it, the more the document order approach makes sense. If every kwic comes with a "pos": 10 sort of value, the JS can easily sort them.

However, let's punt this forward to 0.8, because I'd like to experiment with it before we commit to it, because of the file size issue. We can implement it along with the show/hide feature. In the meantime, shall we go ahead with splitting the current maxContents thing into two?

joeytakeda commented 4 years ago

I’m inclined to wait to split the maxContexts thing to 0.8 so that all of the related changes are made in the same release. Seems cleaner to me.

martindholmes commented 4 years ago

I'd like 0.7 to be a usable release, though, so that we can pin some projects to it and leave them alone for a while. The current maxContents thing is really confusing...

martindholmes commented 4 years ago

I've implemented maxKwicsToHarvest and maxKwicsToShow in commit 807c8dc. I'm going to close this ticket and raise another one for the sequencing of kwic results by document order.