projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

Wildcard searches #48

Closed wsalesky closed 4 years ago

wsalesky commented 4 years ago

Hi @martindholmes and @joeytakeda I have been experimenting with staticSearch for the past month or so, and am planning to use it with a new project I am working on. Are there any plans to add wildcard search capabilities to the app? This is one of the features requested by my project. I'm happy to investigate adding this option myself if this is not on your radar. Do you have any tips on where I should get started in the code?

I am also adding diacritic-insensitive searches to our version of the app, as the project I am working with will be searching over multiple foreign languages. Is this a feature you would be interested in incorporating into the codebase?

Thanks for a great project (and the copious comments in the code).

martindholmes commented 4 years ago

Hi @wsalesky. Thanks for working with our code! It's great to know someone else is using it. We have it deployed in half a dozen projects at this point, but we still have a lot of work to do on it.

We are planning to add wildcard functionality, and about half of the work is done; we're already generating a json list of all the tokens, which will be the basis for it. What's not really decided yet is how it should interact with the phrasal search; if you do a phrasal search with wildcards, I'm not quite sure what the optimal approach is yet. We'll happily adopt this ticket to track the remaining work that needs to be done on it.

Diacritics are another thing we've thought about. We're not sure whether that sort of thing is best handled by customizing or replacing the stemmers, or whether there's a generic approach that would work across multiple languages. Should it be limited only to diacritics, or should OE thorn and eth be replaced by th, wynn by w, and so on? What about ligatures? We're already figuring out what to do about Early Modern English, since we have two major projects using texts from that period, and I think we're tending towards extending or enhancing the stemmers there.

joeytakeda commented 4 years ago

Thanks, @wsalesky, for the ticket! We've been talking about wildcards for some time now, so I'm glad you've brought it up.

We are planning to add wildcard functionality, and about half of the work is done; we're already generating a json list of all the tokens, which will be the basis for it. What's not really decided yet is how it should interact with the phrasal search; if you do a phrasal search with wildcards, I'm not quite sure what the optimal approach is yet. We'll happily adopt this ticket to track the remaining work that needs to be done on it.

Personally, I think that a wildcard makes no sense within a phrasal search and should simply be ignored and taken as literal (which probably won't provide good results). But I think the other types of searching--does not contain and must contain--are also hard. Is it feasible that someone would want to do a search like:

something +a????? -b???????

The easiest approach, of course, is that wildcards are only allowed in a "MAY CONTAIN". That might be too strict, but it would avoid many potential performance problems etc.

martindholmes commented 4 years ago

The biggest problem with the wildcard search is deciding how to constrain a potentially enormous search query. If someone searches for s* on the Map of London site, we would have to retrieve 5,369 JSON files, coming in at 48 MB. sa* gives us 515 files and 4 MB; sat* is 42 files and 228 KB. We will know when the search is entered how many files will have to be retrieved for any given wildcard token, but we can't know the total size of those files without actually retrieving them, so we have to make some kind of decision, based only on the number of matching tokens, about whether a search is practical or not, and give appropriate feedback.

We could insist on at least three letters in any wildcard token, and a maximum of (say) twenty index files to be retrieved for each token; any search more complicated than that could show an error message to the effect that the number of results is expected to be excessive so the search parameters would need to be refined. Does that sound reasonable?
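
For concreteness, a guard along those lines might look roughly like this in JavaScript; the function and variable names are hypothetical illustrations, not staticSearch's actual code:

// Sketch of the proposed constraint: require at least three literal characters
// in a wildcard term, and refuse to fetch more than twenty index files for it.
// wildcardTerm is the raw term (e.g. "sat*"); matchingTokens is the list of
// tokens from the token list that the wildcard expands to.
function wildcardSearchIsPractical(wildcardTerm, matchingTokens) {
  const MIN_LITERAL_CHARS = 3;
  const MAX_INDEX_FILES = 20;
  const literalChars = wildcardTerm.replace(/[*?\[\]]/g, '').length;
  return literalChars >= MIN_LITERAL_CHARS && matchingTokens.length <= MAX_INDEX_FILES;
}

If the function returned false, the search page would show the suggested message asking the user to refine the search.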

joeytakeda commented 4 years ago

Could we just send back the maximum number of hits? So if someone does s*, we'd only show the first twenty results, but have a message to the effect of "Showing first 20 results of a possible 515; perhaps refine your search?"?

I'll also note that we could add a file-size step to the processing (i.e. the final step is to weigh all of the JSONs and put those numbers in a map, so that we can tell if you're trying to retrieve a gigabyte of stuff, for instance), but we might be better off just limiting by number of files in any case.

martindholmes commented 4 years ago

Without retrieving all the results, we can't know how to rank them, so we'd just be returning a random selection of hits. Of all the s* tokens, we'd have to randomly choose a couple, on no basis at all.

Good idea on the file size, but I'm not sure whether there's any way to act on it. Statistically speaking, a larger file is more likely to contain a higher-value hit, but the true value of a hit document is only computable when you have the index files for all the search tokens, so you have to get them all anyway.

wsalesky commented 4 years ago

@martindholmes @joeytakeda That is great news about the wildcard functionality already being in the works! I would agree with @joeytakeda that wildcards in phrase searching do not really make much sense and would certainly not be a high priority for us.

In terms of diacritics I think it might be useful to have an easy way to plug in user customizations as different languages will have different requirements. The project I am working on spans Arabic, Aramaic, Armenian, Coptic, Georgian, Greek, Hebrew, Latin, Pahlavi, Persian, Syriac, and Turkish, and I'm currently experimenting with using the following:

in tokenizer.xsl: <xsl:value-of select="replace(lower-case(replace(normalize-unicode($word, 'NFD'), '[\p{M}]', '')),'ʿ|ʾ','')"/>

in search.js:


//Step 1: remove ʿ and ʾ
strSearch = strSearch.replace(/(ʿ|ʾ)/g, '');
//Step 2: lowercase
strSearch = strSearch.toLowerCase();
//Step 3: remove diacritics
strSearch = strSearch.normalize("NFD").replace(/[\u0300-\u036f]/g, "");

This allows users to enter a term with or without diacritics and get the same results. This meets most of our needs at the moment, but I am still testing the Arabic, which may have some additional requirements. There may be a better way to do this?
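
As an aside, if Unicode property escapes are available (ES2018+ browsers), the query-side normalization could strip all combining marks, matching the [\p{M}] class used in tokenizer.xsl rather than only the U+0300–U+036F block. A sketch with a hypothetical helper name, not part of staticSearch:

// One query-side normalizer mirroring the XSLT: strip the modifier letters,
// lowercase, decompose, then remove every combining mark (not just U+0300–U+036F).
function normalizeTerm(str) {
  return str
    .replace(/[ʿʾ]/g, '')        // strip the half-ring modifier letters
    .toLowerCase()
    .normalize('NFD')
    .replace(/\p{M}/gu, '');     // requires Unicode property escape support
}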

Thanks again for your help.

martindholmes commented 4 years ago

@wsalesky We do have a ticket for pluggability https://github.com/projectEndings/staticSearch/issues/40 and we'd appreciate any feedback there. Ideally, nobody should have to patch our actual code to plug in another stemmer or even override a specific function, but we haven't yet figured out the most effective way to handle that.

We hadn't really considered using the codebase for a site that mixes multiple languages, so I'm intrigued at what you're doing here; are you building different search indexes with different stemmers for each language, or are you ignoring stemming for the moment? If you're not stemming, then I can see why wildcards would be essential.

wsalesky commented 4 years ago

@martindholmes We are not stemming; it is too much of a challenge to cover all the languages, some of which may not have much in the way of existing resources (Syriac, for example).

It would be great to have some sort of plugin architecture for handling special cases. For example, in addition to the NFD string normalization, we also need to strip ʿ and ʾ characters from words, and there may need to be some special handling for Arabic and Syriac; I'm still testing the results I have.

martindholmes commented 4 years ago

I'd like to start working on a detailed plan for this. The only code that needs to be changed is the JavaScript, but it's quite a complex thing, so I'd like to start by creating a step-by-step outline of what needs to be done.

Ideally, the whole thing is handled in a pre-processing step which (in the background) cleans up the search string (removing e.g. quotes from around wildcard-containing strings, since we won't support wildcards in phrasal searches), expands the wildcard into a set of may-contain single terms, and either aborts with a message if that's too complicated a search (e.g. there are fifty terms), or invokes the regular search based on the expanded search string. This needs:

How does this look?

joeytakeda commented 4 years ago

This all sounds right to me!

To address this:

what do we do if a phrasal search legitimately includes a question mark?

I think we just want to ignore all wildcard operators in the phrasal search, so I think we just ignore it here too.

martindholmes commented 4 years ago

I think the intention was that if you enter this:

"one fin* day"

we would split it into three tokens:

one fin* day

and search for them. But if you enter:

"one fine day?"

should we conclude that you intend a phrasal search including a question mark (which is perfectly OK, no?), or that you're trying to use a wildcard?

Cheers, Martin

martindholmes commented 4 years ago

@joeytakeda I think I finally understood what you're suggesting, and of course it's perfectly correct: we always honour the quotation marks, so if you search for

"one fin* day"

we do a phrasal search for exactly that, including the asterisk. Problem solved.

Next problem: how on earth do we judge whether a wildcard search is going to result in an explosion of results? These, for instance:

*tion (Do we ban leading wildcards? What about languages with leading inflections?)

th* (Do we set a minimum length for the literal component? What about CJK languages where a "word" is two components? Perhaps they would need their own rules, so this has to be pluggable.)

joeytakeda commented 4 years ago

I think your above suggestion of a default maximum of, say, 10 or 20 result tokens--this would be configurable--is probably right and, in fact, perhaps the only measure we'd want to take for preventing too many results.

Rather than impose an arbitrary length requirement (2 or 3 characters that aren't wildcards), we can instead just limit wildcard searches by how many result tokens they return. Something like th* would probably be an explosion of results (and would thus say "Sorry, too many results were returned. Try something more specific"), but something like x* or *q or qu* or *mt would, for some sites, only return a few hits and so should be a permissible search.
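
A sketch of that policy, assuming the stemmed token list is available as a plain array; the names, and the handling of only * and ?, are illustrative rather than the actual staticSearch code:

// Expand a wildcard term against the token list, capping by the number of
// matched tokens rather than by literal length (maxWildcardTokens configurable).
function expandWildcard(term, allTokens, maxWildcardTokens = 20) {
  // Escape regex metacharacters, leaving * and ? to be mapped to regex equivalents.
  const pattern = '^' + term.replace(/[.+^${}()|[\]\\]/g, '\\$&')
                            .replace(/\*/g, '.*')
                            .replace(/\?/g, '.') + '$';
  const rx = new RegExp(pattern, 'i');
  const matches = allTokens.filter(t => rx.test(t));
  if (matches.length > maxWildcardTokens) {
    return { ok: false, message: 'Sorry, too many results were returned. Try something more specific.' };
  }
  return { ok: true, tokens: matches };
}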

joeytakeda commented 4 years ago

As per meeting today between @martindholmes and @joeytakeda, we've decided on the following approach:

martindholmes commented 4 years ago

Work is under way in branch issue_48_wildcards.

martindholmes commented 4 years ago

Additional decisions: add a configuration element that says whether you want wildcards or not in your search; if you do, then a) we will add file weight to the ssTokens file because we'll need it, but b) we will not bother to download that file in the search page because you won't need it.

martindholmes commented 4 years ago

@joeytakeda I have this basically working, and I'd be grateful if you'd play around with it and see what you think. The only constraint it has right now is a default limit of 1MB on the JSON resulting from each wildcard term; there are no sanity checks at all on the wildcard itself, so you can search for *e* if you want, which is amusing but not ultimately desirable. I very quickly came across the limitation I was expecting; if you search for pu*le in the test data, you will not find puzzle, because the term is stemmed to puzzl in ssTokens and therefore doesn't match. We can't pass pu*le to the stemmer to get the final e lopped off because the stemmer is not expecting wildcards. There may be a way around that, or alternatively we may just live with the limitation.

joeytakeda commented 4 years ago

Fabulous! I was testing with the emod stuff (ant -Dconfig=config_emod.xml) since I couldn't think of many good test cases in our standard test set. The EMOD set doesn't have any filters configured, so I can't speak to how it interacts with filters yet.

Some of the searches I tried were:

lo*e
lo[uv]e
l*
[vv|w]o*
hono?r
16[0-9][0-9] (just to see if it works with numbers and with character classes, and it does!)
ha*y (to test the pu*le problem, to try and retrieve both happy and Harvey)
"Fame fixt vpon my Head?" (to test exact phrase)

A few questions/bugs:

Those searches produce URLs like these:

lo[uv]e: staticSearch/emod/search.html?q=lo%5Buv%5De
lo*e: staticSearch/emod/search.html?q=lo*e
lo?e: staticSearch/emod/search.html?q=lo?e

(though the '?' and '*' in the last two should probably be escaped, so as not to interfere with the URL params, I'd imagine).

But if you go to those URLs in your local environment, they will populate the input box, but it returns a fetch error:

VM117:1 GET http://localhost/staticSearch/emod/undefinedssTokens.json 404 (Not Found)
ssSearch.js:508 Error attempting to retrieve token list: SyntaxError: Unexpected token < in JSON at position 0

And the exact-phrase search returns this result:

<mark>Fame fixt vpon my Head?</mark>? beneath me, round,...

It yields another question mark for some reason; and if you change the query to be

"Fame FiXt VPon mY hEAd" (no wildcard this time)

You get:

<mark>Fame FiXt VPon mY hEAd</mark>? beneath me, round,...

That looks to be a pre-existing bug though (http://johnkeats.uvic.ca/search.html?q=%22iN%20thE%20ICY%20sileNCE%20of%20the%20TOMB%22), so I'll file a different ticket.

martindholmes commented 4 years ago

Fixed the period issue. I have a thought about the weight limit thing: So far we had vaguely thought that we might show an error message if a search is too large and complicated to be done. That's slightly complicated because it requires explanatory language and a method of displaying it. Instead, how would it be if, having reached the weight limit for a specific wildcard term, we simply stop adding new terms to the search? If we've hit the weight limit already, the chances are that the size of the result set is going to be enormous anyway.
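
Roughly, using the per-token file weights stored in ssTokens and the 1MB default mentioned above; a sketch with hypothetical names, not the actual implementation:

// Silently cap a wildcard term's expansion by cumulative file weight:
// keep adding matched tokens until the byte budget is spent, then stop.
function capByWeight(matchedTokens, tokenWeights, maxBytes = 1024 * 1024) {
  const kept = [];
  let total = 0;
  for (const tok of matchedTokens) {
    if (total >= maxBytes) break;        // budget spent: quietly stop adding terms
    kept.push(tok);
    total += tokenWeights[tok] || 0;
  }
  return kept;
}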

martindholmes commented 4 years ago

The weird capitalization issue should now be fixed in commit #086e010 (in this branch).

martindholmes commented 4 years ago

Implemented the silent weight limit, and added a wildcard search to the test set.

martindholmes commented 4 years ago

@joeytakeda I can't reproduce the problem with the browser history. Could you specify which browsers you see this in?

joeytakeda commented 4 years ago

@martindholmes Both Firefox (76.0.1) and Chrome (83.0.4103.61) fail to perform this search when instigated by the URL (i.e. if you just go to the URL rather than typing it in):

http://localhost/staticSearch/emod/search.html?q=lo%5Buv%5De (I also tried with a python simple server, and same thing)

Firefox says:

Error attempting to retrieve token list: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data ssSearch.js:508:23

martindholmes commented 4 years ago

@joeytakeda I can't build the emod stuff because it's got a broken config file; it's not been updated from @xpath to @match. How are you getting it to work? I'm going to fix that config file in this branch. Also, it doesn't have wildcard searching turned on in the config file yet.

joeytakeda commented 4 years ago

@martindholmes Weird: I thought I had committed all of the valid config files in this commit: https://github.com/projectEndings/staticSearch/commit/c73efb54f2e3aca45558eb0379a702f4d1e6be42. What I suppose happened is that I changed all of them in one go, but since I haven't touched config_emod recently (other than the addition yesterday), I only added config_moeml (which should be deleted) and not config_emod, so when I committed, the change to config_emod didn't go through.

martindholmes commented 4 years ago

@joeytakeda Having fixed the emod config file and rebuilt, it works fine for me in FF, Opera, Chromium and Chrome. Did you perhaps commit your changes to the dev branch instead of this one?

martindholmes commented 4 years ago

I'm going to commit fixes in this branch if that's OK.

martindholmes commented 4 years ago

I have one remaining bug, which is that a search fails to run if it's initiated cold from a browser URL. This is something to do with object references -- it's trying to retrieve undefinedssTokensundefined.json in the doSearch() function. Working on it...

martindholmes commented 4 years ago

Fixed in commit #98bbf88. I believe we're now ready to merge this into dev. @joeytakeda Can you think of anything remaining besides documentation? If not, let's merge, and see what the big project builds throw up.

martindholmes commented 4 years ago

After switching the Keats branch and testing, I've hit another issue that I think we should deal with: If you wildcard-search for a name (using capitalization), you won't find it, because it's stored unstemmed. If you do the same search with a lower-case version, it works. Options are:

  1. Make the wildcard functionality case-insensitive. I've done that provisionally anyway, solving the immediate problem.
  2. Make a much bigger decision which I think is looming anyway: use a wordlist of all distinct forms, rather than the stemmed token list, for wildcard searching. How big would it be? Need to do some preliminary testing there.

joeytakeda commented 4 years ago

Make a much bigger decision which I think is looming anyway: use a wordlist of all distinct forms, rather than the stemmed token list, for wildcard searching. How big would it be? Need to do some preliminary testing there.

I've been thinking about this too. It's definitely worth testing, I think. The tokens file for MoEML is currently 1.3MB (411kB gzipped). While I wouldn't be surprised at all if the file ends up being 2-3x the size on disk, the actual file size received by the user over the server may not be much bigger at all, since the bulk of what we'll be adding are strings that are very similar to each other, which gzipping can compress nicely.

joeytakeda commented 4 years ago

Just experimenting locally with the WEA, which yields some interesting results:

File         Disk size   gzipped
No Forms     270 KB      102 KB
With Forms   880 KB      195 KB

So the file size itself is 3.26x larger, but only 1.91x larger gzipped.

In terms of what would need to happen if we decide to implement this, it's a fairly minor change at the XSLT stage: retrieve the JSON from the URI, get all of the distinct forms, and put them in an array:

<map:array key="{$thisToken}">
    <map:map>
        <map:number key="size"><xsl:value-of select="$thisSize"/></map:number>
        <map:array key="forms">
            <xsl:variable name="thisJson" select="unparsed-text($thisUri) => json-to-xml()"/>
            <xsl:for-each select="distinct-values($thisJson//map:string[@key='form']/string(.))">
                <map:string><xsl:value-of select="."/></map:string>
            </xsl:for-each>
        </map:array>
    </map:map>    
</map:array>

The resultant JSON looks like so:

    "feel": [
        {
            "size": 116565,
            "forms": [
                "feel",
                "feeling",
                "feelings",
                "Feeling",
                "feels",
                "Feel"
            ]
        }
    ]

We could change this of course, but at the very least, I think the above results are fairly promising. What do you think?

martindholmes commented 4 years ago

I'd actually imagined something way simpler: just an array of all the forms. There's no point in putting this information in the token files because we need it before we retrieve the token files. What we'd do is:

Assuming the forms file isn't too huge, this wouldn't add much to the query parsing, and it would solve the pu*le problem. We could also figure out some cunning strategies to avoid testing against forms one by one; we could construct a single massive string out of it, with delimiters, and match in a single operation against that.

joeytakeda commented 4 years ago

This is the ssTokens file; what I'm suggesting is that rather than getting two files (a forms file and the tokens file), we just have the one ssTokens file, which contains all of that information. We look through the ssTokens file and, as soon as a form matches, get the token file for it, up to the max size limit.

martindholmes commented 4 years ago

I get that, but the structure will then preclude cunning lookups. We could actually eschew JSON and use a single file looking like this: feel_feeling_feelings_Feeling... against which we could run a single regex and get an array of matches back. We could also substantially reduce the weight by deciding that all wildcard matches are case-insensitive (as they are in this current implementation), so that file would be smaller. I'm inclined to go that way. I also realize that if we do this, we don't need the ssTokens file for anything other than weight. If we can think of a better constraint method, this could be much faster anyway.

joeytakeda commented 4 years ago

Ah, that makes sense; the only issue then is that we would have to stem each form repeatedly, but I'm not sure what the overhead of that would be.

Could we get the file sizes from the response headers and use those to track the download size, instead of using the tokens file?

martindholmes commented 4 years ago

I think the overhead of establishing a connection for all those headers would be more significant than the download itself.

martindholmes commented 4 years ago

The MoEML release is done, so I'm going to merge what we have so far into dev, and see how things work with the various projects. Meanwhile, on this branch, I think we should try the option of a single unique forms text file.

martindholmes commented 4 years ago

@joeytakeda Having had a chance to test out wildcards on the MoEML site, I think the pu*le problem is actually more serious and more evident than we hoped, and I think we have to handle it.

I think the way forward is to provide the list-of-unique-words as a string, as we've briefly discussed, and match against that. Given a string:

|antimatter||attempt||attend||attest||attesting||automobile|

(We need double delimiters so that contiguous items will both get returned.) Then the input

att*

would give us a regex constructed as:

/\|(att.*?)\|/gi

(Note that we can't use lookbehinds because they're not supported yet in e.g. Safari.)

We would get back a nested array (JSON.stringified here) in which the first item in each pair is the complete match and the second is the captured word:

"[[\"|attempt|\",\"attempt\"],[\"|attend|\",\"attend\"],[\"|attest|\",\"attest\"],[\"|attesting|\",\"attesting\"]]"

from which we take [0][1], [1][1], [2][1] etc:

attempt, attend, attest, attesting

and we stem each of those; then we retrieve everything from the unique stems.

We can still use ssTokens (if we want) to check the weight limit.

This may seem less elegant than an approach which splits the words into an array and maps or filters the array, but I think it's going to be way faster; obviously we should check that, but a single regex operation against a single string must surely beat an iterated process.
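
Putting that together as a rough sketch (the function name is hypothetical and this is not the actual ssSearch.js code; String.prototype.matchAll needs a reasonably recent browser):

// Expand a wildcard term against the delimited word string in one regex pass.
// Metacharacters are escaped, then * and ? are mapped so they cannot cross a
// | delimiter; the capture group is the bare word form.
function wildcardToForms(term, wordString) {
  const inner = term.replace(/[.+^${}()|[\]\\]/g, '\\$&')
                    .replace(/\*/g, '[^|]*')
                    .replace(/\?/g, '[^|]');
  const rx = new RegExp('\\|(' + inner + ')\\|', 'gi');
  return Array.from(wordString.matchAll(rx), m => m[1]);
}

Against the example string above, wildcardToForms('att*', ...) would return attempt, attend, attest and attesting, each of which is then stemmed before the stem files are retrieved.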

The only question I can see right now is: how big will that string of unique words get?

Can you see any flaws here?

martindholmes commented 4 years ago

One additional thought: if the string-of-words (ssStringOfWords.txt?) turns out to be very large, we could break it up by initial letter and retrieve based on the search query. That doesn't help with leading wildcards, of course; it might be cleaner just to get the whole thing.

martindholmes commented 4 years ago

An implementation as outlined above is basically finished and working at commit 6a4c0d4. It needs testing and documentation before rolling into dev.

martindholmes commented 4 years ago

I've done a bunch of experimentation and tweaking, and I think I have the balance right now; I've based the new approach on a sort of phrasal search consisting of a single word (as we did before), but starting from the initial lookup in the ssWordString resource. I've merged this work into the dev branch so we can see its effects in the larger projects, although I did test locally with MoEML and the results were great.

The one remaining issue is that in order to get sensible results, you really have to have phrasal searching turned on. We could fork the lookup process so that if you don't have phrasal, it defaults to the much cruder method of finding the word, stemming it, and treating all the stem results as matches, but that's quite unintuitive. We could make it a condition of wildcard support that you also have phrasal support.

martindholmes commented 4 years ago

I believe these are the only remaining things on this ticket:

  1. Make the minimum number of literal characters allowed in a search term configurable. This is set to 3, which is appropriate for English but obviously won't be for other languages; it is currently hard-coded in the JS file.

  2. Make the maximum number of stem files retrieved for each search configurable. Currently this is set to 50, hard-coded in the JS file.

martindholmes commented 4 years ago

The charsRequired and termLimit constraints are now configurable in the JS, although there is no handling yet for them in the XSLT.

wsalesky commented 4 years ago

@martindholmes and @joeytakeda thanks for all the work on this, I'm looking forward to testing out the new code with our Syriac language texts.

martindholmes commented 4 years ago

@wsalesky Please let us know how it goes! I have a default minimum length of 3 literal characters set for any wildcard term right now, but if that's too high for Syriac, I can show you how to change it.

martindholmes commented 4 years ago

I believe this is now implemented and has been thoroughly tested in live projects; any further work would be enhancement or bugfixing, so I'm closing this ticket.

wsalesky commented 4 years ago

@martindholmes @joeytakeda Finally had a chance to test this and the custom stemmer option, both work really well for our project. Thanks for all the updates!

martindholmes commented 4 years ago

I do think it would be a good idea to add a diacritic-insensitive search. We'd want it to be something generic, though, stripping all diacritics rather than just those from specific languages. How practical do you think that would be?