sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

webtc2 too many results #286

Open funderburkjim opened 4 years ago

funderburkjim commented 4 years ago

The webmaster pointed out an issue with advanced search display.

To summarize, he mentioned that abnormal cpu usage occasionally occurs, in conjunction with very long-running requests.

He provided an example user url which caused such an event. It involved the Advanced Search.

After some examination, two changes were made in advanced search that (a) fix the specific example query, and (b) impose a limit that should bound cpu usage for other advanced search queries that the specific fix doesn't handle. Details are described in the comments below.

funderburkjim commented 4 years ago

Description of problem

Here is a screenshot of the problem, on a local XAMPP installation:

image

funderburkjim commented 4 years ago

Why the example query runs so long

When the example search is initiated, a regular expression is constructed based on the various user choices of settings. Among other things, non-ascii characters of the input are removed; in our case the input is ऐश्वर्य, which has only non-ascii characters, so removing them leaves only the empty string. The result is that the regexp is [^a-zA-Z0-9]()[^a-zA-Z0-9], which matches any two adjacent non-alphanumeric characters. This regexp is used to search every line of query_dump, so every line in query_dump having (for example) two consecutive spaces matches. The result is that almost every line matches!

Then, since the user is requesting all matches, the number of matching headwords will be almost every headword in the dictionary -- e.g. for mw 200,000 or so.
Finally, the program generates html for all these headwords (probably several hundred megabytes of html) and attempts to send it to the user's browser.
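
The effect is easy to reproduce outside the search program. The following is a minimal sketch in Python, not the code actually deployed: the helper names `strip_non_ascii` and `build_pattern` and the sample lines are made up for illustration, and the sanitizing step is assumed to simply drop every non-ascii character.

```python
import re

def strip_non_ascii(text: str) -> str:
    # Assumption: sanitizing keeps only ascii characters (code point < 128).
    return "".join(ch for ch in text if ord(ch) < 128)

def build_pattern(word: str) -> str:
    # Hypothetical helper: puts the sanitized word between two
    # non-alphanumeric delimiters, giving the regexp quoted above.
    return "[^a-zA-Z0-9](" + re.escape(word) + ")[^a-zA-Z0-9]"

sanitized = strip_non_ascii("ऐश्वर्य")   # every character is non-ascii
pattern = build_pattern(sanitized)

# Made-up stand-ins for lines of query_dump.
sample_lines = [
    "aMSa  share, portion",  # two consecutive spaces
    "agni: fire",            # colon followed by a space
    "deva",                  # no two adjacent non-alphanumerics
]
matches = [line for line in sample_lines if re.search(pattern, line)]

print(repr(sanitized))  # the empty string
print(pattern)
print(len(matches), "of", len(sample_lines), "sample lines match")
```

With an empty capture group, the pattern no longer depends on the search word at all, so any line containing two adjacent punctuation or space characters matches.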

funderburkjim commented 4 years ago

First fix

The first fix simply checks if, after removing non-ascii characters, the user input is the empty string. If so, the program immediately fails.

Incidentally, the reason for removing characters from the user's input is to attempt to guard against cross-site scripting attacks.
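
The check in the first fix can be sketched as follows. This is a Python illustration only, not the deployed code; the function name `sanitize_search_word` and the choice to raise `ValueError` are assumptions, since the comment above only says that the program fails immediately.

```python
def sanitize_search_word(raw: str) -> str:
    """Drop non-ascii characters and reject input that ends up empty."""
    # Assumption: sanitizing means keeping only ascii characters.
    cleaned = "".join(ch for ch in raw if ord(ch) < 128)
    if cleaned == "":
        # Fail before any regexp is built or any file is searched.
        raise ValueError("search word is empty after removing non-ascii characters")
    return cleaned

for word in ["rAma", "ऐश्वर्य"]:
    try:
        print(word, "->", sanitize_search_word(word))
    except ValueError as err:
        print(word, "-> rejected:", err)
```

Failing at this point means the expensive steps (scanning query_dump and generating html) never start for such input.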

funderburkjim commented 4 years ago

Second Fix

There are almost surely circumstances, in addition to those that the first fix addresses, that could occur in the advanced search where all or too many matches might occur.

The safest way to deal with this is simply to omit the 'all' option for the number of returned results. So, that is what the second fix does. Now, the maximum number of records returned is 1000. Surely, for almost all practical purposes 1000 records is ample.
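
One way to enforce such a cap on the server side is sketched below in Python. It is an illustration under assumptions, not the deployed code: the name `resolve_max_results` is hypothetical, and treating a leftover 'all' value (for example from an old bookmarked url) as a request for the cap is a design choice made here, not something stated above.

```python
MAX_RESULTS = 1000  # the cap described above

def resolve_max_results(requested) -> int:
    """Turn the user's 'number of results' setting into a bounded integer."""
    try:
        n = int(requested)
    except (TypeError, ValueError):
        # Assumption: 'all' or any other non-numeric value gets the cap.
        return MAX_RESULTS
    # Clamp to the range 1..MAX_RESULTS.
    return max(1, min(n, MAX_RESULTS))

for value in ["50", "all", "5000"]:
    print(value, "->", resolve_max_results(value))
```

Clamping on the server, rather than only removing 'all' from the form, also covers requests whose url is edited by hand.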

funderburkjim commented 4 years ago

Installed for all dictionaries

As usual, the changes mentioned above were first made in a local copy of the csl-websanlexicon repository, then tested on a local server. Finally, the repository was pushed to Github and pulled to the sanskrit-lexicon server. Then one dictionary was tested on the Cologne server, and when all looked well, the changes were installed for all dictionaries.

The bug should be fixed now.

funderburkjim commented 4 years ago

Here's a screen shot after the fix (using ap90 dictionary this time).

image

gasyoun commented 4 years ago

> Now, the maximum number of records returned is 1000.

Bad news. So now we will not even know how many entries are found in total.

funderburkjim commented 4 years ago

You could use 'next' to get the next 1000.

gasyoun commented 4 years ago

> next to get the next 1000.

That is a pain. And I can't get stats at once.

funderburkjim commented 4 years ago

> can't get stats at once

True. Why don't you open an issue describing this as an enhancement?

While we don't want to return all dictionary entries, it might be possible to return the total number of matches as a separate, new statistic, independent of the records returned.
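
A rough sketch of that idea in Python, purely as an illustration: the function `search_page`, its parameters, and the sample data are hypothetical and not part of the existing search code. It counts every matching line but keeps only one page of them, so the total is available without returning every record.

```python
import re
from typing import Iterable, List, Tuple

def search_page(lines: Iterable[str], pattern: str,
                start: int = 0, limit: int = 1000) -> Tuple[int, List[str]]:
    """Return (total number of matches, matches start .. start+limit-1)."""
    regex = re.compile(pattern)
    total = 0
    page: List[str] = []
    for line in lines:
        if regex.search(line):
            # Keep the match only if it falls inside the requested page.
            if total >= start and len(page) < limit:
                page.append(line)
            total += 1
    return total, page

# Made-up data: 2500 fake headwords that all end in "ja".
fake_headwords = [f"w{i}ja" for i in range(2500)]
total, first_page = search_page(fake_headwords, r"ja$")
_, next_page = search_page(fake_headwords, r"ja$", start=1000)
print(total, len(first_page), next_page[0])
```

Note that this still scans every line, so it bounds the size of the output but not the cost of the scan itself.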

funderburkjim commented 4 years ago

What are some examples of searches you try that have more than 1000 matches?

gasyoun commented 4 years ago

> try that have more than 1000 matches?

ja as suffix.

ja