shadisabzali / dataparksearch

Automatically exported from code.google.com/p/dataparksearch
GNU General Public License v2.0

Sub-string searches #25

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi Maxime,

We are using the latest snapshot 4.53 from 2010_01_19 (with stop word regular
expressions, thanks!) on Red Hat 5.4 64-bit with MySQL 5.1.42.

Our MySQL database is roughly 4 million URLs with 20GB of data and indexes.
We originally started using cache mode, but our users weren't pleased with
the results. The root of the issue was that sub-string searches aren't
supported by cache mode, so we switched over to MySQL multi mode without
CRCs, which does support sub-string searches.

Our indexing speed greatly improved with MySQL, but search performance has
suffered; we sacrificed speed for relevancy, in our opinion. We are actually
quite disappointed with MySQL and its inability to parallelize a single
query. Our server has multiple cores and lots of memory, so we have loaded
all the tables and indexes directly into memory, about 20GB worth. We've
noticed that an individual query against a single dictionary table with
75 million rows returns rather quickly, in around 3 seconds. This performance
is OK, but we see a chance to improve sub-string searches and overall
performance. Every query that dpsearch issues uses a LIKE statement and
needs to scan the entire index; we realize this is the price we pay to
support sub-string searching. The issue we see is that dpsearch issues a
query to the dict(x) table, waits for the result, issues the next query to
the dict(x+1) table, waits for that result, then dict(x+2), and so on up to
dict32, and only then combines the results. Does dpsearch have the ability
to issue multiple queries at once, possibly a configurable number of them?
This feature would work around MySQL's inability to parallelize a single
query and put some otherwise idle resources on our server to use. Any
thoughts or other suggestions for getting good performance with sub-string
searches?
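
To illustrate the pattern, each per-table query looks roughly like the
following (the table and column names here are just illustrative shorthand,
not necessarily what dpsearch actually selects):

SELECT url_id, intag FROM dict8 WHERE word LIKE '%smith%';
SELECT url_id, intag FROM dict9 WHERE word LIKE '%smith%';
-- ... one table at a time, each waiting on the previous, up to dict32

Issuing several of these concurrently would keep more than one core busy.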

Thanks for all your hard work. 

Original issue reported on code.google.com by Imlbr...@gmail.com on 27 Jan 2010 at 7:10

GoogleCodeExporter commented 9 years ago
Sorry, this was classified as a defect. It's probably a little bit of both,
enhancement and defect.

Original comment by Imlbr...@gmail.com on 27 Jan 2010 at 7:11

GoogleCodeExporter commented 9 years ago
You can use the "MakePrefixes yes" command to produce all prefixes for a
word. This was implemented for search suggestions, but it could be useful
for you, and it can be used with dbmode cache. If you want to try it, place
this command in both the indexer.conf and search.htm files.
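
For example (the # lines are only annotations):

# in indexer.conf:
MakePrefixes yes

# in search.htm:
MakePrefixes yes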

Which language do you index? For languages that have ispell dictionaries
it's possible to do fuzzy search using that data; usually this gives better
relevancy of results. See
http://www.dataparksearch.org/dpsearch-fuzzy.en.html#ISPELL
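
Roughly, enabling it looks like this (the language, charset and file names
below are placeholders; see the linked page for the exact syntax and where
to get the dictionary files):

Affix en iso-8859-1 en.aff
Spell en iso-8859-1 en.dict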

Yes, parallel queries for the words in the index are in my TODO for the
next version, but this feature would be implemented first for dbmode cache.
At the moment, DataparkSearch only does the merging of read data in parallel
(and only if it was compiled with pthread support).

Original comment by dp.max...@gmail.com on 28 Jan 2010 at 12:25

GoogleCodeExporter commented 9 years ago
Thanks for the suggestions. I know cache mode is extremely fast for
searching, but it caused some issues for us. One of the cache files grew
extremely large, literally orders of magnitude larger than any other file.
Most of our cache mode files were 20-50MB, but one file (I believe wrd01f.*
or wrd01e.*) was 250GB and it just kept growing exponentially compared to
the other files. This was with the 4.52 version and we were going to run out
of disk space. Also, we couldn't see which words had been indexed, which is
even more necessary with the new and quite awesome stopword regex feature
(thanks again).

An example of a search that did not return the desired results in cache
mode was a search for someone's last name in our organization. The user was
expecting to see a hyperlink to a TWiki page whose name is the person's
first name and last name concatenated together. I briefly looked at the link
for fuzzy searches and I think it would help considerably, but possibly not
in this scenario: from my brief review, fuzzy search can help with many
common word variations, but it may not help with our firstname/lastname
case. Would fuzzy search help in this scenario?

Based on your input, I think we are going to try using Postgres with
partitioning and take advantage of Postgres's ability to perform query
parallelization. Hopefully we can get the search performance we are looking
for from the database. It sounds like cache mode is the most popular mode
for deploying dpsearch, and I understand why you would implement that first.
I would really like to see multiple query threads when search queries are
performed in multi mode.

Original comment by Imlbr...@gmail.com on 28 Jan 2010 at 4:28

GoogleCodeExporter commented 9 years ago
I don't know how the firstname and lastname are concatenated in your case.
But there are some ways in dataparksearch to boost the influence of certain
document parts on relevancy. E.g. if those names are enclosed in H1 tags
(H2, H3, etc.), you could specify a separate section:
Section h1 10 0
Then with the &wf CGI parameter you can increase the weight of this section.
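
The wf value is, roughly, a string of hex digits where the N-th digit from
the right is the weight factor for section N. So with the section defined as
number 10 above, a request along the lines of

search.cgi?q=smith&wf=F111111111

gives that section the highest factor (F) while leaving the others at 1
(the exact digits are just an illustration; check the search parameters
documentation for the precise format).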

More sophisticated hinting is with an external SQL table, say
CREATE TABLE bookmark (hint text, url text);
CREATE INDEX bookmark_url ON bookmark (url);
Then you could define an SQL-based section
SectionSQL db.title 11 0 "SELECT hint FROM bookmark WHERE url='$(URL)'"
and increase the weight of this section in the same way as above. You can
put any relevant data into the "bookmark" table per URL (and you can have
several rows for the same URL), e.g. you can write the firstnames and
lastnames in the conventional way for every TWiki page.
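
For example, for a hypothetical TWiki topic "JohnSmith" the rows could look
like this (the URL and the names are of course made up):

INSERT INTO bookmark (hint, url)
  VALUES ('John Smith', 'http://twiki.example.com/bin/view/Main/JohnSmith');
INSERT INTO bookmark (hint, url)
  VALUES ('Smith, John', 'http://twiki.example.com/bin/view/Main/JohnSmith');

Then a query for just the last name matches the hint section even though the
page name has the names concatenated.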

BTW, documents that have the query words in a greater number of document
sections usually rank higher in dataparksearch, so the combination of these
two methods gives better results.

Original comment by dp.max...@gmail.com on 28 Jan 2010 at 10:22