Open GoogleCodeExporter opened 8 years ago
Sorry, this was classified as a defect. It is probably a little bit of both,
enhancement and defect.
Original comment by Imlbr...@gmail.com
on 27 Jan 2010 at 7:11
You can use the "MakePrefixes yes" command to produce all prefixes for a word.
This was implemented for search suggestions, but it could be useful for you,
and it can be used with dbmode cache. If you want to try it, place this command
in both the indexer.conf and search.htm files.
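To make "all prefixes for a word" concrete, here is a minimal Python sketch. It is only an illustration of the idea, not DataparkSearch code; the function name `prefixes` and the `min_len` parameter are made up for this example, and the indexer's actual behaviour (for instance any minimum prefix length) may differ.

```python
def prefixes(word, min_len=1):
    """Return every leading substring of `word` with at least `min_len` characters."""
    return [word[:i] for i in range(min_len, len(word) + 1)]


# Every prefix becomes a candidate index entry, so a partial query such as
# "smi" can match a document that contains "smith".
print(prefixes("smith"))
print("smi" in prefixes("smith"))

# Hypothetical concatenated name: a surname at the end of the word is not
# one of that word's prefixes.
print("smith" in prefixes("johnsmith"))
```

Note that prefixes only cover the beginning of a word, so on their own they would not match a last name that appears at the end of a concatenated word.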
Which language do you index? For languages that have ispell dictionaries, it is
possible to do fuzzy search using this data. Usually this gives better relevancy
for results.
See http://www.dataparksearch.org/dpsearch-fuzzy.en.html#ISPELL
Yes, parallel queries for words in the index are on my TODO list for the next
version. But this feature would be implemented first for dbmode cache. At the
moment, DataparkSearch only merges the data it has read in parallel (and only
if it was compiled with pthread support).
Original comment by dp.max...@gmail.com
on 28 Jan 2010 at 12:25
Thanks for the suggestions. I know cache mode is extremely fast for searching,
but it caused some issues for us. One of the cache files grew extremely large,
literally orders of magnitude larger than any other file. Most of our cache mode
files were 20-50MB, but one file (I believe wrd01f.* or wrd01e.*) was 250GB and
it just kept growing exponentially compared to the other files. This was with
the 4.52 versions, and we were going to run out of disk space. Also, we couldn't
see which words had been indexed, which is even more necessary with the new and
quite awesome stopword regex feature (thanks again).
One example of a search that did not return the desired results in cache mode
was a search for the last name of someone in our organization. The user was
expecting to see a hyperlink to a TWiki page named with the person's firstname
and lastname concatenated together. I briefly looked at the link for fuzzy
searches and I think it would help considerably, but possibly not in this
scenario. From my brief review of the link, fuzzy search can help with many
common word variations, but it may not help with our firstname/lastname
scenario. Would fuzzy search help in this scenario?
Based on your input, I think we are going to try using postgres with
partitioning and take advantage of postgres's ability to perform query
parallelization. Hopefully, we can get the search performance we are looking for
from the database. It sounds like cache mode is the most popular mode for
deploying dpsearch, and I understand why you would implement that first. I would
really like to see multiple query threads when search queries are performed in
multi mode.
Original comment by Imlbr...@gmail.com
on 28 Jan 2010 at 4:28
I don't know how firstname and lastname are concatenated in your case. But
there are some ways in dataparksearch to boost the influence of certain document
parts on relevancy. E.g. if those names are enclosed in H1 tags (H2, H3, etc.),
you could specify a separate section:
Section h1 10 0
Then with the &wf CGI parameter you can increase the weight for this section.
More sophisticated hinting is possible with an external SQL table, e.g.
CREATE TABLE bookmark (hint text, url text);
CREATE INDEX bookmark_url ON bookmark (url);
then you could define an SQL-based section
SectionSQL db.title 11 0 "SELECT hint FROM bookmark WHERE url='$(URL)'"
and increase the weight of this section in the same way as above.
You can put any relevant data into the "bookmark" table per URL (and you can
have several rows for the same URL), e.g. you can write the firstnames and
lastnames in the conventional way for every TWiki page.
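As a rough illustration of how such a table could be filled and queried, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in for the real search database, the TWiki URL and the names are hypothetical, and the helper `hints_for` only mimics the SectionSQL query above by binding a parameter where DataparkSearch would substitute $(URL).

```python
import sqlite3

# In-memory SQLite as a stand-in for the real database (an assumption of
# this sketch; the thread itself talks about postgres).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmark (hint text, url text)")
conn.execute("CREATE INDEX bookmark_url ON bookmark (url)")

# Hypothetical TWiki page and names, for illustration only.
page = "http://twiki.example.org/Main/JohnSmith"
conn.executemany(
    "INSERT INTO bookmark (hint, url) VALUES (?, ?)",
    [("John Smith", page), ("Smith, John", page)],  # several rows per URL are fine
)


def hints_for(url):
    """Mimic the SectionSQL query, with a bound parameter in place of $(URL)."""
    rows = conn.execute(
        "SELECT hint FROM bookmark WHERE url = ? ORDER BY hint", (url,)
    )
    return [hint for (hint,) in rows]


print(hints_for(page))
```

Each returned hint would then be indexed as text of the SQL-based section for that URL, so a query for the last name alone can match the page.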
BTW, documents that have query words in a greater number of document sections
usually rank higher in dataparksearch, so the combination of these two methods
gives better results.
Original comment by dp.max...@gmail.com
on 28 Jan 2010 at 10:22
Original issue reported on code.google.com by
Imlbr...@gmail.com
on 27 Jan 2010 at 7:10