Hi, thanks for the praise :-)
As for the ordering: this has already been discussed in Issue 39. In short, the ordering algorithm currently works as follows:
1. common (P) wins
2. otherwise, the shortest word wins
3. otherwise, the more common word wins. The commonality index is taken from the KANJIDIC dictionary (e.g. the most common Japanese kanji are: 日一国会人). Katakana and hiragana characters receive a commonality of 1, kanji characters get the commonality from the F field of the KANJIDIC + 1, and unknown kanji receive a commonality of 1002. The commonality index of a word is the sum of the commonality indices of all its characters. See KanjiUtils.getCommonality() for details.
4. otherwise, perform a string comparison
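The four steps above can be sketched as a plain Java comparator. This is a rough illustration only - the names (DictSorter) and the tiny hard-coded F-value map are stand-ins, not Aedict's actual KanjiUtils code:

```java
import java.util.HashMap;
import java.util.Map;

public class DictSorter {
    // Tiny stand-in for the KANJIDIC F-field lookup; real values come from
    // the dictionary file (e.g. 日 is the most common kanji, F = 1).
    static final Map<Character, Integer> KANJIDIC_F = new HashMap<>();
    static {
        KANJIDIC_F.put('日', 1);
        KANJIDIC_F.put('一', 2);
    }

    static boolean isKana(char c) {
        // hiragana U+3040..U+309F, katakana U+30A0..U+30FF
        return c >= 0x3040 && c <= 0x30FF;
    }

    // kana = 1, known kanji = F + 1, unknown kanji = 1002
    static int charCommonality(char c) {
        if (isKana(c)) return 1;
        Integer f = KANJIDIC_F.get(c);
        return f != null ? f + 1 : 1002;
    }

    // commonality of a word = sum over its characters
    static int commonality(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) sum += charCommonality(c);
        return sum;
    }

    // the four ordering rules; a negative result means 'a' sorts first
    static int compare(String a, boolean aCommon, String b, boolean bCommon) {
        if (aCommon != bCommon) return aCommon ? -1 : 1;              // 1. (P) wins
        if (a.length() != b.length()) return a.length() - b.length(); // 2. shortest wins
        int ca = commonality(a), cb = commonality(b);
        if (ca != cb) return ca - cb;                                 // 3. more common wins
        return a.compareTo(b);                                        // 4. string comparison
    }
}
```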
First, a bit of explanation. Thanks for showing me the doc.setBoost() method - this would certainly help to save CPU power: the client would not have to sort the results manually. Unfortunately, the method does not work as required by Aedict. Please try the following program (App.java, attached). I would expect the query(1) call to always print "value 299", regardless of the value of the DIV constant. Unfortunately, the output varies depending on the value of the DIV constant. This is probably caused by Lucene converting the boost value to an 8-bit float, losing a lot of precision. The point is that we cannot rely on boost-based sorting.
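The precision loss can be reproduced outside Lucene. The following is a self-contained model of Lucene's single-byte norm encoding (modeled on SmallFloat.floatToByte315: 3-bit mantissa, 5-bit exponent); treat it as an illustration of why nearby boost values collapse to the same score, not as Aedict code:

```java
public class ByteFloat {
    // Model of Lucene's one-byte float encoding used for norms/boosts.
    // Many distinct floats map to the same byte.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

For example, boosts of 1.0f and 1.1f encode to the same byte and both decode back to 1.0f, so any sort order that depended on the difference is lost.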
Thus, unfortunately, we have to sort on the client. You're right that the sorting algorithm fails if there are no annotations for common words, and you're right that the current commonality algorithm is quite stupid :-) The current sorting algorithm is very simple, but quite fast, and it works (more or less). If we want to alter it, we have to be careful not to break Issue 39. So, please feel free to propose a new sorting algorithm if a) it doesn't break anything and b) the advantages are measurable. Perhaps we can try to compute the boost value for the DictEntry. I tried the following inverse boost function: (float)getCommonality()/1000 + (float)getJapanese().length()/5. It wasn't bad for starters, but I guess there is a serious theory lurking there :-) Perhaps we should take inspiration from other dictionaries' sorting algorithms? E.g. the iPhone Kotoba dictionary.
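As a sketch, the inverse boost function above could look like this, with the DictEntry accessors replaced by plain parameters (a lower value means a better match, so entries would be sorted ascending by it):

```java
public class InverseBoost {
    // Hypothetical stand-in: in Aedict the inputs would come from
    // DictEntry.getCommonality() and DictEntry.getJapanese().length().
    static float inverseBoost(int commonality, int japaneseLength) {
        // small penalty for rarity, larger penalty for length
        return commonality / 1000f + japaneseLength / 5f;
    }
}
```

With this function, a common one-character entry outranks both a rare one-character entry and a common but longer entry.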
Original comment by martin.v...@gmail.com
on 18 Apr 2010 at 1:18
Sorry, I'm not a native English speaker, so please let me explain what I meant by the "theory lurking". In short, to create the "boost" function we have to know what the "best" result is. Of course, the best result is the one the user is currently searching for :-) But this is very subjective and unmeasurable, so the boost function is not going to work for all searches. Thus, we have to search for an approximate function which provides the best results for most searches. I would say the function should depend on two things (let's forget the "common" flag for now): word length and word commonality. But before we start hastily guessing constants, isn't it better to first exactly define several criteria for the function? (E.g. if I search for "expression", the term "facial expression" must be among the first 10 results.) Then we can perhaps compute the function constants.
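Checking such criteria could be mechanized along these lines - a sketch with illustrative names only, where a lower score means a better match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

public class CriteriaCheck {
    // A criterion passes when the expected entry ranks within the top N
    // results under a candidate scoring function (lower score = better).
    static boolean inTopN(Map<String, Float> scoredResults, String expected, int n) {
        List<Map.Entry<String, Float>> sorted = new ArrayList<>(scoredResults.entrySet());
        sorted.sort(Map.Entry.comparingByValue());
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            if (sorted.get(i).getKey().equals(expected)) return true;
        }
        return false;
    }
}
```

A suite of such criteria ("for query X, entry Y must be in the top 10") could then score each candidate constant set by how many criteria it satisfies.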
It isn't easy to define the criteria though - there may be different methodologies (e.g. let's take the front page of a Japanese newspaper: searching for the English translation of each word should return the original Japanese word among the first 10 results - hey, this might actually work! :-) ). Comparing all these methodologies is something worth a diploma thesis, I guess :-) I'll ask a friend of mine whether he knows about any research in this field.
So, what I'm trying to say is that, from my point of view, we cannot simply guess constants and change (and perhaps break) the current algorithm, because a lot of users seem to be satisfied with it.
Which dictionary is missing the (P) flag?
Original comment by martin.v...@gmail.com
on 18 Apr 2010 at 1:38
Thanks for investing time in this issue and pointing out the boost problem. A lot of results in your example have the same score, so you are probably right that too much precision is lost due to Lucene investing only one byte in storing the boost together with other criteria.
Okay, boosting is not sorting and not as exact as sorting, but I would still prefer a boost on occurrences in Tanaka over sorting by length (which is basically all that's left if there is no common word marker (P), as there is not in the edict version of the German-Japanese Wadoku dictionary), and I would also prefer to let Lucene do the ranking work.
You are probably also right that there are already a lot of thoughts/theories on this issue and that we should use them. I only use "gjiten" besides aedict, and (although it is otherwise a great application) it sorts alphabetically, which is not what I was looking for. I haven't found source code for Kotoba - is it open source at all?
I do not mind building my own dictionaries and playing around with strange boosts, but I would love to be able to use them with an unmodified aedict. Do you think it would be an option to turn off the post-search sorting in the aedict configuration? That would give users great freedom to build their own dictionaries and influence the Lucene ranking at indexing time in whatever way they like.
Original comment by bastian.mathes
on 20 Apr 2010 at 10:33
As for the Kotoba dictionary, I'm not sure whether it is open-source. I don't even know whether it's free. So, for now, I'll add a configuration option (for experts) which completely turns off client-side (post-search) sorting and uses the ordering provided by Lucene (the boost setting). If you find any good theory or boosting function, please let me know :)
Original comment by martin.v...@gmail.com
on 21 Apr 2010 at 6:40
Original comment by martin.v...@gmail.com
on 25 Apr 2010 at 11:24
This issue was closed by revision a85067449e.
Original comment by martin.v...@gmail.com
on 25 Apr 2010 at 11:31
Sorry for not reacting for so long. Thanks a lot for the change; I am looking forward to it. Currently I cannot build with the usual "mvn install" from the repository ("Project ID: null:aedict-common:jar:null" "Reason: Cannot find parent: sk.baka.aedict:aedict for project: null:aedict-common:jar:null for project null:aedict-common:jar:null"), so I will wait until I can get 1.2 from the Android Market.
Thanks...
Original comment by bastian.mathes
on 1 May 2010 at 10:10
Ooops, I am sorry. Please try now - you will have to skip the tests though, as they may currently fail: mvn clean install -DskipTests
Original comment by martin.v...@gmail.com
on 2 May 2010 at 10:03
That works, thanks...
Original comment by bastian.mathes
on 3 May 2010 at 5:36
Original issue reported on code.google.com by
bastian.mathes
on 17 Apr 2010 at 5:41