mvysny / aedict

Original Aedict 2 source codes
http://www.aedict.eu
GNU General Public License v3.0
40 stars 7 forks source link

Ranking/order of entries #56

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is more an attempt for a discussion than a specific feature request.
It is almost unbelievable how much functionality was added to aedict,
thanks a lot for that. 
The only issue I have is the order of the results. Currently common words
(marked with "(P)") are sorted before the other results and the order
within common words and other words is done by length of the japanese part
and lexical order (if I have seen that correctly). Unfortunately some
dictionaries that I often use (e.g. the german wadoku dictionary) do not
have annotations for common words "(P)".

I tested to remove the ordering of results in ResultActivity.java and
replaced the double query for common words and other words in
DictTypeEnum.Edict with a simple query. 
During index creation I read the japanese part of the tanaka database and
use the number of matches and partial matches in this database to boost an
entry, beeing a common word ("(P)") gives an extra boost. I attach the code
that I used to create an edict-like index at the end of this description.
There are some magic numbers in the code that I found by playing around, it
is just a "proof of concept", I wanted to consider full matches and partial
matches (but rank longer matches higher than shorter ones). 

Of course index creation takes longer, but I have the impression that the
order is more useful now (but I am not very proficient in Japanese). Just
as an idea, could that be a way to improve the ranking/order of entries ?

BufferedReader edict = new BufferedReader(new FileReader(this.sourceFile));
FileUtils.deleteDirectory(new File(this.outDir));
String tanaka = FileUtils.readFileToString(new File("data/tanaka.txt"));
final IndexWriter luceneWriter = new IndexWriter(this.outDir, new
StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
for (String line = edict.readLine(); line != null; line = edict.readLine()) {
  double boost = 0;
  if (line.startsWith("#")) continue;
  final Document doc = new Document();
  doc.add(new Field("contents", line, Field.Store.YES, Field.Index.ANALYZED));
  String word = StringUtils.split(StringUtils.split(line,
"/")[0],"[")[0].trim();
  for (int i = 0; i < Math.max(word.length()-2,1); i++) {
    int countMatches = StringUtils.countMatches(tanaka, word.substring(0,
word.length()-i));
    double part = countMatches * Math.pow(2,0.5*word.length()-2*i);
    boost += part;
  }
  if (StringUtils.countMatches(line, "/(P)/") > 0) {
    boost += 25;
  }
  float finalBoost = 1 + (float) boost/25;
  doc.setBoost(finalBoost);
  luceneWriter.addDocument(doc);
}
luceneWriter.commit();
luceneWriter.close();

Original issue reported on code.google.com by bastian.mathes on 17 Apr 2010 at 5:41

GoogleCodeExporter commented 9 years ago
Hi, thanks for the praise :-)
As for the ordering: this already has been discussed in Issue 39. In short, 
currently 
the ordering algorithm works as follows:

1. common (P) wins
2. otherwise, shortest wins
3. otherwise, more common wins (the commonality index is taken from the 
KANJIDIC 
dictionary, e.g. most common japanese kanjis are: 日一国会人). Katakana 
and hiragana
characters receive commonality of 1, kanji characters get commonality
from the F field of the KANJIDIC + 1. Unknown kanji receive commonality
of 1002. The commonality index of a word is a sum of commonality indices of all 
its 
characters. See KanjiUtils.getCommonality() for details.
4. perform a string comparison

First a bit of explanation. Thanks for showing me the doc.setBoost() method - 
this 
would certainly help to save the CPU power: the client would not have to sort 
the 
results manually. Unfortunately, the method does not work as required by 
Aedict. 
Please try the following program (App.java, attached). I would expect that the 
query(1) call always prints "value 299", regardless of the value of the DIV 
constant. 
Unfortunately, the output varies, depending on the value of the DIV constant. 
This is 
probably caused by Lucene converting the boost value to 8-bit float, losing a 
lot of 
precision. The point is that we cannot rely on the boost sorting.

Thus, unfortunately we have to sort on the client. You're right that the 
sorting 
algorithm fails if there are no annotations for common words. And you're right 
that 
the current commonality algorithm is quite stupid :-) Current sorting algorithm 
is 
very simple, but quite fast, and it works (more or less). If we want to alter 
it, we 
have to be careful not to break Issue 39. So, please feel free to propose a new 
sorting algorithm if a) it doesn't break anything and b) the advantages are 
measurable. Perhaps we can try to compute the boost value for the DictEntry. I 
tried 
the following inverse boost function: (float)getCommonality()/1000 + 
(float)getJapanese().length()/5
it wasn't so bad for the starters, but I guess there is a serious theory 
lurking 
there :-) Perhaps we should inspire ourselves by others dictionaries sorting 
algorithms? E.g. the iPhone kotoba dictionary.

Original comment by martin.v...@gmail.com on 18 Apr 2010 at 1:18

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry, I'm not a native english speaker, so please let me explain what I meant 
by the 
"theory lurking". In short, to create the "boost" function we have to know what 
the 
"best" result is. Of course, the best result is the one which the user is 
currently 
searching for :-) But this is very subjective and unmeasurable. Thus, the boost 
function is not going to work for all searches. Thus, we have to search for an 
approximate function, which provides best results for most searches. I would 
say that 
the function should depend on two things (let's forget the "common" flag for 
now): 
word length and word commonality. But before we start hastily guessing 
constants, 
isn't it better first to exactly define several criteria for the function? 
(e.g. if I 
try to search for "expression" the term "facial expression" must be among first 
10 
results). Then, we can perhaps compute the function constants.
It isn't easy to define the criteria though - there may be different 
methodologies 
(e.g. let's take a first page of japanese newspapers. Searching for english 
translation of all words should return the original japanese words amongst 
first 10 
results - hey, this might actually work! :-) ). To compare all these 
methodologies - 
this is something worth diploma thesis I guess :-). I'll try to ask a friend of 
mine, 
whether he knows about some research in this field.
So, what I'm trying to say is, that from my point of view we cannot simply 
guess 
constants and change (perhaps break) current algorithm, because a lot of users 
seems 
to be satisfied with it.
Which dictionary is missing the (P) flag?

Original comment by martin.v...@gmail.com on 18 Apr 2010 at 1:38

GoogleCodeExporter commented 9 years ago
Thanks for investing time in this issue and pointing out the boost problem. A 
lot of
results in your example have the same score, so you are probably right that too 
much
precision is lost due to lucene only investing one byte for storing the boost
together with other criteria.

Okay, boosting is not sorting and not as exact as sorting, but I would still 
prefer a
boost on occurrences in tanaka compared to sorting by length (that's what is
basically left if there is no common word marker (P) as there is not in the 
edict
version of the german-japanese wadoku dictionary) and I would also prefer to let
lucene do the ranking work.

You are probably also right that there are already a lot of thoughts/theories 
on this
issue and that we should use that. I only use "gjiten" beside aedict and 
(although
being a great application otherwise) it sorts by alphabet which is not what I 
was
looking for. I haven't found source code for kotoba, is it open source at all ?

I do not mind building my own dictionaries and playing around with strange 
boosts but
I would love to be able to use them with an unmodified aedict. Do you think it 
would
be an option to be able to turn off the post-search sorting in the 
configuration of
aedict ? That would give the users great freedom in building own dictionaries 
and
influence the lucene ranking at indexing time in what ever way they like.

Original comment by bastian.mathes on 20 Apr 2010 at 10:33

GoogleCodeExporter commented 9 years ago
As for the kotoba dictionary, I'm not sure whether it is open-source. I even 
don't 
know whether it's free. So, for now, I'll add a configuration option (for 
experts 
which completely turns off client (post-search) sorting, and will use ordering 
as 
provided by Lucene (the boost setting). If you'll find any good theory or 
boosting 
function, please let me know :)

Original comment by martin.v...@gmail.com on 21 Apr 2010 at 6:40

GoogleCodeExporter commented 9 years ago

Original comment by martin.v...@gmail.com on 25 Apr 2010 at 11:24

GoogleCodeExporter commented 9 years ago
This issue was closed by revision a85067449e.

Original comment by martin.v...@gmail.com on 25 Apr 2010 at 11:31

GoogleCodeExporter commented 9 years ago
Sorry for not reacting so long. Thanks a lot for the change, I am looking 
forward to
it. Currently I cannot build with the usual "mvn install" from the repository
("Project ID: null:aedict-common:jar:null" "Reason: Cannot find parent:
sk.baka.aedict:aedict for project: null:aedict-common:jar:null for project
null:aedict-common:jar:null"), so I will be waiting until I can get 1.2 from the
android market.

Thanks...

Original comment by bastian.mathes on 1 May 2010 at 10:10

GoogleCodeExporter commented 9 years ago
Ooops, I am sorry. Please try now - you will have to skip the tests though as 
they may 
fail now: mvn clean install -DskipTests

Original comment by martin.v...@gmail.com on 2 May 2010 at 10:03

GoogleCodeExporter commented 9 years ago
That works, thanks...

Original comment by bastian.mathes on 3 May 2010 at 5:36