ACTION_SEARCH_JMDICT API: Option to include reference to matched section of query

timrae commented 9 years ago

When an inexact / compound query is searched, it would be useful to be able to match the result back to the query.

Example: If I search for 寿司が食べたい then I'd want to get back 寿司、が、食べたい together with the search results 寿司、が、食べる. The start and stop indices of the match would work as well.

timrae commented 9 years ago

~~Another option could be to just include the verb inflections separately like kuromoji does. For example 寿司が食べたい。 returns:~~

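~~Surface form | Part-of-Speech | Base form | Reading | Pronunciation~~
~~寿司 | 名詞,一般,\*,\* | 寿司 | スシ | スシ~~
~~が | 助詞,格助詞,一般,\* | が | ガ | ガ~~
~~食べ | 動詞,自立,\*,\* | 食べる | タベ | タベ~~
~~たい | 助動詞,\*,\*,\* | たい | タイ | タイ~~
~~。 | 記号,句点,\*,\* | 。 | 。 | 。~~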

timrae commented 9 years ago

After actually spending quite a bit of time working with kuromoji and furigana, I've found that the way they do it makes reassembling the words a bit of a pain, so please just ignore my last comment. In the end, I think giving two indices (start, stop) for each entry will be the most convenient for me:

e.g. the following sentence 寿司が食べたい。 should return: 寿司 (0, 2) が (2, 3) 食べる (3, 7)

Do you think this is something you could add to the API relatively quickly? It's kind of crucial for me to proceed with my application.

mvysny commented 9 years ago

Hi Tim, thanks for the feature request. I will revisit the API and let you know. Don't hold your breath though, this will be done next week at the earliest... sorry.

mvysny commented 9 years ago

Just to confirm: you use the ACTION_SEARCH_JMDICT api with "kanjis" set to "寿司が食べたい" and "return_results" set to true. Is it okay if in the resulting list of maps, each map would contain e.g. "origin_range" with the format of 0,2 (that is, start index, end index, no braces)?

timrae commented 9 years ago

Yes, great!

mvysny commented 9 years ago

Fixed in Aedict 3.19

mvysny commented 9 years ago

The key will be called "position_in_sentence".

mvysny commented 9 years ago

Tim, can you please share a link to your application if it is on Google Play? I'm quite interested.

timrae commented 9 years ago

@mvysny It's not on Google Play yet, I mainly just made it for myself to be honest, but OK I'll try and upload it sometime this week.

mvysny commented 9 years ago

If you do not wish to go public, no problem. In that case, if it's okay with you, you can just send me the APK via e-mail. Thanks!

timrae commented 9 years ago

Probably I'll make it available via the beta testing facilities on Google Play, just give me a few days.

timrae commented 9 years ago

This doesn't appear to be working (I'm currently using a different analysis engine because I required this PR to proceed with Aedict)... I just sent the following query taken from Wikipedia:

漢字（かんじ）は、古代中国に発祥を持つ文字。古代において中国から日本、朝鮮、ベトナムなど周辺諸国にも伝わり、その形態・機能を利用して日本語など各地の言語の表記にも使われている（ただし、現在は漢字表記を廃している言語もある。日本の漢字については日本における漢字を参照）。

The first hit has "position_in_sentence" -> "127,2" which is obviously wrong... others were wrong as well

I also tried a simpler search, 漢字難しいよ, and got back:

("kanji" -> "漢字", "position_in_sentence" -> "0,2")
("kanji" -> "難しい, 六借しい, 六ヶ敷い", "position_in_sentence" -> "2,3")
("kanji" -> "よ", "position_in_sentence" -> "5,1")

Whereas I'd expect to get back: "0,2" "2,5" "5,6"

mvysny commented 9 years ago

The position of "127,2" is obviously wrong, I'll look at it. Regarding the simpler search: the second number is actually the length of the matched string, not the end index, so if you convert 5,1 into the start,end notation you get 5,6.
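For anyone consuming this value, the start,length to start,end conversion described above can be sketched roughly as follows. This is only an illustration: `SentencePosition`, `parse` and `surfaceIn` are made-up names and not part of the Aedict API; the only thing taken from this thread is that "position_in_sentence" holds a string of the form start,length.

```java
// Hypothetical helper, NOT part of the Aedict API.
// Assumes "position_in_sentence" is formatted as "start,length".
final class SentencePosition {
    final int start; // inclusive index into the query string
    final int end;   // exclusive index into the query string

    private SentencePosition(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Parses "start,length" and converts it to start/end indices.
    static SentencePosition parse(String value) {
        String[] parts = value.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                    "Expected \"start,length\" but got: " + value);
        }
        int start = Integer.parseInt(parts[0].trim());
        int length = Integer.parseInt(parts[1].trim());
        return new SentencePosition(start, start + length);
    }

    // Returns the section of the original query that this entry matched.
    String surfaceIn(String query) {
        return query.substring(start, end);
    }
}
```

With the simpler query from above, `SentencePosition.parse("2,3").surfaceIn("漢字難しいよ")` would give back the matched section 難しい. Note that Java string indices count UTF-16 code units, which lines up with the examples in this thread but is an assumption for characters outside the Basic Multilingual Plane.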

timrae commented 9 years ago

Ah! Thanks!!

mvysny commented 9 years ago

Hmm, I just tried the long sentence from the wiki and the analyzer got it right: the first hit was 漢字: かんじ with the range of 0,2... Can you please let me know which word had the position of 127,2 (which is 127,129 in the start,end notation)?

timrae commented 9 years ago

You can see here in the debugger... Item 0 in the list of results from Aedict has the indices 127,2

timrae commented 9 years ago

Here is the exact string getting sent through the (sk.baka.aedict3.action.ACTION_SEARCH_JMDICT) intent: 漢字（かんじ）は、古代中国に発祥を持つ文字。古代において中国から日本、朝鮮、ベトナムなど周辺諸国にも伝わり、その形態・機能を利用して日本語など各地の言語の表記にも使われている（ただし、現在は漢字表記を廃している言語もある。日本の漢字については日本における漢字を参照）。

I'm using Aedict v3.25

timrae commented 9 years ago

It seems to be working fine with the _NOUI intent... Do they have different code paths?

mvysny commented 9 years ago

Yes, the _NOUI intent is handled by a different (invisible) Activity, but the search engine should be the same... Let me check the UI version.

mvysny commented 9 years ago

Gotcha. 漢字;かんじ was present multiple times in the sentence; the 127,2 was the last location. Fixed in Aedict 3.26 so that the first 漢字;かんじ will receive the correct location of 0,2 and the last 漢字;かんじ will receive 127,2

timrae commented 9 years ago

Hmmm, why was the NOUI intent returning the correct result though?

mvysny commented 9 years ago

The NOUI intent grabbed the result and fed it directly to the intent. The UI intent grabbed the result, transformed it into a displayable list, displayed it, then transformed it into something that could be exported and fed that into the intent. At some point the transformation used a HashMap to retain the original information, including the original sentence position. Weird, I know, but that's how the implementation currently works :)
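To illustrate how a map-based lookup can produce the 127,2 symptom, here is a minimal sketch. It is a guess at the general mechanism only, not Aedict's actual code: the class and method names are invented, and it assumes the map was keyed by the dictionary entry, so a repeated entry keeps only the position stored last.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT taken from Aedict's source.
final class PositionLookupSketch {
    // Problematic variant: one slot per word, so a later occurrence
    // of the same word overwrites the position of an earlier one.
    static Map<String, Integer> lastPositionByWord(
            List<String> words, List<Integer> starts) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            positions.put(words.get(i), starts.get(i));
        }
        return positions;
    }

    // One possible remedy: keep every position per word,
    // in order of occurrence in the sentence.
    static Map<String, List<Integer>> allPositionsByWord(
            List<String> words, List<Integer> starts) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            positions.computeIfAbsent(words.get(i), k -> new ArrayList<>())
                     .add(starts.get(i));
        }
        return positions;
    }
}
```

Feeding in 漢字 at 0, 文字 at 19 and 漢字 again at 127, the first variant reports 127 for 漢字, while the second keeps both 0 and 127.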

timrae commented 9 years ago

Ah I see, thanks! I'll use the NOUI version, which seems like it should be more reliable in general.

timrae commented 9 years ago

By the way, I can see a ton of messages like "JMDICT: Query jp:WでW produced 1 results (result size was limited to 1)" in logcat...

It's standard practice to refrain from printing anything but the absolutely necessary logs in the released version of an app, as it can slow things down quite a lot. In AnkiDroid we use a library called Timber to disable all except warning and error level logs in the release version. Apparently it's also possible to strip them out automatically with ProGuard.
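As a rough illustration of the level-gating idea (this is neither Timber nor Aedict's code; it uses the JDK's `java.util.logging` as a stand-in, and `ReleaseLogging` and the `releaseBuild` flag are made-up names), a release build could raise the logger threshold so that only warnings and errors get through:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch using java.util.logging instead of Timber.
final class ReleaseLogging {
    // Returns a logger that drops everything below WARNING in release
    // builds and lets every level through otherwise.
    static Logger configure(String name, boolean releaseBuild) {
        Logger logger = Logger.getLogger(name);
        logger.setLevel(releaseBuild ? Level.WARNING : Level.ALL);
        return logger;
    }
}
```

A caller would then write something like `log.fine(() -> "Query produced " + count + " results")`; with the supplier overload the message string is only built when the level is actually enabled, which avoids most of the cost of verbose per-query logging.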

timrae commented 9 years ago

@mvysny You can get an APK for my very simple app here: https://github.com/timrae/rikaidroid/releases

It's using an online engine for the sentence analysis instead of Aedict for performance reasons. However, tapping on any of the analyzed words opens the word in Aedict. If I can get the sentence analysis with Aedict to work better and faster then I'd like to use that instead.

mvysny commented 9 years ago

Well, the sentence analysis is a tedious process and the offline analysis will be slower than online analysis (unless you are running some flagship phone), because of way slower hardware. Just out of curiosity: which online service are you using for the sentence analysis?

mvysny / aedict

ACTION_SEARCH_JMDICT API: Option to include reference to matched section of query #497