ripose-jp / Memento

An mpv-based video player for studying Japanese
https://ripose-jp.github.io/Memento/
GNU General Public License v2.0

Replace mecab with a deconjugation parser and display inflection explanations on definitions #210

Closed · spacehamster closed this 4 months ago

spacehamster commented 5 months ago

Related to https://github.com/ripose-jp/Memento/issues/109

The feature is still in early stages and I'm opening a pull request to discuss it further.

I've created a graph to give a very quick broad overview of how the deconjugator currently functions.

[image: DeconjugationGraph]

It works by trying to find a path from any node to a terminal node (godan verb, ichidan verb, irregular verb, or adjective). This graph only shows the godan verb branch for simplicity.

The rules were added ad-hoc as I encountered examples of them, so they aren't currently comprehensive.

One option moving forward: instead of specifying each transition one by one (such as `{ u8"せる", u8"せられる", WordForm::causative, WordForm::potentialPassive }` for causative < passive), a hidden transition can be added (`{ u8"せる", u8"せる", WordForm::causative, WordForm::ichidanVerb }`) so that any conjugation that can apply to ichidan verbs can also be applied to the causative form. The trade-off is that this allows deconjugation of forms you'd never see in real life, such as 話されさせる (passive < causative), which is in line with how yomichan seems to work.
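For concreteness, here is the rule layout those examples imply, as a sketch (the struct and field names are assumptions based on the entries quoted above):

```cpp
#include <QString>

/* Sketch of a deconjugation rule: replace the `conjugated` ending of a word
 * currently in form `conjugatedType` with `base` to reach form `baseType`. */
struct Rule
{
    QString base;            /* ending of the form deconjugated to */
    QString conjugated;      /* ending of the form being matched */
    WordForm baseType;       /* word form reached by the rule */
    WordForm conjugatedType; /* word form the rule applies to */
};

/* Explicit transition: one rule per reachable combination. */
const Rule causativePassive =
    { u8"せる", u8"せられる", WordForm::causative, WordForm::potentialPassive };

/* Hidden transition: anything that deconjugates to an ichidan verb ending in
 * せる may be relabeled causative, so every ichidan rule applies to the
 * causative form "for free", at the cost of overgenerating unlikely forms. */
const Rule causativeAsIchidan =
    { u8"せる", u8"せる", WordForm::causative, WordForm::ichidanVerb };
```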

It produces redundant queries to the database (e.g. 食べています will query 食べる multiple times, for the derivations "-te < progressive or perfect < polite", "-te", and "masu stem").

This seems like an easy thing to fix by buffering queries and removing duplicates. I already do this when handling MeCab queries, so it wouldn't be a big problem to deal with.

I didn't expect this to be difficult. I haven't given it much consideration because it's low on the priority list and ties into a few other points mentioned later. It seems to be a matter of determining which part of the code is responsible for filtering duplicates: the deconjugator itself, the dictionary before it queries the db, or the dictionary after it receives terms from the db.
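Wherever that responsibility ends up, the buffering idea itself is simple; a minimal sketch (container choices and names are illustrative, not Memento's actual API):

```cpp
#include <QList>
#include <QSet>
#include <QString>

/* Collect the generated queries first, then drop duplicates before hitting
 * the database. The QSet gives O(1) membership tests while the list keeps
 * the order of first occurrence. */
QList<QString> removeDuplicateQueries(const QList<QString> &generated)
{
    QSet<QString> seen;
    QList<QString> unique;
    for (const QString &query : generated)
    {
        if (!seen.contains(query))
        {
            seen.insert(query);
            unique.append(query);
        }
    }
    return unique;
}
```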

Order of results is unintuitive, returning 言う "Potential < Negative" before 言える "Negative" for the query 言えない.

As long as it returns both possible results, either in two separate search results or by showing something like "Potential < Negative or Negative", it's fine.

I picked this example because JMdict has two separate entries for 言う and 言える, so they should be presented as two separate search results. My concern was that the longer dictionary form should be shown first, because it is more likely to be relevant. It might be clearer with the example 成り立ち in the sentence この学校の成り立ちをお話しましょう: the noun form 成り立ち is more relevant, but the verb form 成り立つ is shown first with the masu stem deconjugation. In this example the dictionary forms are the same length, so results of the same length from the ExactWorker should be prioritized over the DeconjugationWorker.
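An illustrative tie-breaker for that ordering (the `Result` type and its fields are hypothetical):

```cpp
#include <QString>

/* Hypothetical result record: just enough fields to express the ordering. */
struct Result
{
    QString dictionaryForm; /* e.g. 言える vs 言う, or 成り立ち vs 成り立つ */
    bool fromExactWorker;   /* true for exact matches, false for deconjugated */
};

/* Prefer longer dictionary forms; on equal length, prefer exact matches over
 * deconjugated ones, per the 成り立ち example above. */
bool moreRelevant(const Result &a, const Result &b)
{
    if (a.dictionaryForm.size() != b.dictionaryForm.size())
    {
        return a.dictionaryForm.size() > b.dictionaryForm.size();
    }
    return a.fromExactWorker && !b.fromExactWorker;
}
```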

Deconjugated queries will match nouns and need to be filtered somehow. Are terms guaranteed to have part-of-speech information attached?

I'd need to see an example of what you're talking about. It would be awkward for conjugation information to show up on a search result for 歩 when the query was 歩けない. It would not be strange for 歩 to show up, though; rather, it's preferable, so long as it appears below 歩く.

If you search しよう (to do, volitional), it will attempt to deconjugate しよう to しる using the volitional rule, and will then return 汁 (juice/sap). This would be a valid result if 汁 were a verb, but it's a noun. Another example is けれど deconjugating to く (imperative), which finds the noun 句. If part-of-speech tags aren't guaranteed, terms without okurigana can be assumed to be nouns and filtered that way.
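A sketch of that fallback heuristic (the function name is hypothetical; it assumes no part-of-speech tags are available):

```cpp
#include <QString>

/* A verb or i-adjective dictionary form always ends in hiragana (る, う, く,
 * い, ...), so a candidate whose deconjugated form does not end in hiragana
 * can be assumed to be a noun and dropped from deconjugated query results. */
bool assumeNoun(const QString &dictionaryForm)
{
    if (dictionaryForm.isEmpty())
    {
        return true;
    }
    const char16_t c = dictionaryForm.back().unicode();
    const bool endsInHiragana = c >= u'ぁ' && c <= u'ゖ';
    return !endsInHiragana;
}
```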

Long inflection explanations are clipped by the Anki buttons (see image). I don't know enough about Qt UI to solve this. Is it possible to put them on a new line only when they're too long, or should they be allowed to clip and have a mouseover tooltip so they can still be read?

I'm going to have you put inflection explanations on their own line between the word and the term tags. Using a label with word wrap enabled should suffice. I'm worried about the color of the text, since it should theme well while still being more contrasty than gray on gray. We can work on all this in the PR though. Dealing with the UI is the easy part.
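Presumably something along these lines (the layout and variable names here are assumptions, not Memento's actual widget code):

```cpp
#include <QLabel>
#include <QString>
#include <QVBoxLayout>

/* Sketch: inside the term widget's setup code. `termLayout` is assumed to be
 * the vertical layout holding the headword, explanation, and term tags. */
void addInflectionLabel(QVBoxLayout *termLayout, const QString &explanation)
{
    QLabel *inflectionLabel = new QLabel(explanation);
    inflectionLabel->setWordWrap(true);
    /* Index 1: directly after the headword, before the term tags. */
    termLayout->insertWidget(1, inflectionLabel);
}
```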

Because inflection explanations are usually less relevant to the user than the English translation, I wanted to minimize the amount of vertical space they took up, but if this is difficult to do with Qt, I don't think it's worth much effort trying.

spacehamster commented 5 months ago

@BigBoyBarney I still don't understand the extent of your suggestion. Do you mean to parse the output of mecab to derive yomichan-style inflection explanations (negative < tara for 温かくなかったら, potential < negative < past for 泳げなかった), or to show the raw mecab tags (基本連用形 < タ系条件形 for 温かくなかったら, 未然形 < タ形 for 泳げなかった)?

just use an appropriate version of Unidic

Unidic is pretty big: ~500 MB zipped, ~1 GB unzipped for the lightweight version.

has none of the issues a western written deinflector would have

I'm not clear on what those issues are.

in addition to being roughly 10 billion times faster

I don't believe performance of either mecab or a deinflector is an issue. I did some very quick benchmarks and they were on the same order of performance, and the dictionary queries as a whole were dwarfed by the Qt UI code, which took roughly 90-99% of the runtime.

Calvin-Xu commented 5 months ago

I don't really have any preferences on this feature as I have been fine without it until now. Just some of my thoughts here:

As I mentioned a long time ago in https://github.com/ripose-jp/Memento/issues/109#issuecomment-1154586922, I don't think it is very productive to try to make MeCab's output match western-style explanations (often known collectively as 日本語教育文法, though it differs between instructional approaches), as MeCab's output is aligned with the more traditional 学校文法 that's taught to native speakers. Here's a table listing some of the differences:

[image: table contrasting 学校文法 and 教育文法 terminology]

excerpt taken from 考えて、解いて、学ぶ 日本語教育の文法 (https://www.3anet.co.jp/np/books/4692/)

I really do think natural language has too many edge cases and no rule-based parser (MeCab included) has been accurate enough for me to have been able to rely on it during learning. If we really add this feature, and don't want to learn school grammar, I think a custom parser would probably be better than trying to shoehorn MeCab output.

However, to be honest I feel the most helpful thing in this day and age is probably to pass the context to some LLM server you have running and ask it to explain the usage of the word being looked up.

Calvin-Xu commented 5 months ago

For a very simple example where MeCab fails:

人に委ねるなってことですか?

MeCab with IPAdic fails to segment the text properly at "人に委ねるな" / "ってことですか" and thinks there's a なる. Bad segmentation from MeCab is really the issue that affects Memento, though the addition of the search/lookup feature ameliorates it: if the word we want to look up cannot be selected, just type it with Ctrl-R.

```
人   名詞,一般,*,*,*,*,人,ヒト,ヒト
に   助詞,格助詞,一般,*,*,*,に,ニ,ニ
委ねる 動詞,自立,*,*,一段,基本形,委ねる,ユダネル,ユダネル
なっ  動詞,自立,*,*,五段・ラ行,連用タ接続,なる,ナッ,ナッ
て   助詞,接続助詞,*,*,*,*,て,テ,テ
こと  名詞,非自立,一般,*,*,*,こと,コト,コト
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
か   助詞,副助詞/並立助詞/終助詞,*,*,*,*,か,カ,カ
?   記号,一般,*,*,*,*,?,?,?
EOS
```

With Unidic it works, and it is probably the best thing for inflections, but I also understand the reasons this project doesn't want to use it.

```
人   ヒト  ヒト  人   名詞-普通名詞-一般
に   ニ   ニ   に   助詞-格助詞
委ねる ユダネル    ユダネル    委ねる 動詞-一般   下一段-ナ行  終止形-一般
な   ナ   ナ   な   助詞-終助詞
って  ッテ  ッテ  って  助詞-副助詞
こと  コト  コト  事   名詞-普通名詞-一般
です  デス  デス  です  助動詞 助動詞-デス  終止形-一般
か   カ   カ   か   助詞-終助詞
?           ?   補助記号-句点
+           +   補助記号-一般
EOS
```
BigBoyBarney commented 5 months ago

I don't think it is very productive to try to make MeCab's output match western style explanations [...] and don't want to learn school grammar

That was my initial point. Western grammar is not, and will not be, applicable to Japanese to a satisfactory extent. I don't like Yomi's deinflection explanations either, so I was not suggesting implementing the exact same functionality with MeCab. This misunderstanding was possibly a mistake on my part; I should have clarified my point more.

no rule-based parser (MeCab included) has been accurate enough for me to have been able to rely on it during learning. [...] with Unidic it works

Unidic, given the appropriate dictionary (現代書き言葉, 現代話し言葉 or a version of 古文用 if you're reading classical Japanese), always works. It might not be the first entry that's applicable to the given context, but if you use a -N10 or similar flag to capture more outputs, I guarantee that the answer will be there.


I'm not clear on what those issues are.

My main gripe with it is that it's not how the language works. I would much prefer simply tabulating the MeCab output directly in a dropdown of some sort, so it can be viewed from within Memento. I'm already using it in a separate window while watching or reading, so this would streamline my workflow. However, I understand that a lot of people, allegedly the majority who use Memento, would prefer direct western grammar equivalents.

I only ask that if this ends up being merged, please keep the current MeCab parser as a togglable option as well.

Calvin-Xu commented 5 months ago

Ultimately, prescriptive grammars are academic constructs, and I personally do not believe it is necessary to learn either system to be able to accurately understand contemporary Japanese, given that one consumes enough content; that, after all, is the postulate of language acquisition through mass immersion.

I must say the case for 学校文法, besides the fact that most academic texts in Japanese use its terms, is that it is indispensable for analyzing classical Japanese, whereas 教育文法 is fitted only to contemporary Japanese and does not generalize. Though I doubt that is the main use case for Memento for most users. @BigBoyBarney If you have a workflow and materials for learning classical Japanese in Memento ([lesson?] videos and accurate transcriptions / subtitles), please send me an email; I'd be very interested.

In that case again Memento could use better segmentation or just custom selection of text to look up, but that is a different matter.

Calvin-Xu commented 5 months ago

I only ask that if this ends up being merged, please keep the current MeCab parser as a togglable option as well.

I don't think there's a current MeCab parser option. Unless I've lost track of recent development, Memento uses MeCab to do segmentation only, to get the substring to look up when you mouse over it. I also hope this can still be an option.

This PR adds a "Deconjugator" which Memento never had before. @spacehamster actually how do you mean replacing MeCab with this? I assume you are not removing it. Are you giving progressively longer inputs to your deconjugator?

spacehamster commented 5 months ago

This PR adds a "Deconjugator" which Memento never had before. @spacehamster actually how do you mean replacing MeCab with this? I assume you are not removing it. Are you giving progressively longer inputs to your deconjugator?

The initial idea was to remove mecab completely. The code that uses mecab is currently commented out in the PR. Mecab is only used to find possible deconjugated forms in a sentence: if you give it 行かれる途中でしたか?, mecab will generate 行く, 行かれる途中です, 行かれる途中でる, 行かれる途中だ, and 行かる, which are then used to search the dictionary. It's more like partial segmentation, because Memento relies on the user's cursor position to find the start of a word.

The deconjugator does the same thing, but generates inflection explanations while it's at it.

Are you giving progressively longer inputs to your deconjugator?

Yes, for the sentence 行かれる途中でしたか? it will generate the queries

```
行かれる途中でしたかる
行かれる途中でしたる
行かれる途中でしる
行かれる途中でする
行かれる途中です
行かれる途中でる
行かれる途中る
行かれる途る
行かれるる
行かる
行く
行かれる
行る
```
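This is the chop-one-character strategy; a sketch consistent with the deconjugator's sentence mode (the wrapper function here is illustrative):

```cpp
#include <QList>
#include <QString>

/* Assumes the PR's ConjugationInfo struct and deconjugateRecursive() from
 * deconjugator.cpp. Deconjugate the full string, then every prefix of it. */
QList<ConjugationInfo> deconjugateSentence(const QString &query)
{
    QList<ConjugationInfo> results;
    QString word = query;
    while (!word.isEmpty())
    {
        ConjugationInfo detail = { word, word, QList<WordForm>(), "" };
        deconjugateRecursive(detail, results);
        word.chop(1); /* drop the last character and try again */
    }
    return results;
}
```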

I've been thinking about it more, and I think it may be a good idea to leave mecab in and have an option to choose one method or the other. Classical Japanese is a bit of a niche example, but I think a more likely scenario is obscure grammar points and dialects like kansai-ben; yomichan cannot parse stuff like 戦わざるをえない or 行かへん, which mecab can manage.

Calvin-Xu commented 5 months ago

@spacehamster I think this sounds good. I personally won't be needing it, to save some cycles given how intensive Qt + mpv already is. I do wonder about the performance when starting to parse from the leftmost position. Though the length of a line should be bounded, my dictionary db is almost 3 GB.

More than anything I think we should sit back and make sense of the terminology in this discussion. First, MeCab is a tokenizer and morphological analyzer. If I understand correctly, your "deconjugator" is also a morphological analyzer that needs the stem/root form of the word.

You still use MeCab's morphological analyzer to get the stem, it's just you want to generate the western style explanation instead of using MeCab's output. Can you also clarify if/how you are using MeCab's tokenization?

I think we should clearly define things and figure out consistent terminology to use, then decide the direction we should take.

spacehamster commented 5 months ago

@Calvin-Xu I'm primarily motivated by seeing inflection explanations implemented. I used a deconjugator approach because:

If I understand correctly, your "deconjugator" is also a morphological analyzer that needs the stem/root form of the word.

I'm interpreting stem/root form to mean the plain/dictionary form, such as 食べる and not 食べ. The deconjugator doesn't take the dictionary form as input but emits it as output, so it replaces mecab's functionality and doesn't use mecab at all.

A pseudocode simplification of how searchTerms in dictionary.cpp currently works:

```js
function searchTerms(query, subtitle)
{
    //query is substring of subtitle, in this example
    //query=行かれる途中でしたか?
    //subtitle=王立学園に行かれる途中でしたか?
    let terms = [];
    for(let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc
        terms += db.search(query.substring(0, i))
    }
    let mecabQueries = generateMecab(query);
    //generateMecab produces a list of surface and deconj forms using mecab
    //only deconj is used for searching the db, surface is used for highlighting the subtitle and clozebody
    //[{ deconj: "行く", surface: "行か" },
    // { deconj: "行かれる途中です", surface:"行かれる途中でし" },
    //...snip
    // { deconj: "行かる", surface:"行かれ" }]
    for(let mecabQuery of mecabQueries)
    {
        terms += db.search(mecabQuery);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let popup widget know there are new terms
}
```

and the proposed change:

```js
function searchTerms(query, subtitle)
{
    //query is substring of subtitle, in this example
    //query=行かれる途中でしたか?
    //subtitle=王立学園に行かれる途中でしたか?
    let terms = [];
    for(let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc
        terms += db.search(query.substring(0, i))
    }
    let deconjugatorQueries = deconjugate(query);
    //deconjugate produces a list of deconj forms, surface forms, and inflection explanations
    //without using mecab
    //only deconj is used for searching the db, surface is used for highlighting the subtitle and clozebody
    //[{ deconj: "行かれる途中でしたかる", surface: "行かれる途中でしたか", explanation: "masu stem" },
    // { deconj: "行かれる途中でしたる", surface: "行かれる途中でした", explanation: "masu stem" },
    //...snip
    // { deconj: "行く", surface: "行かれる", explanation: "passive" },
    // { deconj: "行る", surface: "行", explanation: "masu stem" },
    // { deconj: "行かる", surface: "行かれ", explanation: "imperative" } ]
    for(let deconjugatorQuery of deconjugatorQueries)
    {
        terms += db.search(deconjugatorQuery);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let popup widget know there are new terms
}
```

I do wonder about the performance when starting to parse from the leftmost position.

Are you imagining the search is done progressively, showing the user results one by one as they are found? The current implementation (both mecab and deconjugator) searches the database in one go and collects all the definitions before showing them to the user.

Though the length of a line should be bounded, my dictionary db is almost 3 GB.

Searches triggered by mousing over subtitles are currently bounded at a max length of 37 characters (for both mecab and the deconjugator). I don't believe the search widget has a bound.

I did some quick performance logging for what I believe is a typical case, which was enough to convince me that performance was not a major concern, as the Qt UI was the main bottleneck. I have not looked at pathological worst cases; my sqlite dictionary is 600 MB, so things may get worse at 3 GB with very large inputs.

mecab_perf_log.txt

deconjugator_perf_log.txt

I think it would be a good idea to have a global settings flag to switch between the original method and the new method, because there are cases where a deconjugator will fail, like kansai-ben 行かへん. As a bonus, it would also allow comparing performance side by side.

Calvin-Xu commented 5 months ago

Great. Let's just wait for what @ripose-jp thinks. I think this accomplishes the goal of having the feature Yomichan/Yomitan has. Given the renewed interest in UniDic, I have opened another issue on supporting it https://github.com/ripose-jp/Memento/issues/211

ripose-jp commented 5 months ago

I'm not interested in discussing the merits of 教育文法 vs 学校文法 because I am unqualified to do so. I read Tae Kim's Grammar Guide and have mostly just learned by feel since. That said, if this works well, it will replace MeCab as the default. This style of deconjugation is what most Memento users will likely want as it is more consistent with the way most people learn Japanese as a second language. I'll probably make enabling MeCab support a compile time option so I can drop the dependency on the binaries I distribute. I promise that MeCab support will remain for those that want it.

To achieve this multiple-deconjugation approach, I believe it's probably for the best that I rework how searching currently works. Right now it's a bit of a rat's nest that hard-codes a start-to-finish algorithm. I think a pipeline approach would be best. It would look something like:

Query Generator → Query Merge → Duplicate Filter → Database Query → Results

The query generation step would be the most relevant to this PR. I'll probably make an interface class that query generators can implement. Then I'll rework the code so existing generators are implemented in something like ExactGenerator, MeCabGenerator, and MultiGenerator. This way query generators are much more modular and don't require so much additional code to implement. This PR would then be mostly contained to creating a DeconjugationGenerator, save for a few potential changes to the Database Query step to fix some problems with the PR. This also opens the door to other generators, such as a UniDic generator, in the future. I don't expect you to implement this yourself, so I ask that you hold off sweating the details of how your code fits into Memento at the moment. I'll try to get this done this weekend.
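As a sketch of the shape such an interface might take (the exact signature is an assumption, not the final design):

```cpp
#include <vector>

#include <QString>

#include "searchquery.h"

/* Hypothetical interface: each query generator turns raw subtitle text into
 * database search queries; the pipeline then merges results and filters
 * duplicates. */
class QueryGenerator
{
public:
    virtual ~QueryGenerator() = default;

    /* Generate search queries (and any deconjugation metadata) for text. */
    virtual std::vector<SearchQuery> generateQueries(
        const QString &text) const = 0;
};
```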

the noun form 成り立ち is more relevant, but the verb form 成り立つ is shown first with the masu stem deconjugation

I agree. There is probably something that can be changed in the sorting algorithm to make this happen. I don't expect it to be too difficult.

If you search しよう (to do, volitional), it will attempt to deconjugate しよう to しる using the volitional rule, and will then return 汁 (juice/sap). This would be a valid result if 汁 were a verb, but it's a noun. Another example is けれど deconjugating to く (imperative), which finds the noun 句. If part-of-speech tags aren't guaranteed, terms without okurigana can be assumed to be nouns and filtered that way.

Yomichan does a fairly poor job of finding しよう as する, considering it doesn't even rank. I don't think this is such a big deal, considering that しよう is a fairly common word that happens to be an exception to standard rules, so most people will learn it on its own rather than in the greater grammatical framework. Likewise, most dictionaries will have an entry for けれど, so as long as that ranks above any irrelevant results, I believe it's not a big issue. Of course, try to fix these things if you can, but don't sweat them if you can't.

Because inflection explanations are usually less relevant to the user than the English translation, I wanted to minimize the amount of vertical space they took up, but if this is difficult to do with Qt, I don't think it's worth much effort trying.

Your deconjugator is providing valuable information, so the vertical space is well warranted.

Overall, great PR and I have no reason to believe this won't be merged eventually.

spacehamster commented 5 months ago

I've pushed the current state, which has some very minor changes; I've been holding off on making any big changes. I've added some more deconjugation forms. I added a quick and easy query filter because the duplicates were annoying me. I also added a config flag that allows choosing between mecab and the deconjugator. I don't think this is a great option for normal users, because it requires technical understanding of the difference between parsing methods to know what it does, but I found it useful for switching between the two quickly while testing.

ripose-jp commented 5 months ago

I've pushed my update that makes adding new deconjugators much easier. It does all the work of merging results and filtering duplicates later in the pipeline, so you just need to focus on outputting queries. Most of your code should be moved into a class that inherits from QueryGenerator. You may need to modify SearchQuery to include some additional metadata about your deconjugations, but I'll leave the details up to you. I've also made compiling with MeCab an option, so you don't have to worry about hacking it out anymore.
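For reference, a minimal sketch of what moving the deconjugator behind that interface might look like, using the SearchQuery fields that appear later in this thread (the include names and exact signatures are assumptions):

```cpp
#include <vector>

#include <QString>

#include "deconjugator.h"
#include "querygenerator.h"

/* Sketch only: wrap the deconjugator as a query generator, turning each
 * deconjugation result into one SearchQuery that carries its human-readable
 * explanation. */
class DeconjugationQueryGenerator : public QueryGenerator
{
public:
    std::vector<SearchQuery> generateQueries(
        const QString &text) const override
    {
        std::vector<SearchQuery> queries;
        for (const ConjugationInfo &info : deconjugate(text))
        {
            queries.push_back(SearchQuery{
                info.base,              /* deconjugated form to look up */
                info.conjugated,        /* surface form for highlighting */
                {},                     /* part-of-speech rule filter */
                info.derivationDisplay  /* e.g. "potential < negative" */
            });
        }
        return queries;
    }
};
```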

spacehamster commented 5 months ago

I've pushed the latest changes. I think the output is in a pretty good state now. There's room for refactoring to improve performance and readability; I've only been trying to ensure the output was correct. I think I've got all the common forms now.

Some notes:

ripose-jp commented 5 months ago

Looks pretty good from what I can see. The only issue I found in my brief testing was 足りとる being identified as a masu stem instead of an odd form of 足りている, but that's not common, so it's completely understandable. It even outperforms Yomichan in some places. I found this one particularly impressive.

[image]

I did notice that spacing in some of the layouts was broken though. There should be a bit of space between the buttons, and some space between the deconjugation information and the tags.

In terms of layouts, I did say that the deconjugation information should be on its own line, but now I'm having second thoughts. It seems like the best place to put it would be on the same line as the term tags, probably to the right of them. This is a FlowLayout, so it should handle long deconjugations gracefully.

The last thing is a nitpick, but I find something about the < character aesthetically unpleasing. It blends in too much with the rest of the text to serve as a delimiter. I also don't like how obviously the - character fails to align with the horizontal center of the < character. I'd recommend playing around with Unicode and finding something to replace <. Here are some options to get you started: ← ‹ < ↤ ↢ ⇐ ⇦ ⟸ ⟵.

I wasn't sure how you wanted to decide between parsers when the mecab compile flag is enabled; a UI-controllable flag seems like it'd make the most sense?

Don't worry about this for your PR. I'm probably going to have a set of checkboxes that will allow the user to enable/disable their choice of query generators.

spacehamster commented 5 months ago

I'd not seen the -とる contraction before; I'll add it. There were a few forms I found that yomichan didn't have, like 食べん and 食べなきゃ. Yomichan parses 食べてった as progressive or perfect, which seems to be a bug. I'm not a fan of how 食べていく is deconjugated as "-te < progressive or perfect < masu stem"; it'd make more sense to just show "-te", but I think that'd require hardcoding an exception for that particular form.

The spacing and the '<' looked different on my machine.

[image]

[image]

I think I was trying to get rid of the padding within the labels, as the spacing between the headword and the deconjugation seemed excessive. I'll change the spacing back and see if I can fix the label padding with custom style sheets, or whether the FlowLayout helps. Layout spacing was the wrong place to change anyway.

How is " « " for the delimiter?

[image]

BigBoyBarney commented 5 months ago

I know I was quite opposed to this implementation in the other related PR, but this seems to work a lot better than I had initially thought. My apologies.

-とる is a contraction of -ておる, which is 関西弁 for -ている, and is not standard Japanese. How do you intend to handle such cases? A complete implementation of all 関西弁 variants would be a Herculean task. Could the deinflector possibly label such cases as "unknown"?

spacehamster commented 5 months ago

@BigBoyBarney I thought -ておる was a humble equivalent of -ている in standard Japanese, and the standard form of -ている in 関西弁? I did find some Stack Exchange posts mentioning that it would be weird to hear -とる in modern standard Japanese, but either way, just quickly searching my own media, it seems to pop up a fair bit, particularly in historical fantasy from older characters.

[image]

I think that, in general, including dialectal forms is out of scope; my knowledge of dialectal Japanese is essentially zero, and I wouldn't be likely to use it. I think it would make sense to include exceptions for forms you're likely to see in media targeting standard-Japanese viewers.

A couple strategies I can think of for tackling non-standard Japanese:

BigBoyBarney commented 5 months ago

I thought -ておる was a humble equivalent of -ている

You are correct. おる is 謙譲語 for いる in the standard dialect, and is just a normal いる in a lot of other dialects.

[…] it seems to pop up a fair bit […] I think would make sense to include exceptions for forms you're likely to see in media targeting standard Japanese viewers.

I agree; maybe hardcoding some of the more common forms could provide reasonable coverage? How does Yomitan handle this?

関西弁 is quite common, and there are numerous shows where certain characters speak exclusively in it; however, I think it would be important and nice to have some kind of indicator that a given form is not necessarily standard Japanese.

Rely on mecab parsing. I haven't tried mecab with non-standard Japanese, so I'm not sure how effective this is.

Assuming UniDic, Mecab parses 関西弁 correctly but gives no explanation as to whether it's dialectal or not. For 足りとる it would simply tell you that it's a combination of 足りる and the auxiliary verb とる, which the user's dictionary (for example JMdict) should have an entry for (and generally does), but it's not a straightforward process. So I'm afraid Mecab does not really help in this case.


However, given the choice between (potentially) incorrect grammar information (as it might be the case with Yomi and dialectal Japanese) and none, I personally would prefer the latter. Maybe an indicator for a "guessed" grammar pattern that does not fit the existing inflections could be an option?

spacehamster commented 5 months ago

However, given the choice between (potentially) incorrect grammar information (as it might be the case with Yomi and dialectal Japanese) and none, I personally would prefer the latter. Maybe an indicator for a "guessed" grammar pattern that does not fit the existing inflections could be an option?

I'm not sure I understand what you mean. If you mean distinguishing between the 関西弁, 役割語, and 取る usages of とる, it's not really feasible. Natural language is so context-sensitive that all results shown are guesses. It's not just deconjugation: the only way to know whether 方 is かた or ほう is to understand the whole sentence and determine which makes sense given the context.

If you mean marking potentially ungrammatical deconjugations, like using godan conjugations to find ichidan verbs (e.g. 食べった matching 食べる), it does currently do that. I consider it a bug, so I'll try to fix it so it doesn't happen in the future (it merges the ichidan masu-form query and the godan past-form query into one query; this can be fixed by separating them into two queries, at the risk of sometimes getting duplicate search results).

If you mean something like showing that there are multiple possible deconjugations, like how 聞け could be either imperative or potential < masu stem, that's possible, and yomitan already does it, but I personally feel it's a waste of space: cases where it's helpful are rare, and as I understand it yomitan added the feature mainly to support non-Japanese languages. [image]

If you mean parsing something that is unmistakably dialectal speech, such as kansai-ben 食べへんかった or kagoshima-ben 稼がん/稼だ, I think it makes sense to include common forms like kansai-ben; yomitan already does this. [image]

And for kagoshima-ben you could try searching for the 未然形 form 稼が or the 音便形 form 稼 and using those to do the dictionary lookup, but you wouldn't get any information about what form they really are, and I'm skeptical that it's a good idea. Yomitan cannot parse 稼がん/稼だ.

ripose-jp commented 5 months ago

I think I was trying to get rid of the padding within the labels, as the spacing between the headword and the deconjugation seemed excessive. I'll change the spacing back and see if I can fix the label padding with custom style sheets, or whether the FlowLayout helps. Layout spacing was the wrong place to change anyway.

I'd just try adding the deconjugation label as the last element of m_layoutTermTags. I'd be surprised if you had to do much more than that to make it look good.

How is " « " for the delimiter?

Perfect.


Regarding the whole debate about whether or not to deconjugate とる and other "non-standard" ways of speaking: I think the deconjugator should support as many forms as possible. What's important is that the user understands what's being said, not necessarily the right context in which to use it themselves; there are other resources better suited to that than Memento. For strange ways of speaking, it should be pretty obvious to listeners that a character throwing around わし, とる, じゃ, and the like isn't speaking like a normal person. The likelihood of a JSL learning the language the way an ESL would by reading only Shakespeare is very slim. Frankly, if they do something like that, they get what they deserve.

You have my complete support adding any regional forms you come across in media, so long as you deconjugate them correctly.

spacehamster commented 5 months ago

@ripose-jp I was playing around with different layouts for the explanation label. It doesn't seem like FlowLayout supports text wrapping around objects or splitting across lines, so it doesn't seem like a good fit.

I thought of three ways of doing it:

A) Having a single label for kanji + kana + explanation and styling it with rich text

B) Determining how much screen space the explanation takes up and positioning it based on the explanation pixel width. Can either guess based off character count or get an exact result with QFontMetrics.

C) Use a grid layout.

I don't like option A btw.

[image]

ripose-jp commented 5 months ago

A) Having a single label for kanji + kana + explanation and styling it with rich text

Doesn't work because Qt doesn't support <ruby> tags.

B) Determining how much screen space the explanation takes up and positioning it based on the explanation pixel width. Can either guess based off character count or get an exact result with QFontMetrics.

This would be cool if you could pull it off.

C) Use a grid layout.

I'm okay with the third, but I think the way you've laid it out could be better. Instead of using a grid layout, I think you should stack layouts since it will keep the kana and kanji labels closer together if they are a single element in their own vertical layout. Here's a screenshot of how I might do it.

[image]

For terms with kana and kanji, I think this is a great use of space.

[image]

For terms that are just kana though, I think it would be better if the deconjugation was on the line beneath since the kana will line up with the buttons better and there's less empty space on the top and bottom.

[image]
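In Qt terms, roughly this arrangement (a sketch only; the label variables are assumed to exist in the term widget):

```cpp
#include <QHBoxLayout>
#include <QLabel>
#include <QVBoxLayout>

/* Sketch: kana and kanji share a vertical layout so they stay tightly
 * grouped; that unit then sits beside the deconjugation label on one row. */
QVBoxLayout *wordLayout = new QVBoxLayout;
wordLayout->setSpacing(0);
wordLayout->addWidget(kanaLabel);  /* small reading line on top */
wordLayout->addWidget(kanjiLabel); /* large headword underneath */

QHBoxLayout *rowLayout = new QHBoxLayout;
rowLayout->addLayout(wordLayout);
rowLayout->addWidget(deconjugationLabel, 1); /* stretch into leftover space */
```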

BigBoyBarney commented 5 months ago

Doesn't work because Qt doesn't support <ruby> tags.

Could this potentially be solved with QtWebKit?

ripose-jp commented 5 months ago

I explicitly ban QtWebEngine as a dependency in CONTRIBUTING.md. Pulling in the entirety of Chromium just for <ruby> tags is crazy.

spacehamster commented 5 months ago

To be clear, I wasn't talking about ruby, but something like

```cpp
qlabel.setText(
    "<span style=\"font-size:12px\">かな</span><br>"
    "<span style=\"font-size:18px\">漢字</span>"
    "<span style=\"font-size:12px\">deconjugation « explanation</span>");
```

because it lets deconjugation explanations sit on the same line as the kanji and wrap directly below it. I don't think it's a good fit here.

WebKit UI layouts would be nice, but at that point you may as well use Electron and write the whole thing in JavaScript.

I've changed the way queries are merged to prevent incorrect deconjugation explanations from being shown; this means it will generate multiple queries with the same deconj string.

I've changed the location of the deconj string:

[image]

[image]

There seems to be a bug in how Qt handles word wrap and layouts, which causes extra space between layouts when the window gets too small. This only seems to show up in the search tool because it has a smaller width; setting the deconj label's minimum width to 180px seems to solve it for most reasonable cases, though it'll still pop up in extreme cases.

[image]

An alternative might be putting the main tags on the top row and the deconjugation string underneath.

[image]

I briefly tried to revert def.rules back to a tag list, but that actually broke things, because m_tagCache returns an empty tag and def.rules doesn't get populated.

ripose-jp commented 5 months ago

If you could get your label HTML to work, it wouldn't be a terrible idea. I don't think it will work though since you have to somehow center the kana over the kanji and I don't see a way to do that using the HTML subset supported by Qt.

The two pictures you showed look excellent. I'd be fine merging those as is.

There seems to be a bug in how Qt handles word wrap and layouts, which causes extra space between layouts when the window gets too small. This only seems to show up in the search tool because it has a smaller width; setting the deconj label's minimum width to 180px seems to solve it for most reasonable cases, though it'll still pop up in extreme cases.

I don't find this concerning since 180px width is an extreme case and I don't see the need to make it a great user experience.

Just to note, it might be worth rebasing on master, since there was a bug in the duplicate filtering that was fixed by 923b7d248627fa8cdc47a964d8f59fb155f39a46. I don't want you to think that's your code's fault and have you waste time on it.

ripose-jp commented 4 months ago

I wanted to check in to see if you're ready to unmark this as a draft. I've used it for the past few weeks and functionally it has worked perfectly.

spacehamster commented 4 months ago

Yes. I also haven't noticed any issues so I think it is good enough now.

ripose-jp commented 4 months ago

I've gone through the code and made a few changes related to optimization, style, trailing whitespace, etc. I've included the diff below. It looks like it functions the same as the current code, but I recommend you test it. I agree with you about rules now, so no need to make any changes related to that. Just update the commit with the diff and make sure the message complies with the guidelines in CONTRIBUTING.md. I'll merge it once that's done.

Thanks for all the hard work.

Changes Diff:

```diff
diff --git a/src/dict/databasemanager.cpp b/src/dict/databasemanager.cpp
index 37e35cb..c2808a4 100644
--- a/src/dict/databasemanager.cpp
+++ b/src/dict/databasemanager.cpp
@@ -618,7 +618,13 @@ int DatabaseManager::populateTerms(const QList &terms) const
             (const char *)sqlite3_column_text(stmt, COLUMN_DEF_TAGS),
             def.tags
         );
-        def.rules = (const char *)sqlite3_column_text(stmt, COLUMN_RULES);
+        QStringList rules = QString(
+            (const char *)sqlite3_column_text(stmt, COLUMN_RULES)
+        ).split(' ');
+        def.rules = {
+            std::move_iterator{std::begin(rules)},
+            std::move_iterator{std::end(rules)}
+        };
         term->definitions.append(def);
     }
     if (isStepError(step))
diff --git a/src/dict/deconjugationquerygenerator.cpp b/src/dict/deconjugationquerygenerator.cpp
index af3e2a8..fa90d49 100644
--- a/src/dict/deconjugationquerygenerator.cpp
+++ b/src/dict/deconjugationquerygenerator.cpp
@@ -21,7 +21,6 @@
 #include "deconjugationquerygenerator.h"
 
 #include "deconjugator.h"
 
-#include 
 #include 
 #include "util/utils.h"
@@ -54,34 +53,37 @@ std::vector DeconjugationQueryGenerator::generateQueries(
     {
         return {};
     }
+
     QList deconjQueries = deconjugate(text);
     std::vector result;
     for (ConjugationInfo &info : deconjQueries)
     {
         QString rule = convertWordformToRule(info.derivations[0]);
         auto duplicateIt = std::find_if(
-            result.begin(), 
+            result.begin(),
             result.end(),
-            [=](SearchQuery o) {
-                return o.deconj == info.base &&
-                    (o.ruleFilter.contains(rule) ||
-                    o.conjugationExplanation == info.derivationDisplay);
+            [&] (const SearchQuery &o)
+            {
+                if (o.deconj == info.base)
+                {
+                    return o.ruleFilter.contains(rule) ||
+                        o.conjugationExplanation == info.derivationDisplay;
+                }
+                return false;
             }
         );
 
         if (duplicateIt != result.end())
         {
-            if(!duplicateIt->ruleFilter.contains(rule))
-            {
-                duplicateIt->ruleFilter.push_back(rule);
-            }
+            duplicateIt->ruleFilter.insert(std::move(rule));
        }
         else
         {
-            result.push_back({
-                info.base, 
-                info.conjugated, 
+            result.emplace_back(SearchQuery{
+                info.base,
+                info.conjugated,
                 { rule },
-                info.derivationDisplay });
+                info.derivationDisplay
+            });
         }
     }
diff --git a/src/dict/deconjugator.cpp b/src/dict/deconjugator.cpp
index adcc8e5..6b7b213 100644
--- a/src/dict/deconjugator.cpp
+++ b/src/dict/deconjugator.cpp
@@ -33,7 +33,7 @@ struct Rule
     WordForm conjugatedType;
 };
 
-const QList silentRules =
+static const QList silentRules =
 {
     { u8"ない", u8"ない", WordForm::negative, WordForm::adjective },
     { u8"たい", u8"たい", WordForm::tai, WordForm::adjective },
@@ -50,7 +50,7 @@ const QList silentRules =
     { u8"とく", u8"とく", WordForm::toku, WordForm::godanVerb },
 };
 
-const QList rules = {
+static const QList rules = {
     //Negative
     { u8"る", u8"らない", WordForm::godanVerb, WordForm::negative },
     { u8"う", u8"わない", WordForm::godanVerb, WordForm::negative },
@@ -556,16 +556,16 @@ static bool isTerminalForm(WordForm wordForm)
 }
 
 static ConjugationInfo createDerivation(
-    const ConjugationInfo &parent, 
+    const ConjugationInfo &parent,
     const Rule &rule)
 {
     QList childDerivations(parent.derivations);
     if (childDerivations.isEmpty())
     {
-        childDerivations.insert(childDerivations.begin(), rule.conjugatedType);
+        childDerivations.prepend(rule.conjugatedType);
     }
-    childDerivations.insert(childDerivations.begin(), rule.baseType);
-    int replacementStart = parent.base.size() - rule.conjugated.size();
+    childDerivations.prepend(rule.baseType);
+    qsizetype replacementStart = parent.base.size() - rule.conjugated.size();
     QString childWord = QString(parent.base)
         .replace(replacementStart, rule.conjugated.size(), rule.base);
     ConjugationInfo childDetails = {
@@ -584,7 +584,7 @@ static void deconjugateRecursive(
     const QString word = info.base;
     for (const Rule &rule : rules)
     {
-        WordForm currentWordForm = info.derivations.size() == 0 ?
+        WordForm currentWordForm = info.derivations.empty() ?
             WordForm::any : info.derivations[0];
 
         if (rule.conjugatedType != currentWordForm &&
@@ -599,19 +599,27 @@ static void deconjugateRecursive(
         ConjugationInfo childDetails = createDerivation(info, rule);
         if (isTerminalForm(rule.baseType))
         {
-            results.push_back(childDetails);
+            results.emplace_back(childDetails);
             for (const Rule &silentRule : silentRules)
             {
-                if (silentRule.conjugatedType != rule.baseType) continue;
-                if (!childDetails.base.endsWith(silentRule.base)) continue;
-                Rule derivedRule = {
-                    rule.base,
-                    rule.conjugated,
-                    silentRule.baseType,
-                    rule.conjugatedType };
+                if (silentRule.conjugatedType != rule.baseType)
+                {
+                    continue;
+                }
+                if (!childDetails.base.endsWith(silentRule.base))
+                {
+                    continue;
+                }
+                Rule derivedRule = {
+                    rule.base,
+                    rule.conjugated,
+                    silentRule.baseType,
+                    rule.conjugatedType
+                };
                 ConjugationInfo derivedDetails = createDerivation(
-                    info, 
-                    derivedRule);
+                    info,
+                    derivedRule
+                );
                 deconjugateRecursive(derivedDetails, results);
             }
         }
@@ -627,19 +635,25 @@ static QString formatDerivation(QList derivations)
     QString result;
     QList displayRules;
     std::copy_if(
-        derivations.begin(), 
-        derivations.end(), 
-        std::back_inserter(displayRules), 
-        [] (WordForm ruleType) {
-            if (ruleType == WordForm::conjunctive) return false;
-            if (isTerminalForm(ruleType)) return false;
+        derivations.begin(),
+        derivations.end(),
+        std::back_inserter(displayRules),
+        [] (WordForm ruleType)
+        {
+            if (ruleType == WordForm::conjunctive)
+            {
+                return false;
+            }
+            else if (isTerminalForm(ruleType))
+            {
+                return false;
+            }
             return true;
         }
     );
-    if (derivations.size() > 0 &&
-        derivations[derivations.size() - 1] == WordForm::conjunctive)
+    if (derivations.size() > 0 && derivations.back() == WordForm::conjunctive)
     {
-        displayRules.push_back(WordForm::conjunctive);
+        displayRules.emplace_back(WordForm::conjunctive);
     }
 
     for (int i = 0; i < displayRules.size(); i++)
@@ -659,14 +673,14 @@ QList deconjugate(const QString query, bool sentenceMode)
     if (sentenceMode)
     {
         QString word = query;
-        while(!word.isEmpty())
+        while (!word.isEmpty())
         {
             ConjugationInfo detail = { word, word, QList(), "" };
             deconjugateRecursive(detail, results);
             word.chop(1);
-        } 
+        }
     }
-    else 
+    else
     {
         ConjugationInfo detail = { query, query, QList(), ""};
         deconjugateRecursive(detail, results);
@@ -678,4 +692,4 @@ QList deconjugate(const QString query, bool sentenceMode)
     }
 
     return results;
-}
\ No newline at end of file
+}
diff --git a/src/dict/deconjugator.h b/src/dict/deconjugator.h
index 335310d..4d3f4bd 100644
--- a/src/dict/deconjugator.h
+++ b/src/dict/deconjugator.h
@@ -24,7 +24,8 @@
 #include 
 #include 
 
-enum class WordForm {
+enum class WordForm
+{
     godanVerb,
     ichidanVerb,
     suruVerb,
@@ -73,15 +74,18 @@ enum class WordForm {
 /**
  * A struct that contains the results of a deconjugation
  */
-struct ConjugationInfo {
+struct ConjugationInfo
+{
     /* Plain form of a word. */
     QString base;
+
     /* The original conjugated form. */
     QString conjugated;
+
     /* A list of conjugations that describe the relationship
-     * between base and conjugated.
-     */
+     * between base and conjugated. */
     QList derivations;
+
     /* A human readable format of the derivations. */
     QString derivationDisplay;
 };
@@ -93,6 +97,7 @@ struct ConjugationInfo {
 * find potential words by trimming the query
 * @return A list of all the potential deconjugations found
 */
-QList deconjugate(const QString query, bool sentenceMode=true);
+QList deconjugate(
+    const QString query, bool sentenceMode = true);
 
-#endif // DECONJUGATOR_H
\ No newline at end of file
+#endif // DECONJUGATOR_H
diff --git a/src/dict/dictionary.cpp b/src/dict/dictionary.cpp
index 4f02d8c..07afb15 100644
--- a/src/dict/dictionary.cpp
+++ b/src/dict/dictionary.cpp
@@ -131,7 +131,7 @@ SharedTermList Dictionary::searchTerms(
     }
     sortQueries(queries);
-    //filterDuplicates(queries);
+    filterDuplicates(queries);
     if (index != *currentIndex)
     {
         return nullptr;
     }
@@ -155,26 +155,22 @@ SharedTermList Dictionary::searchTerms(
         }
         if (query.ruleFilter.size() > 0)
         {
-            results.erase(std::remove_if(
-                results.begin(),
-                results.end(),
-                [=](SharedTerm val) {
-                    for (const TermDefinition &def : val->definitions)
+            results.erase(
+                std::remove_if(
+                    results.begin(),
+                    results.end(),
+                    [&] (const SharedTerm &val)
                     {
-                        for (const QString &rule : query.ruleFilter)
+                        for (const TermDefinition &def : val->definitions)
                         {
-                            //Should technically call
-                            //def.rules.split(" ").contains(rule)
-                            //but it's not required because no rules are
-                            //substrings of other rules
-                            if(def.rules.contains(rule))
+                            if (def.rules.intersects(query.ruleFilter))
                             {
                                 return false;
                             }
                         }
+                        return true;
                     }
-                    return true;
-                }),
+                ),
                 results.end()
             );
         }
diff --git a/src/dict/expression.h b/src/dict/expression.h
index a0396ac..04e03f4 100644
--- a/src/dict/expression.h
+++ b/src/dict/expression.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -101,7 +102,7 @@ struct TermDefinition
     QList tags;
 
     /* A list of the rules associated with this entry. */
-    QString rules;
+    QSet rules;
 
     /* A list of glossary entries for this definition. */
     QJsonArray glossary;
diff --git a/src/dict/searchquery.h b/src/dict/searchquery.h
index 5fe4f7d..32cd3bf 100644
--- a/src/dict/searchquery.h
+++ b/src/dict/searchquery.h
@@ -21,8 +21,8 @@
 #ifndef SEARCHQUERY_H
 #define SEARCHQUERY_H
 
+#include 
 #include 
-#include 
 
 /**
  * A pair to search for. The deconjugated string is used for querying the
@@ -36,11 +36,11 @@ struct SearchQuery
 
     /* The raw conjugated string */
     QString surface;
 
-    /* Filter results based on part of speach */
-    QStringList ruleFilter;
+    /* Filter results based on part of speech */
+    QSet ruleFilter;
 
-    /* The conjugation explaination of a term.
-       Usually empty if the term was not conjugated. */
+    /* The conjugation explanation of a term. Usually empty if the term was not
+     * conjugated. */
     QString conjugationExplanation;
 };
```
ripose-jp commented 4 months ago

I wanted to check in to see if you missed my last message. I've been using my patch and it works perfectly fine. I don't mind modifying the commits myself if you're busy. I want to get this merged within the next couple days.

spacehamster commented 4 months ago

Sorry for taking so long. I did see the previous message; I just had trouble making time to test the patch. I didn't find any issues when testing, and I've pushed the change. I likely won't be able to make further changes in a timely manner for a while, so if there are any issues, feel free to make changes yourself.

ripose-jp commented 4 months ago

No worries. Thanks for everything you've done so far. This will help a lot of people.