javdejong / nhk-pronunciation

Anki2 Add-On to look-up the pronunciation of Japanese expressions.

Fixed issues w/working on multiple words #12

Closed: SpongebobSquamirez closed this 5 years ago

SpongebobSquamirez commented 6 years ago

There are 2 main issues I've found with the pronunciation add-on, but I've found a way to fix them with some tweaks.

1) The card type name itself needs to contain the word "japanese" somewhere.

2) The expression has to be a single word, which in practice usually means no kana attached (especially no grammatical elements like particles). So things like these won't work: 後で; 川上・川下 or other fields with punctuation; whole sentences; and entries with readings attached (because of the extra characters), e.g. 後れた【おくれた】.

I've fixed the requirement of a Japanese card type (just commented that stuff out), the unnecessary-kana problem (for most use cases), the punctuation/separators problem, and the problem of multiple words strung together. This fix obviously focuses on individual words rather than prosody (which you should use OJAD/suzuki-kun for). I made it try an English/Japanese punctuation split first. If this fails, it uses the Japanese Support plugin's Mecab-type sentence splitting (which will correctly look up words even if they are separated by kana).

Sentence-type splitting occasionally splits words too much, like 歩行者 becoming 歩行 and 者, so direct lookups with simple punctuation splits are preferable when possible.
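
Roughly, the logic is the following (a minimal sketch, not the exact code in this branch; the punctuation set and the lookup/mecab_split hooks are placeholders):

import re

# Characters treated as separators for the first-pass split (illustrative set).
PUNCTUATION = re.compile(r"[\s,.;:/()、。，．・／（）【】；：]+")

def split_expression(expr, lookup, mecab_split=None):
    # First pass: split on English/Japanese punctuation and whitespace.
    parts = [p for p in PUNCTUATION.split(expr) if p]
    # Fallback: if some part has no dictionary entry, let a MeCab-style
    # splitter (e.g. the Japanese Support plugin's) re-segment the whole
    # expression, so words joined by kana can still be looked up.
    if mecab_split and not all(lookup(p) for p in parts):
        return mecab_split(expr)
    return parts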

I may add some more optimizations later to try to recombine overly split words, or implement some of the things from the issues section, but in any case it's now much more usable than before, and you should no longer have lookup problems as long as your words are in the NHK dictionary used here. These fixes took this add-on from being somewhat helpful when I got lucky to working fabulously for me.

javdejong commented 6 years ago

I think the "noteTypes" stuff has been addressed in the conversion to 2.1, where users can now configure the note type in the config.json (from the Anki UI).

I agree that better pronunciation for sentences or verbs would be very nice. I currently use Anki mainly to cram nouns, and just type the noun first in the "expression" field. Then it generates example sentences and the pronunciation. Then I change the expression to turn it into a short sentence (otherwise it becomes hard to remember in reverse).

You mentioned in one of your commits about having a separate "expression" and "sentence" field, and that it might be useful to generate pronunciations for both. That is definitely something we could/should do, especially if the sentence can be split with mecab (or we use this OJAD tool, if that's even allowed).

SpongebobSquamirez commented 6 years ago

Hey, thanks for making the updates to the 2.1 version. It would definitely be great if we could merge this 2.0 version with the new 2.1 version. People I've been talking to in Anki forums and IRL value the multi-word/sentence splitting more than anything, and there are also quite a few people starting Anki who are using 2.1 and want this. I'll have to re-read my commit, but just to clarify: this version already does sentence/delimiter/Japanese-vs-non-Japanese splitting with Mecab, regex, etc. Regarding OJAD, I may have said some conflicting things in earlier commits (it was a WIP at the time), but the sentence-pronunciation/OJAD comment I made was about prosody and OJAD/conjugation rather than the dictionary-form accents that NHK gives. It would be good to keep those fields separate, since learners can derive prosody from dictionary-form accent if they know all the rules, but not vice versa.

I'm pretty busy during the week, and I'm glad you're going through these issues quickly, so I'll try to follow up properly this weekend; I just can't give this any bandwidth during the weekdays. If I remember correctly, I fixed all the known bugs and documented everything in the latest commit, but if you look at the code there are a couple of old comments suggesting there are bugs (although I think those bugs are gone), an unused function, some commented-out stuff that I just haven't properly deleted, etc. If we could double-check that it's cleaned up first, I think we could go from there to tweak the implementation, add features, and bring this to 2.1. I've been thinking about doing that myself, just following Anki's guidelines, but since I don't use 2.1 I think I'd need to install 2.1 on a VM to test properly, ya? And I just haven't felt like doing that yet. ^_^;

Using OJAD for studying purposes is allowed, and scraping isn't against their robots.txt, but because of the way the site works we would need Selenium to scrape it, and we should throttle the requests; both of those things mean that generation would be relatively slow. The "Awesome text to speech" add-on (https://ankiweb.net/shared/info/301952613) apparently already has good prosody, but since it doesn't display it, it's not as educational as OJAD (which can generate audio as well anyway, making that other add-on obsolete). By the way, in case you're not sure what I'm talking about, these are the pages I mean:

http://www.gavo.t.u-tokyo.ac.jp/ojad/eng/phrasing/index (prosody/sentence-level accent generation)
http://www.gavo.t.u-tokyo.ac.jp/ojad/search (conjugation accent search)
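
For what it's worth, the throttling part is simple; a rough sketch, leaving out OJAD's actual forms and URLs entirely, could look like this:

import time
from selenium import webdriver

def fetch_pages(urls, delay_seconds=5.0):
    # Hypothetical throttled fetch loop; OJAD-specific form handling omitted.
    driver = webdriver.Firefox()
    try:
        pages = []
        for url in urls:
            driver.get(url)
            pages.append(driver.page_source)
            time.sleep(delay_seconds)  # be polite and throttle the requests
        return pages
    finally:
        driver.quit()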

javdejong commented 6 years ago

I definitely don't want to put any pressure on you. I myself only recently got more time (and motivation) to work on this again ;) It'd be great if you could clean up the commits a little, and remove obsolete comments/etc. That'd make it easier for me to review/comment on.

I was looking into allowing both of us to push new revisions of this add-on to AnkiWeb, but I'm not sure that is possible. I can definitely give you commit rights to master though, if you feel comfortable with that. For small fixups/changes you can then commit to it at your discretion, although larger changes (like this MR) would definitely benefit from an additional set of eyes checking it.

SpongebobSquamirez commented 6 years ago

Yeah, I can clean up the files ASAP, likely sometime this weekend. I can't promise more than that in a timely manner, though, sorry to say. When I do find time to work on things it's usually patching things that are urgent, like tweaking other Anki plugins to get them to function properly for my decks, or automating other data collection for card creation; so it tends to be whatever bugs me/distracts me at the right moment. Not that I like to abandon my work. It's just that studying Japanese is my priority on the weekends. Anyway, thanks for understanding.

I actually got asked today by someone whether I was going to make a new AnkiWeb page or update the old one so he could download the updated version easily, and I didn't have a good answer for him; I just redirected him to this branch with directions on how to update manually. It would be helpful to at least have commit rights, to be safe, in case one of us falls off the face of the Earth. Also, I haven't looked at AnkiWeb's hosting, but I guess if the 2.1 version can be kept backwards compatible then maybe we only need one AnkiWeb page; if not, having both a 2.1 and a 2.0 page might be good. Either way, if it's not easy to share access, I'll have to leave that in your hands.

javdejong commented 6 years ago

When uploading add-ons you have two fields; one for the 2.0 version and one for the 2.1 version. I only uploaded a new version for 2.1; the 2.0 version is stuck at the old release for now.

I'm not sure it's worth it to keep the add-on backwards compatible with 2.0 due to changes in Python and Qt. Might be more trouble than it's worth, but if there are enough people out there that cannot upgrade for some reason (?) we could continue to support 2.0.

EDIT: Asked a question about shared development here: https://anki.tenderapp.com/discussions/ankiweb/3357-share-development-on-add-on If it is not possible, we can think about supporting some update functionality like

import urllib.request
import zipfile

# Grab the latest master branch as a zip archive from GitHub
url = "https://github.com/javdejong/nhk-pronunciation/zipball/master"
urllib.request.urlretrieve(url, "master.zip")

# Unpack it into a staging directory for the new version
with zipfile.ZipFile("master.zip", 'r') as z:
    z.extractall("new_version")

EDIT2: Well, it's not possible to give you rights to the AnkiWeb add-on, so we'll have to build our own update system. I was looking around a bit, and it seems the AwesomeTTS add-on has fancy updating capabilities, so we might be able to use it as a reference (if not copy/paste it almost entirely). There might be other add-ons out there with similar capabilities that are better suited for us to use.

SpongebobSquamirez commented 6 years ago

I cleaned up the code just now. You'll notice that the Japanese note-type check is still commented out, but that's just because a lot of users didn't read the fine print and change their note types to work with that function.

About compatibility: I don't know whether it's worth having a backwards-compatible version in the long term (or even right now), but at the moment there are technically two functioning versions, so I think at least not breaking the 2.0 version would be good if handling the Qt code is too much of a pain to use a single codebase for both versions. Based on this thread (and I'm sure there are other, better indicators), plenty of people still use 2.0: https://www.reddit.com/r/Anki/comments/9nh13z/do_you_guys_use_anki_21_or_20_why/

Also, I use 2.0 lol. I think most people either care about 1) their plugins not having 2.1 versions (perhaps because the original creators are no longer maintaining them) or 2) things potentially breaking, especially because this is a "beta." I'm in the first camp, myself.

I spent a couple minutes looking at AwesomeTTS's update capabilities just now (sorry, I really only spent a couple minutes on it). Are you talking about having an updating system that pulls from somewhere other than Ankiweb? Also, I noticed they use Qt even in the 2.0 version.

Oh, one other thing. You mentioned splitting sentences with Mecab, which is a fallback I added to improve splitting of the expression (though it imports Mecab from the Japanese Support add-on, which I noticed you removed in the 2.1 update). Whether we want to bundle Mecab natively with this add-on comes down to removing dependencies (by importing a bundled copy directly) versus avoiding redundancy (by relying on the file in the Japanese Support add-on's directory). It's just a guess, but I do think that pretty much every Japanese learner will already have the Japanese Support add-on.
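
Just to illustrate the trade-off, the reuse option would look roughly like this (a sketch of the 2.0-style layout as I understand it, not code from this branch; the fallback branch is hypothetical):

try:
    # Reuse the Japanese Support add-on's MeCab wrapper if it's installed
    # (in Anki 2.0 its add-on folder is importable as "japanese").
    from japanese.reading import MecabController
except ImportError:
    # Otherwise we'd have to ship and import a bundled copy ourselves.
    MecabController = None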

Let me know what you think of the changes I made, and of adding the entire bs4 folder for just one function haha.

javdejong commented 6 years ago

I'll see if I can review somewhere in the coming week.

  1. We can change the default of the note type check if you want. I still want the option of filtering on it, though. I would say that an empty list means "all note types", and that would be the default (see the sketch after this list).
  2. OK. I will have to check how other add-ons support both versions though.
  3. Update system: correct; we could make one that just pulls the latest tag (or just master) from GitHub.
  4. I removed the Japanese import because I couldn't get the import to work anymore. I'm not sure if add-ons can import other add-ons anymore.
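
For point 1, a minimal sketch of what that default could look like (the "noteTypes" key and the helper name are just placeholders I'm using here, reading the add-on config through Anki 2.1's API):

from aqt import mw

def note_type_allowed(note_type_name):
    config = mw.addonManager.getConfig(__name__) or {}
    wanted = config.get("noteTypes", [])
    if not wanted:
        # Empty list: no filtering, run on every note type.
        return True
    return any(w.lower() in note_type_name.lower() for w in wanted)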

As far as removing HTML markup goes, something using the stdlib would have my preference. Although it may not be as good as BeautifulSoup, it might be good enough:

import sys

if sys.version_info.major == 2:
    from HTMLParser import HTMLParser
else:
    from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        # On Python 3, HTMLParser is a new-style class, so super() works;
        # on Python 2 it's old-style, so call HTMLParser.__init__ directly.
        if issubclass(self.__class__, object):
            super(HTMLTextExtractor, self).__init__()
        else:
            HTMLParser.__init__(self)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
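
Usage would just be something like this (assuming plain tags without entity references):

# e.g. html_to_text("<b>後れた</b><br>【おくれた】") returns "後れた【おくれた】"
text = html_to_text("<b>後れた</b><br>【おくれた】")
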
SpongebobSquamirez commented 6 years ago

Okay, sounds good. I'll try to test that HTML markup removal code this weekend. I'm currently using the same HTML parser with bs4, so it might work... or rather, I hope it does. Looking into things a little more, the best parsers might be html5lib, lxml, or even html2text (though I'm not sure about that last one). https://www.reddit.com/r/Python/comments/1x8oar/a_quick_note_on_my_pleasant_experience_with/

I've had good experiences with lxml, though, and html5lib is pure Python. Your solution is short and sweet in terms of file size, though. Let me know if you have any thoughts on this. I don't plan on doing very extensive testing, tbh.

I'm looking into maybe improving the sentence parser. If we're going to move away from the Japanese Support plugin and include a tokenizer natively, it would be a good opportunity to see whether the parsing can be improved. According to my coworkers (I work in Japan), Mecab is the best, or at least the most well-known, Japanese sentence tokenizer, but I need to look into the Japanese Support add-on's implementation settings. It's possible they aren't optimal, and while I'm at it I figure I'll run some tests with the other well-known tokenizers to see if the parsing can't be improved. After all, our goal is to split the expression up so it can be looked up in a dictionary. Mecab (and maybe other splitters) can apparently also give the citation form of words. There might also be splitters/settings that split compound words up in a way the NHK dictionary can handle, even if some other dictionaries list definitions for the larger compound separately, or a way to try both the compound and the component words, etc.

By the way, how did you get the CSV file for the dictionary? Did you parse the actual EPWING file, or did you run across the CSV online somewhere? It's rare, but occasionally on lookups I'll run across a word that's not in NHK, but is in 三省堂 / 新明解国語辞典, which also lists accent (though it uses the number system). I could try using the zero-EPWING parser to make another CSV, or a master CSV, if you think it'd be worthwhile to combine the two during lookups (e.g. Sanseido could be used as a backup). I'm planning on getting some familiarity with either zero-EPWING or goldendict's EPWING parsing anyway for my own plugin (I want to try to extract the audio from my NHK accent dictionary), so I might have time to look into that. Still, it might only yield a 10% increase in the number of covered words, or something...

javdejong commented 5 years ago

The CSV file is just an export of the Access database used by some old NHK Accent Dictionary software; I don't think I ever had to deal with EPWING. If there are words missing (or other pronunciations possible), I'm very much in favor of making a master CSV. But you're probably right that merging/adding pronunciations is not worth the effort, especially for more obscure words; I doubt the Japanese themselves even care/know (not to mention that accent varies from region to region anyway).

EDIT: I'll spend tomorrow trying to get Anki 2.0 support back in master, and then reviewing your stuff (probably also by committing things on top for you to check in turn).

SpongebobSquamirez commented 5 years ago

Sounds good.

By the way, I found a missing feature/parsing failure (though the original version couldn't handle this either, so it shouldn't slow down rolling things out in this state). A lot of these bugs arise when the expression isn't in a clean format (e.g. it's a fragment), and while I'd like to blame the deck creators for that, I also have to admit that sentence parsing is highly desirable, and fragments are parsed the same way. The bug is that words ending in kana (e.g. 明らか) are reduced to their kanji components if they aren't written exactly as in the dictionary (provided the kanji component has an entry; otherwise I think it will probably work fine, since it will then fall back to Mecab). For example, 明らか parses properly, but 明らかな parses as 明.

The no_kana function is what causes this parsing bug (compounded by the fact that Mecab isn't used unless the lookup fails to return something for every expression). We could try to make an inefficient workaround for this, but I think updating the Mecab parser settings to return citation forms is probably the best and easiest overall fix; after that, we could just use whatever the tokenizer gives back. I'm a bit concerned about doing that, since Mecab has the tendency to split too much that I mentioned before (which is why Mecab is just a backup right now), but the benefits of making it the default parser might outweigh the costs.
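
To make "citation forms from the tokenizer" concrete, the idea is roughly the following (sketch only; it assumes a mecab binary on PATH with an IPAdic-style dictionary, where the seventh feature field is the dictionary form, rather than going through the Japanese Support wrapper we actually call):

import subprocess

def citation_forms(sentence):
    out = subprocess.run(["mecab"], input=sentence, capture_output=True,
                         text=True, check=True).stdout
    forms = []
    for line in out.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # The 7th IPAdic feature is the citation (dictionary) form; "*" means
        # MeCab doesn't know it, so fall back to the surface form.
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        forms.append(base)
    return forms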

SpongebobSquamirez commented 5 years ago

@javdejong Sorry for the radio silence. I haven't tested the recent pull/merge you made, but I think it looks good. I've almost finished adding a few new functionalities.

I'll push when I've finished testing all of this, hopefully sometime in the next few hours if I get around to it. Tbh, for testing I'm just using an Anki deck with a bunch of test cases, because I haven't looked into Anki plugin development enough to figure out how best to test Anki add-ons from the command line.

It might be good to split these shared functionalities (essentially all related to word lookups) off into their own file. Unfortunately I'm not sure if it makes much sense to release this as an entirely separate "Japanese dictionary lookup patch addon," though if that was done it could maybe be something that rewrites people's existing addon files to be compatible/call the right functions lol, so it updates multiple plugins with a single install...but idk...

SpongebobSquamirez commented 5 years ago

Finished. Details in the commit. Let me know what you think, or feel free to open pull requests.

I didn't get around to testing how it performs with the other addons yet.

Unfortunately, I actually increased the dependency on the Japanese Support add-on. We'd have to take about half the code from their reading.py file, in addition to the Mecab executables, if we want to eliminate the dependency, sooo yeah.

I also modified the config file (though I haven't tested that).

SpongebobSquamirez commented 5 years ago

Okay, so there wasn't a good way to use Mecab with different arguments from the Japanese Support add-on without recreating the controller for every lookup. Doing that gave me system resource problems, so I just integrated the code directly into this plugin. I removed the "dependency" from the code, but it still needs to access the files inside the Japanese Support add-on's folder directly. I feel like that ought to work even with 2.1, though.

This commit should be the last one for this update. I don't really have any other ideas in mind after this, so I think all that's left is to bring these changes to 2.1/merge them with master after review.

javdejong commented 5 years ago

OK. Sounds good. I will definitely have to take a look at how one might go about importing the Japanese add-on then (or any other add-on for that matter). As far as your PR is concerned, I will probably be splitting/squashing/merging it in separate commits per feature while testing said feature. Thanks for the hard work so far!

SpongebobSquamirez commented 5 years ago

Do you want me to check each of the commits to the new branches you're opening?

javdejong commented 5 years ago

I will assign you a PR for the HTML cleanup and delimiter splitting now, and one in a few days for Mecab and the expression splitting on separators and with Mecab. If it's not too much to ask, I would like you to rebase/rework your color changes on top of that last one (in one proper/neat commit). There's a whole bunch of "tryout" commits in this PR, which imo should not end up in master.

EDIT: I think I couldn't quite get the colorization to work on my end, which is why I focused on just getting the splitting in. I was thinking it might make more sense to colorize the "Reading" field and not (just) the Expression. I'm also not sure what we should do about coloring when a word has more than one pronunciation.

SpongebobSquamirez commented 5 years ago

@javdejong When you say colorize the reading field, you mean the field used by the Japanese support addon?

I would also like to be able to do that, but it's harder than for a plain sentence because of all the braces that are inserted in the middle of words; those would have to be ignored when color is added. By the way, in my implementation I only scan sentences for exact dictionary matches.
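
For reference, one simple approach (stripping the readings rather than skipping over them) would be something like this sketch, assuming the Japanese Support format of 漢字[かんじ] with a space before each word (names and regex here are illustrative, not code from this branch):

import re

# Match "word[reading]" pairs, optionally preceded by the separator space.
READING_RE = re.compile(r" ?([^ \[]+)\[[^\]]*\]")

def strip_readings(reading_field):
    # Drop the bracketed kana so the remaining text can be matched against
    # plain dictionary entries before colorizing.
    return READING_RE.sub(r"\1", reading_field)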

Can you show me a picture of what you get when you try your implementation? Maybe we should first agree on what the proper output should look like. Are you comparing your output to my test deck?

javdejong commented 5 years ago

Closing, as most of this has been resolved by merging #18 .

SpongebobSquamirez commented 5 years ago

@javdejong Could you get back to me about my last message (above)? I didn't tag you initially so I'm not sure if you saw it.

javdejong commented 5 years ago

@weirdalsuperfan Yes, the reading field used by the Japanese support addon. The braces should indeed be ignored (or just included in the colorization). I haven't built any colorization myself yet, as it was not as pressing as the other fixes/improvements were.

As far as my desired output is concerned, it's basically just the video that you linked. There the reading field is colorized, not the expression. We could of course do both, but that might be a little overkill.

EDIT: Another reason to use the information in the reading field is that it takes away (some of) the ambiguity about which pronunciation to use for a certain word (when a kanji compound corresponds to multiple hiragana spellings). If the hiragana spelling is the same but there are still multiple pronunciations (odaka and heiban, or something), then I'm not sure what to do yet.