javdejong / nhk-pronunciation

Anki2 Add-On to look-up the pronunciation of Japanese expressions.

Fixed issues w/working on multiple words #12

Closed: SpongebobSquamirez closed this 5 years ago

SpongebobSquamirez commented 6 years ago

There are 2 main issues I've found with the pronunciation add-on, but I've found a way to fix them with some tweaks.

1) The card type name itself needs to contain the word "japanese" somewhere.

2) The expression has to be a single word, which in practice usually means no kana attached (especially no grammatical elements like particles). So things like these won't work: 後で; 川上・川下 or other fields with punctuation; whole sentences; and entries with readings attached (because of the extra characters), e.g. 後れた【おくれた】.

I've fixed the requirement of a Japanese card type (just commented that stuff out), the unnecessary-kana problem (for most use cases), the punctuation/separators problem, and the problem of multiple words strung together. This fix obviously focuses on individual words rather than prosody (which you should use OJAD/suzuki-kun for). I made it try an English/Japanese punctuation split first. If this fails, it uses the Japanese Support plugin's Mecab-type sentence splitting (which will correctly look up words even if they are separated by kana).

Sentence-type splitting occasionally splits words too much, like 歩行者 becoming 歩行 and 者, so direct lookups with simple punctuation splits are preferable when possible.
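
Roughly, the logic is the following (a minimal sketch, not the exact code in this branch; the punctuation set and the lookup/mecab_split hooks are placeholders):

import re

# Characters treated as separators for the first-pass split (illustrative set).
PUNCTUATION = re.compile(r"[\s,.;:/()、。，．・／（）【】；：]+")

def split_expression(expr, lookup, mecab_split=None):
    # First pass: split on English/Japanese punctuation and whitespace.
    parts = [p for p in PUNCTUATION.split(expr) if p]
    # Fallback: if some part has no dictionary entry, let a MeCab-style
    # splitter (e.g. the Japanese Support plugin's) re-segment the whole
    # expression, so words joined by kana can still be looked up.
    if mecab_split and not all(lookup(p) for p in parts):
        return mecab_split(expr)
    return parts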

I may add some more optimizations later to try to recombine overly split words, or implement some of the things from the issues section, but in any case it's now much more usable than before, and you should no longer have lookup problems as long as your words are in the NHK dictionary used here. These fixes took this add-on from being somewhat helpful when I got lucky to working fabulously for me.

javdejong commented 6 years ago

I think the "noteTypes" stuff has been addressed in the conversion to 2.1, where users can now configure the note type in the config.json (from the Anki UI).

I agree that better pronunciation for sentences or verbs would be very nice. I currently use Anki mainly to cram nouns, and just type the noun first in the "expression" field. Then it generates example sentences and the pronunciation. Then I change the expression to turn it into a short sentence (otherwise it becomes hard to remember in reverse).

You mentioned in one of your commits about having a separate "expression" and "sentence" field, and that it might be useful to generate pronunciations for both. That is definitely something we could/should do, especially if the sentence can be split with mecab (or we use this OJAD tool, if that's even allowed).

SpongebobSquamirez commented 6 years ago

Hey, thanks for making the updates to the 2.1 version. It would definitely be great if we could merge this 2.0 version with the new 2.1 version. People I've been talking to in Anki forums and IRL value the multi-word/sentence splitting more than anything, and there are also quite a few people starting Anki who are using 2.1 and want this. I'll have to re-read my commit, but just to clarify: this version already does sentence/delimiter/Japanese-vs-non-Japanese splitting with Mecab, regex, etc. Regarding OJAD, I may have said some conflicting things in earlier commits (it was a WIP at the time), but the sentence-pronunciation/OJAD comment I made was about prosody and OJAD/conjugation rather than the dictionary-form accents that NHK gives. It would be good to keep those fields separate, since learners can derive prosody from dictionary-form accent if they know all the rules, but not vice versa.

I'm pretty busy during the week, and I'm glad you're going through these issues quickly, so I'll try to follow up properly this weekend; I just can't give this any bandwidth during the weekdays. If I remember correctly, I fixed all the known bugs and documented everything in the latest commit, but if you look at the code there are a couple of old comments suggesting there are bugs (although I think those bugs are gone), an unused function, some commented-out stuff that I just haven't properly deleted, etc. If we could double-check that it's cleaned up first, I think we could go from there to tweak the implementation, add features, and bring this to 2.1. I've been thinking about doing that myself, just following Anki's guidelines, but since I don't use 2.1 I think I'd need to install 2.1 on a VM to test properly, ya? And I just haven't felt like doing that yet. ^_^;

Using OJAD for studying purposes is allowed, and scraping isn't against their robots.txt, but because of the way the site works we would need Selenium to scrape it, and we should throttle the requests; both of those things mean that generation would be relatively slow. The "Awesome text to speech" add-on (https://ankiweb.net/shared/info/301952613) apparently already has good prosody, but since it doesn't display it, it's not as educational as OJAD (which can generate audio as well anyway, making that other add-on obsolete). By the way, in case you're not sure what I'm talking about, these are the pages I mean:

http://www.gavo.t.u-tokyo.ac.jp/ojad/eng/phrasing/index (prosody/sentence-level accent generation)
http://www.gavo.t.u-tokyo.ac.jp/ojad/search (conjugation accent search)
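
For what it's worth, the throttling part is simple; a rough sketch, leaving out OJAD's actual forms and URLs entirely, could look like this:

import time
from selenium import webdriver

def fetch_pages(urls, delay_seconds=5.0):
    # Hypothetical throttled fetch loop; OJAD-specific form handling omitted.
    driver = webdriver.Firefox()
    try:
        pages = []
        for url in urls:
            driver.get(url)
            pages.append(driver.page_source)
            time.sleep(delay_seconds)  # be polite and throttle the requests
        return pages
    finally:
        driver.quit()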

javdejong commented 6 years ago

I definitely don't want to put any pressure on you. I myself only recently got more time (and motivation) to work on this again ;) It'd be great if you could clean up the commits a little, and remove obsolete comments/etc. That'd make it easier for me to review/comment on.

I was looking into allowing both of us to push new revisions of this add-on to AnkiWeb, but I'm not sure that is possible. I can definitely give you commit rights to master though, if you feel comfortable with that. For small fixups/changes you can then commit to it at your discretion, although larger changes (like this MR) would definitely benefit from an additional set of eyes checking it.

SpongebobSquamirez commented 6 years ago

Yeah, I can clean up the files ASAP, likely sometime this weekend. I can't promise more than that in a timely manner, though, sorry to say. When I do find time to work on things it's usually patching things that are urgent, like tweaking other Anki plugins to get them to function properly for my decks, or automating other data collection for card creation; so it tends to be whatever bugs me/distracts me at the right moment. Not that I like to abandon my work. It's just that studying Japanese is my priority on the weekends. Anyway, thanks for understanding.

I actually got asked today by someone whether I was going to make a new AnkiWeb page or update the old one so he could download the updated version easily, and I didn't have a good answer for him; I just redirected him to this branch with directions on how to update manually. It would be helpful to at least have commit rights, to be safe, in case one of us falls off the face of the Earth. Also, I haven't looked at AnkiWeb's hosting, but I guess if the 2.1 version can be kept backwards compatible then maybe we only need one AnkiWeb page; if not, having both a 2.1 and a 2.0 page might be good. Either way, if it's not easy to share access, I'll have to leave that in your hands.

javdejong commented 6 years ago

When uploading add-ons you have two fields; one for the 2.0 version and one for the 2.1 version. I only uploaded a new version for 2.1; the 2.0 version is stuck at the old release for now.

I'm not sure it's worth it to keep the add-on backwards compatible with 2.0 due to changes in Python and Qt. Might be more trouble than it's worth, but if there are enough people out there that cannot upgrade for some reason (?) we could continue to support 2.0.

EDIT: Asked a question about shared development here: https://anki.tenderapp.com/discussions/ankiweb/3357-share-development-on-add-on If it is not possible, we can think about supporting some update functionality like

import urllib.request
import zipfile

# Grab the latest master branch as a zip archive from GitHub
url = "https://github.com/javdejong/nhk-pronunciation/zipball/master"
urllib.request.urlretrieve(url, "master.zip")

# Unpack it into a staging directory for the new version
with zipfile.ZipFile("master.zip", 'r') as z:
    z.extractall("new_version")

EDIT2: Well, it's not possible to give you rights to the AnkiWeb add-on, so we'll have to build our own update system. I was looking around a bit, and it seems the AwesomeTTS add-on has fancy updating capabilities, so we might be able to use it as a reference (if not copy/paste it almost entirely). There might be other add-ons out there with similar capabilities that are better suited for us to use.

SpongebobSquamirez commented 6 years ago

I cleaned up the code just now. You'll notice that the Japanese note-type check is still commented out, but that's just because a lot of users didn't read the fine print and change their note types to work with that function.

About compatibility: I don't know whether it's worth having a backwards-compatible version in the long term (or even right now), but at the moment there are technically two functioning versions, so I think at least not breaking the 2.0 version would be good if handling the Qt code is too much of a pain to use a single codebase for both versions. Based on this thread (and I'm sure there are other, better indicators), plenty of people still use 2.0: https://www.reddit.com/r/Anki/comments/9nh13z/do_you_guys_use_anki_21_or_20_why/

Also, I use 2.0 lol. I think most people either care about 1) their plugins not having 2.1 versions (perhaps because the original creators are no longer maintaining them) or 2) things potentially breaking, especially because this is a "beta." I'm in the first camp, myself.

I spent a couple minutes looking at AwesomeTTS's update capabilities just now (sorry, I really only spent a couple minutes on it). Are you talking about having an updating system that pulls from somewhere other than Ankiweb? Also, I noticed they use Qt even in the 2.0 version.

Oh, one other thing. You mentioned splitting sentences with Mecab, which is a fallback I added to improve splitting of the expression (though it imports Mecab from the Japanese Support add-on, which I noticed you removed in the 2.1 update). Whether we want to bundle Mecab natively with this add-on comes down to removing dependencies (by importing a bundled copy directly) versus avoiding redundancy (by relying on the file in the Japanese Support add-on's directory). It's just a guess, but I do think that pretty much every Japanese learner will already have the Japanese Support add-on.
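
Just to illustrate the trade-off, the reuse option would look roughly like this (a sketch of the 2.0-style layout as I understand it, not code from this branch; the fallback branch is hypothetical):

try:
    # Reuse the Japanese Support add-on's MeCab wrapper if it's installed
    # (in Anki 2.0 its add-on folder is importable as "japanese").
    from japanese.reading import MecabController
except ImportError:
    # Otherwise we'd have to ship and import a bundled copy ourselves.
    MecabController = None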

Let me know what you think of the changes I made, and of adding the entire bs4 folder for just one function haha.

javdejong commented 6 years ago

I'll see if I can review somewhere in the coming week.

  1. We can change the default of the note type check if you want. I still want the option of filtering on it, though. I would say that an empty list means "all note types", and that would be the default (see the sketch after this list).
  2. OK. I will have to check how other add-ons support both versions though.
  3. Update system: correct; we could make one that just pulls the latest tag (or just master) from GitHub.
  4. I removed the Japanese import because I couldn't get the import to work anymore. I'm not sure if add-ons can import other add-ons anymore.
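
For point 1, a minimal sketch of what that default could look like (the "noteTypes" key and the helper name are just placeholders I'm using here, reading the add-on config through Anki 2.1's API):

from aqt import mw

def note_type_allowed(note_type_name):
    config = mw.addonManager.getConfig(__name__) or {}
    wanted = config.get("noteTypes", [])
    if not wanted:
        # Empty list: no filtering, run on every note type.
        return True
    return any(w.lower() in note_type_name.lower() for w in wanted)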

As far as removing HTML markup goes, something using the stdlib would have my preference. Although it may not be as good as BeautifulSoup, it might be good enough:

import sys

if sys.version_info.major == 2:
    from HTMLParser import HTMLParser
else:
    from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        # On Python 3, HTMLParser is a new-style class, so super() works;
        # on Python 2 it's old-style, so call HTMLParser.__init__ directly.
        if issubclass(self.__class__, object):
            super(HTMLTextExtractor, self).__init__()
        else:
            HTMLParser.__init__(self)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
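
Usage would just be something like this (assuming plain tags without entity references):

# e.g. html_to_text("<b>後れた</b><br>【おくれた】") returns "後れた【おくれた】"
text = html_to_text("<b>後れた</b><br>【おくれた】")
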
SpongebobSquamirez commented 6 years ago

Okay, sounds good. I'll try to test that HTML markup removal code this weekend. I'm currently using the same HTML parser with bs4, so it might work... or rather, I hope it does. Looking into things a little more, the best parsers might be html5lib, lxml, or even html2text (though I'm not sure about that last one). https://www.reddit.com/r/Python/comments/1x8oar/a_quick_note_on_my_pleasant_experience_with/

I've had good experiences with lxml, though, and html5lib is pure Python. Your solution is short and sweet in terms of file size, though. Let me know if you have any thoughts on this. I don't plan on doing very extensive testing, tbh.

I'm looking into maybe improving the sentence parser. If we're going to move away from the Japanese Support plugin and include a tokenizer natively, it would be a good opportunity to see whether the parsing can be improved. According to my coworkers (I work in Japan), Mecab is the best, or at least the most well-known, Japanese sentence tokenizer, but I need to look into the Japanese Support add-on's implementation settings. It's possible they aren't optimal, and while I'm at it I figure I'll run some tests with the other well-known tokenizers to see if the parsing can't be improved. After all, our goal is to split the expression up so it can be looked up in a dictionary. Mecab (and maybe other splitters) can apparently also give the citation form of words. There might also be splitters/settings that split compound words up in a way the NHK dictionary can handle, even if some other dictionaries list definitions for the larger compound separately, or a way to try both the compound and the component words, etc.

By the way, how did you get the CSV file for the dictionary? Did you parse the actual EPWING file, or did you run across the CSV online somewhere? It's rare, but occasionally on lookups I'll run across a word that's not in NHK, but is in 三省堂 / 新明解国語辞典, which also lists accent (though it uses the number system). I could try using the zero-EPWING parser to make another CSV, or a master CSV, if you think it'd be worthwhile to combine the two during lookups (e.g. Sanseido could be used as a backup). I'm planning on getting some familiarity with either zero-EPWING or goldendict's EPWING parsing anyway for my own plugin (I want to try to extract the audio from my NHK accent dictionary), so I might have time to look into that. Still, it might only yield a 10% increase in the number of covered words, or something...

javdejong commented 5 years ago

The CSV file is just an export of the Access database used by some old NHK Accent Dictionary software; I don't think I ever had to deal with EPWING. If there are words missing (or other pronunciations possible), I'm very much in favor of making a master CSV. But you're probably right that merging/adding pronunciations is not worth the effort, especially for more obscure words; I doubt the Japanese themselves even care/know (not to mention that accent varies from region to region anyway).

EDIT: I'll spend tomorrow trying to get Anki 2.0 support back in master, and then reviewing your stuff (probably also by committing things on top for you to check in turn).

SpongebobSquamirez commented 5 years ago

Sounds good.

By the way, I found a missing feature/parsing failure (though the original version couldn't handle this either, so it shouldn't slow down rolling things out in this state). A lot of these bugs arise when the expression isn't in a clean format (e.g. it's a fragment), and while I'd like to blame the deck creators for that, I also have to admit that sentence parsing is highly desirable, and fragments are parsed the same way. The bug is that words ending in kana (e.g. 明らか) are reduced to their kanji components if they aren't written exactly as in the dictionary (provided the kanji component has an entry; otherwise I think it will probably work fine, since it will then fall back to Mecab). For example, 明らか parses properly, but 明らかな parses as 明.

The no_kana function is what causes this parsing bug (compounded by the fact that Mecab isn't used unless the lookup fails to return something for every expression). We could try to make an inefficient workaround for this, but I think updating the Mecab parser settings to return citation forms is probably the best and easiest overall fix; after that, we could just use whatever the tokenizer gives back. I'm a bit concerned about doing that, since Mecab has the tendency to split too much that I mentioned before (which is why Mecab is just a backup right now), but the benefits of making it the default parser might outweigh the costs.
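
To make "citation forms from the tokenizer" concrete, the idea is roughly the following (sketch only; it assumes a mecab binary on PATH with an IPAdic-style dictionary, where the seventh feature field is the dictionary form, rather than going through the Japanese Support wrapper we actually call):

import subprocess

def citation_forms(sentence):
    out = subprocess.run(["mecab"], input=sentence, capture_output=True,
                         text=True, check=True).stdout
    forms = []
    for line in out.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # The 7th IPAdic feature is the citation (dictionary) form; "*" means
        # MeCab doesn't know it, so fall back to the surface form.
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        forms.append(base)
    return forms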

SpongebobSquamirez commented 5 years ago

@javdejong Sorry for the radio silence. I haven't tested the recent pull/merge you made, but I think it looks good. I've almost finished adding a few new functionalities.

I'll push when I've finished testing all of this, hopefully sometime in the next few hours if I get around to it. Tbh, for testing I'm just using an Anki deck with a bunch of test cases, because I haven't looked into Anki plugin development enough to figure out how best to test Anki add-ons from the command line.

It might be good to split these shared functionalities (essentially all related to word lookups) off into their own file. Unfortunately I'm not sure if it makes much sense to release this as an entirely separate "Japanese dictionary lookup patch addon," though if that was done it could maybe be something that rewrites people's existing addon files to be compatible/call the right functions lol, so it updates multiple plugins with a single install...but idk...

SpongebobSquamirez commented 5 years ago

Finished. Details in the commit. Let me know what you think, or feel free to open pull requests.

I didn't get around to testing how it performs with the other addons yet.

Unfortunately, I actually increased the dependency on the Japanese Support add-on. We'd have to take about half the code from their reading.py file, in addition to the Mecab executables, if we want to eliminate the dependency, sooo yeah.

I also modified the config file (though I haven't tested that).

SpongebobSquamirez commented 5 years ago

Okay, so there wasn't a good way to use Mecab with different arguments from the Japanese Support add-on without recreating the controller for every lookup. Doing that gave me system resource problems, so I just integrated the code directly into this plugin. I removed the "dependency" from the code, but it still needs to access the files inside the Japanese Support add-on's folder directly. I feel like that ought to work even with 2.1, though.

This commit should be the last one for this update. I don't really have any other ideas in mind after this, so I think all that's left is to bring these changes to 2.1/merge them with master after review.

javdejong commented 5 years ago

OK. Sounds good. I will definitely have to take a look at how one might go about importing the Japanese add-on then (or any other add-on for that matter). As far as your PR is concerned, I will probably be splitting/squashing/merging it in separate commits per feature while testing said feature. Thanks for the hard work so far!

SpongebobSquamirez commented 5 years ago

Do you want me to check each of the commits to the new branches you're opening?

javdejong commented 5 years ago

I will assign you a PR for the HTML cleanup and delimiter splitting now, and one in a few days for Mecab and the expression splitting on separators and with Mecab. If it's not too much to ask, I would like you to rebase/rework your color changes on top of that last one (in one proper/neat commit). There's a whole bunch of "tryout" commits in this PR, which imo should not end up in master.

EDIT: I think I couldn't quite get the colorization to work on my end, which is why I focused on just getting the splitting in. I was thinking it might make more sense to colorize the "Reading" field and not (just) the Expression. I'm also not sure what we should do about coloring when a word has more than one pronunciation.

SpongebobSquamirez commented 5 years ago

@javdejong When you say colorize the reading field, you mean the field used by the Japanese support addon?

I would also like to be able to do that, but it's harder than for a plain sentence because of all the braces that are inserted in the middle of words; those would have to be ignored when color is added. By the way, in my implementation I only scan sentences for exact dictionary matches.
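
For reference, one simple approach (stripping the readings rather than skipping over them) would be something like this sketch, assuming the Japanese Support format of 漢字[かんじ] with a space before each word (names and regex here are illustrative, not code from this branch):

import re

# Match "word[reading]" pairs, optionally preceded by the separator space.
READING_RE = re.compile(r" ?([^ \[]+)\[[^\]]*\]")

def strip_readings(reading_field):
    # Drop the bracketed kana so the remaining text can be matched against
    # plain dictionary entries before colorizing.
    return READING_RE.sub(r"\1", reading_field)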

Can you show me a picture of what you get when you try your implementation? Maybe we should first agree on what the proper output should look like. Are you comparing your output to my test deck?

javdejong commented 5 years ago

Closing, as most of this has been resolved by merging #18 .

SpongebobSquamirez commented 5 years ago

@javdejong Could you get back to me about my last message (above)? I didn't tag you initially so I'm not sure if you saw it.

javdejong commented 5 years ago

@weirdalsuperfan Yes, the reading field used by the Japanese support addon. The braces should indeed be ignored (or just included in the colorization). I haven't built any colorization myself yet, as it was not as pressing as the other fixes/improvements were.

As far as my desired output is concerned, it's basically just the video that you linked. There the reading field is colorized, not the expression. We could of course do both, but that might be a little overkill.

EDIT: Another reason to use the information in the reading field is that it takes away (some of) the ambiguity about which pronunciation to use for a certain word (when a kanji compound corresponds to multiple hiragana spellings). If the hiragana spelling is the same but there are still multiple pronunciations (odaka and heiban, or something), then I'm not sure what to do yet.