Deprecate and ask for help with AppleSpell provider

rrthomas commented 3 months ago

The AppleSpell provider seems to be buggy these days (see e.g. #391), and I don't use macOS. I'd be happy to look into this on a paid consultancy basis; otherwise, I'd be delighted to have help from a macOS-using developer.

In the mean time, I am minded to deprecate the provider, and put it in the same category as the Zemberek back-end: not built by default, you get a warning if you enable it, and if it crashes that's a shame, but not my problem.

nossralf commented 4 weeks ago

Since #391 was closed in favor of this, I just thought I'd add that I've managed to capture the same crash with debug symbols by using the reproduction steps in that issue. (I found that issue because I also have Emacs crashes via the Jinx package.)

I myself use Homebrew and I'm able to reproduce with the library built via Homebrew (using brew install --build-from-source --debug-symbols enchant to get debug symbols). I've also built enchant from Git and I get the same crash there.

I only have access to an x86 Mac at work (at home I run an M1 Mac and have no crash issues). The following debugger information is using Python 3.13 with debug symbols built by pyenv, enchant from Git (the master branch, compiled with --enable-relocatable --disable-dependency-tracking because that's what Homebrew uses), and the latest published version of pyenchant (3.3.2).

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x60)
  * frame #0: 0x000000010063ca02 libenchant-2.2.dylib`enchant_dict_finalize(obj=0x00006000016c9b60) at dict.vala:57:19 [opt]
    frame #1: 0x000000010063c6a5 libenchant-2.2.dylib`enchant_dict_unref(instance=0x00006000016c9b60) at dict.vala:44:3 [opt]
    frame #2: 0x0000000100639b9b libenchant-2.2.dylib`enchant_broker_new_dict(self=<unavailable>) at broker.vala:422:2 [opt]
    frame #3: 0x000000010060135a enchant_applespell.so`appleSpell_provider_request_dict(me=0x00006000019c8700, tag="zh") at applespell_checker.mm:304:24 [opt]
    frame #4: 0x000000010063a03d libenchant-2.2.dylib`_enchant_broker_request_dict(self=0x00006000032cf780, tag="zh", pwl=0x0000000000000000) at broker.vala:274:14 [opt]
    frame #5: 0x0000000100639d5e libenchant-2.2.dylib`enchant_broker_request_dict_with_pwl(self=0x00006000032cf780, composite_tag=<unavailable>, pwl=0x0000000000000000) at broker.vala:302:16 [opt]
    frame #6: 0x00007ff81c541882 libffi.dylib`ffi_call_unix64 + 82
...

It looks like this.session is null in the EnchantDict at the time of destruction which means this.session.owner will explode. But I may be misinterpreting things since I don't have any experience at all with Vala and very limited experience with C (limited enough to not understand how the generated C files and their typedefs actually work). LLDB at least says so:

(lldb) frame variable
(EnchantDict *) obj = 0x00006000016c9b60
(EnchantDict *) self = 0x00006000016c9b60
(EnchantProvider *) owner = NULL
(EnchantSession *) _tmp0_ = NULL
(EnchantProvider *) _tmp2_ = <variable not available>

(EnchantProvider *) _tmp1_ = <variable not available>

(lldb) p *self
(EnchantDict) {
  parent_instance = {
    g_class = 0x00006000030cc600
  }
  ref_count = 0
  priv = NULL
  user_data = 0x0000000000000000
  session = NULL
  check_method = 0x0000000000000000
  suggest_method = 0x0000000000000000
  add_to_session_method = 0x0000000000000000
  remove_from_session_method = 0x0000000000000000
  get_extra_word_characters_method = 0x0000000000000000
  is_word_character_method = 0x0000000000000000
}

rrthomas commented 4 weeks ago

Thanks for this, I'll have a look when I get a moment.

nossralf commented 3 weeks ago

I've managed to produce the crash at home as well, and have been poking at this more in LLDB. It's pretty clear that the crash happens because this.session is null. The address 0x60 in EXC_BAD_ACCESS is because the offset of the provider field in the session class struct (on the C level) is 0x60 bytes, so first this.session is loaded into a register, and then the attempt is to load the value at offset 0x60 from that value, and we end up at 0x0 + 0x60 = 0x60.

Fundamentally, the AppleSpell provider is different from e.g. Hunspell, Aspell in that it does no checking by itself whether the requested language exists or not during the dictionary request phase (in appleSpell_provider_request_dict). It always calls enchant_broker_new_dict. The others first check whether the requested language even exists, and if not, they return null and never call enchant_broker_new_dict which means there is no partially initialized data to finalize and the error does not occur.

I must admit I don't quite understand how this code path is triggered via Jinx in Emacs. All the above is based on the reproduction steps in #391 which uses pyenchant. On my x86 Mac, a single call to request_dict in the Broker class in pyenchant is enough to trigger the crash, but on my M1 Mac I have to run it twice, ignoring the first result (which is an exception complaining that the dictionary doesn't exist). I don't understand why this differs and haven't yet tried to figure out why (maybe it has to do when the Python interpreter disposes of the pyenchant data and the finalizer is triggered).

Anyhow, I figured I'd add some more context. Ultimately I don't know whether it's actually helpful or not. I will try to see if I can monkey around with the provider and do a similar existence check in appleSpell_provider_request_dict that's used in the other providers.

rrthomas commented 3 weeks ago

Thanks very much for your efforts; the information above does look useful if you don't manage to get any further yourself.

nossralf commented 3 weeks ago

I'm now confident I understand why this is happening. In appleSpell_provider_request_dict an EnchantDict is always created first (and thereby added to the broker's set of dictionaries, as it's created via EnchantBroker.new_dict).

But if the actual dictionary request then fails, the EnchantDict is freed immediately (despite the broker still having a reference to it!) and the function returns null. The null return value causes EnchantBroker._request_dict to not set dict.session, and once the EnchantDict destructor runs it will explode, since this.session is null.

If the EnchantDict destructor would not run almost immediately, there would be a chance of junk data being written to the freed memory I suppose. (In my testing, if I simply remove the call to g_free, the crash occurs once the broker is destructed, with the same cause as this.session is still null in all the EnchantDicts that are created despite no dictionary being available for the given code. So all it does is delay when the dict's destructor runs, at least in my testing scenario.)

The solution is to move the creation of the EnchantDict to occur only after all its prerequisites are guaranteed to exist in appleSpell_provider_request_dict (i.e. if allocating the AppleSpellDictionary and finding a dictionary are both successful). That will also remove the calls to g_free (dict) which happen right now in an attempt to clean up if one of these prerequisites fails to materialize.

I will send a PR with this fix if this sounds reasonable to you. My question is if you want the PR to only state that it fixes #391? This issue (meaning #392) is more broad than just this one bug and fixing the bug doesn't necessarily answer the questions posed here.

rrthomas / enchant

Deprecate and ask for help with AppleSpell provider #392