Open hsanson opened 4 years ago
After more testing just calling :tag killed three times in sequence crashes alot. This happens on any search I have and using any tag.
Most likely some version issues with Xapian database but I am unable to debug this. I tried using python3-gdb but I get no symbols or backtrace to see what is going on. Any tips on how to debug this would be appreciated.
[~]> gdb /usr/bin/python3.6-dbg core
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python3.6-dbg...done.
warning: core file may not match specified executable file.
[New LWP 17123]
Core was generated by `/usr/bin/python3 /home/ryujin/.local/bin/alot'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f76cf7dbe97 in ?? ()
(gdb) py-bt
Unable to locate python frame
(gdb) bt
#0 0x00007f76cf7dbe97 in ?? ()
#1 0x0000000000000000 in ?? ()
(gdb)
After more investigation by setting debug messages all over the place I found the crash happens in this line: alot/ui.py#L655.
After setting a tag in a search buffer that causes that thread to be removed from the buffer, alot goes like this:
If the thread is not removed from the seach buffer then the crash does not occur. This can be tested by commenting out the searchbuffer.threadlist.remove() call.
This is a urwid MainLoop that is outside of my debugging skills. This same program path works fine twice but the third time always aborts. Most probably some race condition or similar going on inside that MainLoop due to the async/await being used.
For reference I tried this with urwid 2.0.1 and 2.1.0 versions and it crashes with both.
Thanks a lot for the debugging effort! It sounds like a problem that I had as well but I was not able to reproduce it reliably. I have the impression that it has something to do with the underlying notmuch library being called in an unsafe way, because no python library would simply segfault in this way without any error logs..
I'll try to look into it over the holiday break. Cheers, P
Quoting Horacio Sanson (2019-12-12 14:04:24)
After more investigation by setting debug messages all over the place I found the crash happens in this line: alot/ui.py#L655.
After setting a tag in a search buffer that causes that thread to be removed from the buffer, alot goes like this:
- Call TagCommand apply() method.
- After applying the tags it invokes FlushCommand with an inner refresh() function as callback.
- Since the message tagged no longer matches the search query, the refresh() method removes it by invoking searchbuffer.threadlist.remove() on the thread.
- The refresh() callback finishes by calling ui.update() that crashes (aborts) when it reaches draw_screen() in the Mainloop object.
If the thread is not removed from the seach buffer then the crash does not occur. This can be tested by commenting out the searchbuffer.threadlist.remove () call.
This is a urwid MainLoop that is outside of my debugging skills. This same program path works fine twice but the third time always aborts. Most probably some race condition or similar going on inside that MainLoop dot to the async/ await being used.
For reference I tried this with urwid 2.0.1 and 2.1.0 versions and it crashes with both.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.*
I can reproduce this as well, bisecting points at commit e7e0c52db9093a9ecd9dcaa0766e66515a546a75 as the culprit.
I keep hitting the problem as well, v0.9.1
. It's hard to reproduce reliably, especially when debugging is enabled, but I always hit the issue when toggling a tag (s).
The last seemingly relevant lines in the log are:
DEBUG:globals:flush complete
DEBUG:ui:Got key (['s'], [115])
DEBUG:ui:cmdline: 'toggletags unread'
DEBUG:ui:search command string: "toggletags unread"
DEBUG:__init__:mode:search got commandline "toggletags unread"
DEBUG:__init__:ARGS: ['toggletags', 'unread']
DEBUG:__init__:cmd parms {'flush': 'True', 'tags': 'unread', 'action': 'toggle'}
DEBUG:search:all? False
DEBUG:search:q: (tag:unread) AND thread:0000000000002773
DEBUG:manager:write-out item: ('untag', <function Thread.remove_tags.<locals>.myafterwards at 0x7fa4d15855e0>, 'thread:0000000000002773', ['unread'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Thread.remove_tags.<locals>.myafterwards at 0x7fa4d15855e0>
DEBUG:manager:called callback
DEBUG:manager:flush finished
DEBUG:manager:write-out item: ('tag', <function Thread.add_tags.<locals>.myafterwards at 0x7fa4d1585790>, 'thread:0000000000002773', [])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Thread.add_tags.<locals>.myafterwards at 0x7fa4d1585790>
DEBUG:search:remove thread from result list: thread:0000000000002773: [RSS][hacker-news][11/28] SIGGRAPH 2020 Technical Paper: N-Dimensional Rigid Body Dynamics
Let me know what I could do to narrow down the issue, or even if there's a hot fix to try out.
HTH.
So you also see an abrupt segfault of the underlying python interpreter, correct? I'd really like to know what causes this or even how to debug this. Again, I'm quite sure that it is down to the notmuch bindings...
Quoting Patrick Totzke (2020-05-08 21:36:39)
Again, I'm quite sure that it is down to the notmuch bindings...
Most likely. The memory safety of the python binding is highly suspect. In my local alot fork, this goes away by switching to notmuch2 bindings.
-- Anton Khirnov
@elenril do you have a patch of your fix I could try, please?
Quoting Frank LENORMAND (2020-05-09 07:49:30)
@elenril do you have a patch of your fix I could try, please?
My branch has diverged from master quite considerably, so testing just this fix is not possible. Aalso, I am becoming less sure of my previous comment - this might be a problem in notmuch itself, not the bindings.
As far as I can tell, the crash is an unhandled Xapian::DatabaseModifiedError, triggered by the database being modified while a query is being read.
But you can simply revert e7e0c52.
-- Anton Khirnov
Could you today your findings too the notmuch mailing list, @lenormf ? if much rather fix the bindings that reverting to this arcane work around in separate processes.
I will, if I have information not already reported here.
I see that users in this thread are involved in notmuch
's CFFI bindings. What is the state of those, are they currently in a good-enough state that they could be used instead of the regular bindings?
Could it be a better move to have a new alot
branch that uses notmuch2
bindings, which allegedly don't have the crash issue? We would only have to live with the crash for as long as it took to get notmuch2
up to par (as far as I understand, they would deprecate the old ones), and this could bolster improvements and more generally upstream contributions to notmuch2
.
Quoting Frank LENORMAND (2020-05-09 10:48:33)
Could it be a better move to have a new
alot
branch that usesnotmuch2
bindings, which allegedly don't have the crash issue? We would only have to live with the crash for as long as it took to getnotmuch2
up to par (as far as I understand, they would deprecate the old ones), and this could bolster improvements and more generally upstream contributions tonotmuch2
.
I retract my comment from above - the problem seems NOT to be in the python bindings, so changing those would not fix anything. I can actually reproduce the same crash (at least I think it's the same crash) just with the notmuch tool in the following way:
The first crash will abort. IIUC it is caused by Xapian throwing the DatabaseModifiedError exception, which is not handled by libnotmuch. The solution is fixing libnotmuch to handle it. That might require API changes, since I think there is no way for a threads or messages iterator to report and error. The caller (i.e. alot in this case) would also have to re-execute the query.
-- Anton Khirnov
I ran gdb
to try to get a more precise stacktrace, this is what I got:
#0 0x00007ffff788ace5 in raise () at /usr/lib/libc.so.6
#1 0x00007ffff7874857 in abort () at /usr/lib/libc.so.6
#2 0x00007ffff534a0ff in () at /usr/lib/libtalloc.so.2
#3 0x00007ffff534a0f7 in () at /usr/lib/libtalloc.so.2
#4 0x00007ffff54615ab in notmuch_thread_destroy () at /usr/lib/libnotmuch.so.5
#5 0x00007ffff6183a8d in () at /usr/lib/libffi.so.7
#6 0x00007ffff618301b in () at /usr/lib/libffi.so.7
[Python interpreter calls…]
#184 0x00007ffff7cd3b38 in PyRun_SimpleFileExFlags (fp=0x5555555b7240, filename=<optimized out>, closeit=1, flags=0x7fffffffe498) at Python/pythonrun.c:428
#185 0x00007ffff7e3e7a8 in pymain_run_file (cf=0x7fffffffe498, config=0x55555557de50) at Modules/main.c:381
#186 pymain_run_python (exitcode=0x7fffffffe490) at Modules/main.c:565
#187 Py_RunMain () at Modules/main.c:644
#188 0x00007ffff7e3e899 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:698
#189 0x00007ffff7876023 in __libc_start_main () at /usr/lib/libc.so.6
#190 0x000055555555507e in _start
notmuch_thread_destroy(thread)
simply calls talloc_free(thread)
, which raises SIGABRT
(corrupted memory? double free? race?).
With @elenril's post, I think there's now enough information to close this issue and handle the crash on the notmuch
side.
Quoting Frank LENORMAND (2020-05-09 12:45:09)
I ran
gdb
to try to get a more precise stacktrace, this is what I got:#0 0x00007ffff788ace5 in raise () at /usr/lib/libc.so.6 #1 0x00007ffff7874857 in abort () at /usr/lib/libc.so.6 #2 0x00007ffff534a0ff in () at /usr/lib/libtalloc.so.2 #3 0x00007ffff534a0f7 in () at /usr/lib/libtalloc.so.2 #4 0x00007ffff54615ab in notmuch_thread_destroy () at /usr/lib/libnotmuch.so.5 #5 0x00007ffff6183a8d in () at /usr/lib/libffi.so.7 #6 0x00007ffff618301b in () at /usr/lib/libffi.so.7 [Python interpreter calls…] #184 0x00007ffff7cd3b38 in PyRun_SimpleFileExFlags (fp=0x5555555b7240, filename=<optimized out>, closeit=1, flags=0x7fffffffe498) at Python/pythonrun.c:428 #185 0x00007ffff7e3e7a8 in pymain_run_file (cf=0x7fffffffe498, config=0x55555557de50) at Modules/main.c:381 #186 pymain_run_python (exitcode=0x7fffffffe490) at Modules/main.c:565 #187 Py_RunMain () at Modules/main.c:644 #188 0x00007ffff7e3e899 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:698 #189 0x00007ffff7876023 in __libc_start_main () at /usr/lib/libc.so.6 #190 0x000055555555507e in _start
notmuch_thread_destroy(thread)
simply callstalloc_free(thread)
, which raisesSIGABRT
(corrupted memory? double free? race?).With @elenril's post, I think there's now enough information to close this issue and handle the crash on the
notmuch
side.
That seems like a different problem though. The one I am seeing is triggered in xapian, whereas in your backtrace the abort() comes from libtalloc, which is used by libnotmuch itself. That actually could be caused by the bindings managing memory incorrectly.
-- Anton Khirnov
Correct, but in any case this is a notmuch
issue, and doesn't need to be handled here.
Quoting Frank LENORMAND (2020-05-09 13:14:41)
Correct, but in any case this is a
notmuch
issue, and doesn't need to be handled here.
Not necessarily. The current python bindings are quite finicky about being used in exactly the right manner, it could be that alot is not handling some detail correctly. I think that was one of the motivations for replacing them with notmuch2.
-- Anton Khirnov
Fair enough.
@elenril it doesn't look like the Xapian::DatabaseModifiedError
abortion was reported upstream, have you notified them about it?
Quoting Frank LENORMAND (2020-05-18 12:05:13)
@elenril it doesn't look like the
Xapian::DatabaseModifiedError
abortion was reported upstream, have you notified them about it?
Yes, I have discussed it with developers on IRC and intend to send a patch fixing it.
-- Anton Khirnov
Great to hear, thanks!
I've been running into the same problem, consistent crashes after a tag command causes a thread to be removed from a search buffer, requiring more threads to be loaded. (It doesn't happen if the complete search results are shown, so one workaround is to keep your inbox under control!)
Quoting @elenril:
That seems like a different problem though. The one I am seeing is triggered in xapian, whereas in your backtrace the abort() comes from libtalloc, which is used by libnotmuch itself. That actually could be caused by the bindings managing memory incorrectly.
For me it's also usually an abort() from libtalloc.
Quoting @lenormf:
notmuch_thread_destroy(thread) simply calls talloc_free(thread), which raises SIGABRT (corrupted memory? double free? race?).
It actually seems to be a use after free. If you set e.g. TALLOC_FREE_FILL=5
, you'll get an out of bounds read a litte earlier:
#0 strlen (str=0x505050505050505 <Address 0x505050505050505 out of bounds>)
at /usr/src/lib/libc/string/strlen.c:100
#1 0x000000080871e9d8 in z_get (ptr=0x7ffffffb53a0, size=8) at cfield.c:1346
#2 0x0000000808715078 in GetResult (restype=0x80691b440, result=0x7ffffffb53a0, checker=0x0)
at callproc.c:921
#3 0x0000000808714758 in _ctypes_callproc (pProc=0x809bcd3a0 <notmuch_thread_get_thread_id>,
argtuple=0x80da90680, flags=4353, argtypes=0x807e3db50, restype=0x80691b440, checker=0x0)
at callproc.c:1223
Which comes from the yield t.get_thread_id()
in alot's DBManager,get_threads
. I'm not sure where that thread object (or its parent threads or query) actually got freed -- it's not coming from the _destroy
finalizer calls in the notmuch python bindings. I suspect it's after the db.close()
in DBManager.flush
. And Database.close
does have the warning:
This function closes the notmuch database. From that point on every method invoked on any object ever derived from the closed database may cease to function and raise a
NotmuchError.
Quoting @elenril:
Yes, I have discussed it with developers on IRC and intend to send a patch fixing it.
Well, even if notmuch is fixed so as not to abort, it still needs to be addressed on the alot side, whether you wait for an error to come back from notmuch or proactively cancel any open queries after writing to the db. In any case it breaks the search buffer's lazy loading mechanism...
Quoting Steven Lawrance (2020-06-18 20:30:47)
I've been running into the same problem, consistent crashes after a tag command causes a thread to be removed from a search buffer, requiring more threads to be loaded. (It doesn't happen if the complete search results are shown, so one workaround is to keep your inbox under control!)
In my experience it also does not happen in the last page: So if you have a list longer than one screenful, then an option is to work your way up from the bottom.
I'm also hitting this problem very frequently running alot from master. @elenril did you manage to write a patch for notmuch? do you need any help testing it or something?
Are people still hitting this? I just installed alot from FreeBSD ports and ran into it again, and remembered I had a small patch as a workaround:
diff --git a/alot/db/manager.py b/alot/db/manager.py
index d95ea332..713d8d73 100644
--- a/alot/db/manager.py
+++ b/alot/db/manager.py
@@ -277,8 +277,9 @@ class DBManager:
if exclude_tags:
for tag in exclude_tags:
q.exclude_tag(tag)
- for t in q.search_threads():
- yield t.get_thread_id()
+ thread_ids = [t.get_thread_id() for t in q.search_threads()]
+ for t in thread_ids:
+ yield t
def query(self, querystring):
"""
This just pre-loads all the thread IDs from a search query, so the search itself can be closed. It does slow things down noticeably when there are a lot (10k+) of results but it's at least usable for me.
Yes, I still suffer from this issue. Will try your patch - thanks!
your patch works well for me, @stlawrance - Thanks!
This is magic, I don't see why this should work but hey :) Can you send a PR? Thanks! P
I no longer experience this issue with release v0.10 (as packaged for Debian).
no, I was was mistaken (forgot that failure occured when _changing a listing - e.g. deleting entry - 3 times, not simply scrolling 3 pages): Issue still exists for me with release v0.10 :-(
This is an updated patch which (slows down but) avoids crashing when deleting an item 3 times in a list longer than 1 screen:
--- manager.py.orig 2021-09-26 19:40:37.000000000 +0200
+++ manager.py 2021-09-28 10:30:43.505410082 +0200
@@ -335,10 +335,11 @@
"""
assert sort in self._sort_orders
db = Database(path=self.path, mode=Database.MODE.READ_ONLY)
- for t in db.threads(querystring,
+ thread_ids = [t.threadid for t in db.threads(querystring,
sort=self._sort_orders[sort],
- exclude_tags=self.exclude_tags):
- yield t.threadid
+ exclude_tags=self.exclude_tags)]
+ for t in thread_ids:
+ yield t
def add_message(self, path, tags=None, afterwards=None):
"""
thank you @jonassmedegaard ! I forgot this thread and actually wanted to include the fix in the new release.. Anyhow, mind sending this as PR? that way you also get some credit :)
sorry, but I find Github unethical and only ever use their issue trackers.
If more helpful for you that I provide a git fork with the patch applied then I'd be happy to do so - but elsewhere, not on Github.
If you will only credit contributions provided as a Github PR then that's your choice - otherwise I am Jonas Smedegaard dr@jones.dk - that and more also provided at https://dr.jones.dk/info.
OK no worries.
I "credit" people by posting the output of git shortlog
in the announcement mails, so there'd have to be a commit by you.
But it's no big issue if you prefer not to post commits here.
Thanks anyway,
P
Quoting Jonas Smedegaard (2021-09-28 11:03:35)
sorry, but I find Github unethical and only ever use their issue trackers.
If more helpful for you that I provide a git fork with the patch applied then I'd be happy to do so - but elsewhere, not on Github.
If you will only credit contributions provided as a Github PR then that's your choice - otherwise I am Jonas Smedegaard @.*** - that and more also provided at https://dr.jones.dk/info.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. *
Sorry I didn't get around to adapting/testing the patch for current master, I'm glad it got there in the end though! Thanks.
@pazz Thanks for applying the patch (and even mentioning me in commit message - often that's not the case when I don't serve commits on a silver plate).
It seems, however, that the patch does not work as intended (I am really sorry - I did test and it did improve things, but later not - weird!).
It seems that the later commit a3b601813 makes my alot no longer crash, however.
So long story short: please simply revert commit caf87d9 and (unless others disagree) this issue can be closed - where credit goes to Kieran Bingham
I'm not sure I can see how my commit has an effect on this specific issue, except that it may have also been an issue that you would hit if you use "thread_subject = oldest" in your config, which would have prevented you from moving to the latest master otherwise.
That now uses the new notmuch2 bindings, which I suspect might have a lot more changes that could be relevant.
please disregard my previous post: I now again experienced crash after 3 times tagging a list more than a screenful in size, and re-applying patch caf87d9 made alot work again.
Perhaps the scenario triggered the crash is not simply "list of more than a screenful" but involves other factors as well.
As I said above in this thread - the problem is in libnotmuch, it cannot be fixed in alot. I intend to send a patch there to handle this.
Also, in my experience it happens a lot less frequently if you upgrade your notmuch database to glass.
Quoting elenril (2021-11-22 07:22:48)
Also, in my experience it happens a lot less frequently if you upgrade your notmuch database to glass.
I am not aware of this. How do I update notmuchs db? just by running notmuch compact?
Quoting Patrick Totzke (2021-11-23 10:33:04)
I am not aware of this. How do I update notmuchs db? just by running notmuch compact?
See https://xapian.org/docs/admin_notes.html#converting-a-chert-database-to-a-glass-database
But I hear dump-recreate-restore may be faster.
-- Anton Khirnov
Quoting elenril (2021-11-23 09:42:51)
Quoting Patrick Totzke (2021-11-23 10:33:04)
I am not aware of this. How do I update notmuchs db? just by running notmuch compact?
See https://xapian.org/docs/admin_notes.html# converting-a-chert-database-to-a-glass-database
Thanks. Debians xapian-examples actually only comes with the source and I cannot be bothered to find out how to compile this.
But I hear dump-recreate-restore may be faster.
That is notmuch dump?
Quoting Patrick Totzke (2021-11-23 11:13:30)
Quoting elenril (2021-11-23 09:42:51)
Quoting Patrick Totzke (2021-11-23 10:33:04)
I am not aware of this. How do I update notmuchs db? just by running notmuch compact?
See https://xapian.org/docs/admin_notes.html# converting-a-chert-database-to-a-glass-database
Thanks. Debians xapian-examples actually only comes with the source and I cannot be bothered to find out how to compile this.
the binary is shipped in xapian-tools
But I hear dump-recreate-restore may be faster.
That is notmuch dump?
Yes, like https://notmuchmail.org/howto/#index6h2
-- Anton Khirnov
coolcool, many thanks! P
Does anybody still hit this bug or now any updates? We have three different indicators/ideas how to fix this collected in this thread:
In #1647 it was suggested to revert caf87d9 as it slows down the buffer creation for search queries with many results. So the question is also: can anyone still reproduce this without caf87d9?
@elenril you proposed to send a patch upstream, how did that go?
Quoting Lucas Hoffmann (2024-09-21 08:42:58)
Does anybody still hit this bug or now any updates? We have three different indicators/ideas how to fix this collected in this thread:
- update your database to glass (https://github.com/pazz/alot/issues/1460#issuecomment-975199240)
- switch the python bindings from notmuch to notmuch2 which alot did in #1547 (and #1673)
These can only reduce the rate at which this happens, not eliminate it completely.
- preload threads into a list instead of using the generator, proposed by @jonassmedegaard and merge into alot in caf87d9
This would fix it, but I'd expect a high performance and responsiveness cost.
@elenril you proposed to send a patch upstream, how did that go?
It's been several years since my attempt so I don't remember the details anymore, but it turned out to be harder than I expected. I do intend to return to this issue eventually though, but can't say when.
-- Anton Khirnov
@elenril - Out of curiosity, do you have anything you can share that helps us understand what the issue is, and what'd be required to fix it (e.g., a mailing list thread)?
Describe the bug In search buffer if I run this command three times, alot aborts:
Software Versions
To Reproduce Steps to reproduce the behaviour:
:untag --no-flush inbox; untag --no-flush spam; tag killed
three timesError Log Log output below but there does not seem to be any errors in it.