laurent22 / joplin

Joplin - the secure note taking and to-do app with synchronisation capabilities for Windows, macOS, Linux, Android and iOS.
https://joplinapp.org

Desktop: Joplin Freezing During Syncing and Decrypting On Linux Kernel 5.5+ (New Issue Tracker Report With Possible Fixes) #3114

Closed bedwardly-down closed 4 years ago

bedwardly-down commented 4 years ago

Please Read To Fix This Issue

As soon as you are able, upgrade to Kernel 5.6.13. Joplin will not work properly on the affected kernels, for the reasons listed below.

This issue has been resolved, but I'm leaving it open for the time being so that users who still need it can find it while the kernel upgrades roll out.

Bug Report Proper Starts Here

Alright, so this is a continuation of #2518. The bug report became horribly bloated and unmanageable. This is an attempt to fix that.

Who Does This Bug Affect

Users on various Linux distributions who are running Linux Kernel 5.5 or 5.6 and their updates.

To find out which kernel you are running and get other useful distribution information, please run uname -a in your terminal and paste the output here.

Arch Linux, Fedora 31+, Solus, Clear Linux, Void Linux, Debian Testing

What Happens With This Bug

Joplin's Sidebar, Notebook panel, and Notes freeze during various syncing tasks.

How to Reproduce This Bug

Perform normal tasks with Joplin and allow it to start syncing. It will eventually freeze during syncing and / or decrypting of notes. The best way to trigger this bug manually is to make some edits to a note and then click the Synchronize button a few times until you can no longer interact with the UI.

Why Does This Bug Happen

The Linux Kernel maintainers pushed a commit that changed how asynchronous input / output works, which caused this issue in Electron and a small handful of other frameworks. It doesn't look like the issue will be fixed upstream anytime soon.

What Can Be Done To Solve The Issue

About My System

[I] justa@lame-ass-host~ []> uname -a
Linux lame-ass-host 5.6.4-artix1-1 #1 SMP PREEMPT Fri, 17 Apr 2020 14:57:51 +0000 x86_64 GNU/Linux

My tests primarily use the Nextcloud sync target. I do have a davfs mount point set up, but due to the age of my system and possibly my filesystem type (btrfs), it is extremely slow and not at all ideal for Filesystem syncing.

When Did The Bug Show Up First

Late January 25th to early January 26th - credit @taw00

Current Tests

  1. Deadlock Tests - set various Max Connection Settings under Tools=>Options=>Synchronization=>Show Advanced Settings

  2. Daniel Souza's Code Fix Test - comment out specific parts of synchronizer.js, or run them only when the platform is not Linux, to see if that solves the issue

  3. Does Encryption matter for triggering this bug? - credits to @figue

  4. 5.7-RC5 test - waiting on confirmation from other users.

  5. Is bug affected by usage of Wayland?

  6. 5.6.13 with Epoll Fixes test - Switching to Filesystem sync with a locally mounted Nextcloud instance solved my issue here. 5.6.13 seems to solve the core issues.

laurent22 commented 4 years ago

When switching the web clipper on and off, what happens is that a server is started then stopped. Do you have any idea why that would unfreeze the UI?

If we could understand better what's happening maybe we could add a workaround. Like now I'm thinking we could integrate some dummy server that the sync process would start and stop at regular intervals. Don't know if that would help but could be worth a try.

bedwardly-down commented 4 years ago

In the original bug report, there was some code work early on, and one tester found that parts of the synchronizer.js code could be commented out and the issue wouldn't occur. I believe it was mainly the part that checks for changes with the server. Another user commented that the way the code was written was part of why the kernel commit caused this problem. They theorized that the synchronization service was going to "sleep" and wasn't being woken up like it was in 5.4.

There weren't enough users working on that aspect of testing and my kernel research wasn't really useful to them, so they eventually stopped and no one picked it up from there.

bedwardly-down commented 4 years ago

@hexclover what do you think of @laurent22's thoughts? Weren't you the one that found the synchronizer.js stuff originally?

hexclover commented 4 years ago

Hmmm. I just found, by using strace -f -etrace=epoll_ctl, that toggling the web clipper server generates these calls to epoll_ctl reliably:

[pid 18485] epoll_ctl(3, EPOLL_CTL_ADD, 53, {EPOLLOUT, {u32=53, u64=53}}) = 0
[pid 18485] epoll_ctl(3, EPOLL_CTL_DEL, 53, 0x7ffd94c703e0) = 0
[pid 18485] epoll_ctl(3, EPOLL_CTL_ADD, 64, {EPOLLIN, {u32=64, u64=64}}) = 0
[pid 18485] epoll_ctl(3, EPOLL_CTL_DEL, 64, 0x7ffd94c713b0) = 0

The first 3 are issued by enabling and the last by disabling -- perhaps they are related to creating (destroying) the server and making it start (stop) listening on some port. It may help explain why this trick unfreezes the UI.
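To compare traces like the one above across machines, it can help to tally the calls per file descriptor. The following is a minimal sketch, assuming the strace lines have been saved as text; summarizeEpollCalls is a hypothetical helper name, not part of Joplin or strace.

```javascript
// Hypothetical helper: tallies epoll_ctl operations per file descriptor
// from the text output of `strace -etrace=epoll_ctl`.
function summarizeEpollCalls(straceOutput) {
  // Matches e.g. "epoll_ctl(3, EPOLL_CTL_ADD, 53, {EPOLLOUT, ...}) = 0"
  const pattern = /epoll_ctl\((\d+), EPOLL_CTL_(ADD|DEL|MOD), (\d+),.*\) = (-?\d+)/;
  const summary = {};
  for (const line of straceOutput.split('\n')) {
    const match = pattern.exec(line);
    if (!match) continue; // skip prompts and unrelated lines
    const op = match[2];
    const fd = match[3];
    const result = Number(match[4]);
    if (!summary[fd]) summary[fd] = { ADD: 0, DEL: 0, MOD: 0, failed: 0 };
    summary[fd][op] += 1;
    if (result !== 0) summary[fd].failed += 1; // e.g. -1 EEXIST
  }
  return summary;
}

// Sample input: the four lines captured when toggling the web clipper.
const sample = [
  '[pid 18485] epoll_ctl(3, EPOLL_CTL_ADD, 53, {EPOLLOUT, {u32=53, u64=53}}) = 0',
  '[pid 18485] epoll_ctl(3, EPOLL_CTL_DEL, 53, 0x7ffd94c703e0) = 0',
  '[pid 18485] epoll_ctl(3, EPOLL_CTL_ADD, 64, {EPOLLIN, {u32=64, u64=64}}) = 0',
  '[pid 18485] epoll_ctl(3, EPOLL_CTL_DEL, 64, 0x7ffd94c713b0) = 0',
].join('\n');

console.log(summarizeEpollCalls(sample));
```

A descriptor that is added but never deleted, or a run with no calls at all while a sync is in progress, would be the kind of difference worth reporting.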

This still does not tell us whether Joplin/Node.js/Electron/... is to blame, though. I don't have much time to look into it.

BTW, after I managed to trigger the freeze once or twice today, I have failed to trigger it again by following the same steps. I think the lack of a reliable way to replicate the bug really adds to the difficulty in debugging.

P. S. I think I wasn't the first one to locate synchronizer.js, at least not the first one to mention it in #2518 :-)

bedwardly-down commented 4 years ago

You didn't locate it, sure. Remember, I prodded you in the right direction with it since I had previously worked with it a bit? Either way, that information is extremely valuable and thanks for getting back to me, @hexclover. I hope you're safe. ;)

Also, the difficulty in reliably triggering the bug could be down to all of the updates the kernel developers make to "fix" bugs that appear from the previous patches.

EDIT: I can definitely verify that Joplin calls EPOLL while syncing. I found the Process ID for its main thread and during syncing, I got this output:

[I] justa@lame-ass-host~ []> pidof joplin
5950 5948 5935 5915 5913 5908 5905
[I] justa@lame-ass-host~ []> sudo strace -p 5950 -etrace=epoll_ctl
[sudo] password for justa:
strace: Process 5950 attached
epoll_ctl(3, EPOLL_CTL_DEL, 56, 0x7ffc84f076e0) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 56, {EPOLLOUT, {u32=56, u64=56}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 45, 0x7ffc84f076e0) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 45, {EPOLLOUT, {u32=45, u64=45}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 56, {EPOLLIN, {u32=56, u64=56}}) = -1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_MOD, 56, {EPOLLIN, {u32=56, u64=56}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 45, {EPOLLIN, {u32=45, u64=45}}) = -1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_MOD, 45, {EPOLLIN, {u32=45, u64=45}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 56, {EPOLLIN, {u32=56, u64=56}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 45, {EPOLLIN, {u32=45, u64=45}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 49, 0x7ffc84f076e0) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 52, 0x7ffc84f076e0) = 0

If you hadn't posted your output, I wouldn't have found out how to take it a bit further.

laurent22 commented 4 years ago

Could it be a deadlock issue when accessing many files simultaneously? I wonder if the bugs happen if you set the max number of simultaneous downloads to 1?

bedwardly-down commented 4 years ago

I can check.

laurent22 commented 4 years ago

Or conversely, maybe by increasing it to 20 or more it would be possible to consistently replicate the bug, which could make fixing it easier.
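For anyone unsure what the setting governs, here is a minimal sketch of a concurrency cap, assuming the limit simply bounds how many transfers are in flight at once. It is illustrative only and is not Joplin's actual implementation; runWithLimit and fakeDownload are hypothetical names.

```javascript
// Illustrative only: runs async tasks with at most `maxConnections` in flight.
async function runWithLimit(tasks, maxConnections) {
  const results = new Array(tasks.length);
  let next = 0;
  // Each worker keeps pulling the next pending task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }
  const workerCount = Math.min(maxConnections, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Demo with fake downloads that record the peak number running at once.
let active = 0;
let peak = 0;
const fakeDownload = (id) => async () => {
  active += 1;
  peak = Math.max(peak, active);
  await new Promise((resolve) => setTimeout(resolve, 10));
  active -= 1;
  return id;
};

const demo = runWithLimit([1, 2, 3, 4, 5, 6].map(fakeDownload), 2).then((results) => {
  console.log(results, 'peak concurrency:', peak);
  return results;
});
```

With a limit of 1 the tasks run strictly one after another, which is why that value is a useful probe for a suspected deadlock between simultaneous requests.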

bedwardly-down commented 4 years ago

That's the Max Simultaneous Connections part under Synchronization, isn't it?

bedwardly-down commented 4 years ago

Setting Max Connections to 1 is showing no signs of the bug, just extra remote items created and deleted during its initial check. I created 123 duplicates of a test note with a simple image in it, fully synced, deleted them, then fully synced again. After it finished syncing, it sent 124 remote items and then deleted all of them even though it had done that in the previous step.

EDIT: @hexclover, can you test this out with me? I'm getting some interesting results. If I set Max Simultaneous Connections to either 1 or 20, the bug isn't appearing at all so far while leaving it at the default 5 is definitely showing the issue. I wonder if there's something happening with a calculation or something because of that number or the default settings are where the issue lies?

I noticed awhile back that leaving Sync settings as DropBox without changing it threw a similar issue when not putting in any credentials or doing OAUTH or whatever it uses.

bedwardly-down commented 4 years ago

@laurent22, the issue finally happened at the end of a sync of 1024 notes (with a 5 MB video file attached for good measure) with Max Connections set to 20. Using a strace like before but checking on all Joplin processes, when this bug shows up, Epoll calls are frozen until doing the Webclipper fix. It took almost 10 minutes for it to finally show up and when it did, the app completely froze for a couple of minutes. Settings still worked, so I was able to get it back up and running.

StanczakDominik commented 4 years ago

Finally, something I can actually help debug! I'll try Max Connections = 1 over the next few days and see whether the bug comes back.

bedwardly-down commented 4 years ago

> Finally, something I can actually help debug! I'll try Max Connections = 1 over the next few days and see whether the bug comes back.

Testing is fully welcome. The fact that Laurent, the lead Joplin dev here, is on this is a good thing. Let's try to get this bug tackled in some form while he has time to do something about it. :D

Also, @StanczakDominik , can I get you to run uname -a in your terminal and have that pasted here so we can keep track of what systems are affected and have been tested on? Thanks.

StanczakDominik commented 4 years ago

But of course:

dominik@dell ~ % uname -a
Linux dell 5.6.6-arch1-1 #1 SMP PREEMPT Tue, 21 Apr 2020 10:35:16 +0000 x86_64 GNU/Linux

(I believe I still haven't rebooted after updating stuff, so I should probably do that...)

StanczakDominik commented 4 years ago

It has just frozen again, with File System/Syncthing for synchronization. I tried the Web Clipper toggle flick and it worked well to restore everything to normal, as usual.

bedwardly-down commented 4 years ago

Also, did the syncing freeze on a setting of 1 for you? @StanczakDominik

StanczakDominik commented 4 years ago

Yes, still keeping it on max connections = 1. Web clipper extension active, firefox turned on.

bedwardly-down commented 4 years ago

Firefox isn't actually required here, but glad to get that information. :smile_cat:

spktkpkt commented 4 years ago

I have the same issues and a max connection of 20 solves it for me for now. I'll test and see how it turns out.

$ uname -a
Linux dellicious 5.6.6-942.native #1 SMP Tue Apr 21 03:03:21 PDT 2020 x86_64 GNU/Linux

bedwardly-down commented 4 years ago

> I have the same issues and a max connection of 20 solves it for me for now. I'll test and see how it turns out.
>
> $ uname -a
> Linux dellicious 5.6.6-942.native #1 SMP Tue Apr 21 03:03:21 PDT 2020 x86_64 GNU/Linux

Glad to hear. What distribution are you on so I can add it to the affected systems in original post?

Also, setting it to 20 drastically slowed down the bug appearing for me, but it still appeared after something like 900 items synced.

spktkpkt commented 4 years ago

> What distribution are you on

Clear Linux

> Also, setting it to 20 drastically slowed down the bug appearing for me but it still appeared after something like 900 items synced

That might be the reason, I usually don't have that much to sync at once. I only have about 180 items (Markdown files) at the moment.

m-angelov commented 4 years ago

Hey, all. Glad to see the activity on this issue (and even more that you're doing well).

I've tried to set the max connections to 20, but around the 10th sync the app froze. I used the Clipper workaround, synced again and this time it froze on the 3rd attempt. It's doing the same thing with max connections set to 1. I have 440 notes and 67 resources, most of which are images.

I'm running 5.6.4-arch1-1, so it's not an optimal test, but at the moment I'm running a thing and I'll be able to restart sometime tomorrow :) I'll update you when I try the newest kernel.

bedwardly-down commented 4 years ago

@m-angelov, glad you’re here. Can’t wait. Also, how do you like the issue tracker upgrade?

ghost commented 4 years ago

Hi, I am also facing this bug on 5.6.5-arch3-1. I have not enabled syncing and yet the UI still freezes. After reading the issue, I have set Max Connections = 1, so I will report back on how it goes and will check the logs regularly.

laurent22 commented 4 years ago

> Using a strace like before but checking on all Joplin processes, when this bug shows up, Epoll calls are frozen until doing the Webclipper fix.

Just to be clear the fix is to start, then stop the web clipper, is that right? Do you need to wait a bit before the moment you start and then stop it?

I'll add a dummy background server on Linux that will be started and stopped at regular intervals while sync is active, so just want to make sure I'll replicate the same start/stop sequence.
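The idea can be sketched roughly as follows, assuming that briefly opening and closing a listening socket produces the same kind of epoll_ctl activity as toggling the web clipper. That assumption is untested, and a timer may not fire at all if the event loop is truly stuck. pokeEventLoop and startPokingWhile are hypothetical names, not Joplin functions.

```javascript
const net = require('net');

// Hypothetical helper: opens a listening socket and closes it straight away,
// mimicking the start/stop sequence of the web clipper server.
function pokeEventLoop() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    // Port 0 asks the OS for any free port; bind to loopback only.
    server.listen(0, '127.0.0.1', () => {
      server.close((error) => (error ? reject(error) : resolve()));
    });
  });
}

// Hypothetical wiring: poke at a regular interval, but only while a sync runs.
// Returns a function that stops the interval.
function startPokingWhile(isSyncing, intervalMs = 5000) {
  const timer = setInterval(() => {
    if (isSyncing()) pokeEventLoop().catch(() => {}); // failures are non-fatal here
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for this timer
  return () => clearInterval(timer);
}
```

The interval length and the choice of loopback address are placeholders; the thread only establishes that no delay is needed between the start and the stop.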

bedwardly-down commented 4 years ago

> Just to be clear the fix is to start, then stop the web clipper, is that right? Do you need to wait a bit before the moment you start and then stop it?

This is correct. And no time needed to wait. Just enable and disable before returning back to syncing.

m-angelov commented 4 years ago

It seems that things are moving forward, but I'll still give an update after upgrading everything. Kernel is 5.6.6-arch1-1, Joplin's version is 1.0.199 AppImage. Syncing with a local dir, MaxConnections set as 1. The issue is still manifesting.

When syncing there are no EPOLL events in strace, no matter if the sync is successful, or the issue is present. The WebServer gives this output:

### when started
epoll_ctl(3, EPOLL_CTL_ADD, 67, {EPOLLOUT, {u32=67, u64=67}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 67, 0x7fff1844b680) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 67, {EPOLLIN, {u32=67, u64=67}}) = 0

### when stopped
epoll_ctl(3, EPOLL_CTL_DEL, 67, 0x7fff1844c960) = 0

Some things which I unfortunately can't replicate, but can describe:

@bedwardly-down The tracker looks great, it's very well organized and clear. Thank you for investing so much time in this!

@laurent22 Thank you for turning your attention to this issue, which while not critical is very frustrating.

Stay safe!

spktkpkt commented 4 years ago

After some days of testing I have to say that setting the maximum connections to 20 isn't the solution. I encountered several random sync/save problems. Sometimes everything seems alright, but then the GUI doesn't refresh anymore and changes are not saved. Slowly, the initial "Well ok, then I reload the WebClipper." changes into real frustration. Starting and stopping the WebClipper doesn't always solve the problem; sometimes I have to close and reopen Joplin. I use Joplin as my main note taking tool and to write How-Tos and other stuff, but for the last few weeks it has been a pain to use, because you can't be sure whether what you have written will be saved or lost.

I don't know how, but let me know if i can help.

ghost commented 4 years ago

> After some days of testing I have to say that setting the maximum connections to 20 isn't the solution. I encountered several random sync/save problems. Sometimes everything seems alright, but then the GUI doesn't refresh anymore and changes are not saved. Slowly, the initial "Well ok, then I reload the WebClipper." changes into real frustration. Starting and stopping the WebClipper doesn't always solve the problem; sometimes I have to close and reopen Joplin. I use Joplin as my main note taking tool and to write How-Tos and other stuff, but for the last few weeks it has been a pain to use, because you can't be sure whether what you have written will be saved or lost.
>
> I don't know how, but let me know if i can help.

My situation is the same as yours: max connections doesn't do anything, and the WebClipper toggle is always the solution for me after the GUI freezes and I can't write, save or navigate through Joplin's notes.

StanczakDominik commented 4 years ago

Yeah, that's how it's gone for me as well. Max connections don't really help.

Perhaps another impactful factor might be that I'm using file system sharing with Syncthing?

spktkpkt commented 4 years ago

> Perhaps another impactful factor might be that I'm using file system sharing with Syncthing?

Me too. I have Synchronization interval disabled and sync/save only with CTRL + s.

bedwardly-down commented 4 years ago

It happens to me on Nextcloud sync, so syncthing isn’t really a factor.

danisztls commented 4 years ago

With placeholders and testing I found that the syncing is hanging in this section of synchronizer.js:

const listResult = await this.api().delta('', {
    context: context,

    // allItemIdsHandler() provides a way for drivers that don't have a delta API to
    // still provide delta functionality by comparing the items they have to the items
    // the client has. Very inefficient but that's the only possible workaround.
    // It's a function so that it is only called if the driver needs these IDs. For
    // drivers with a delta functionality it's a noop.
    allItemIdsHandler: async () => {
        return BaseItem.syncedItemIds(syncTargetId);
    },

    wipeOutFailSafe: Setting.value('sync.wipeOutFailSafe'),

    logger: this.logger(),
});
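To make the code comment concrete: a driver without a native delta API can derive one by comparing the IDs the client has already synced with the IDs currently on the remote. The following is a minimal sketch of that comparison only, under the assumption that both sides can be listed as plain arrays of IDs. diffItemIds is a hypothetical name and this is not the driver code from Joplin.

```javascript
// Illustrative sketch: derive a "delta" from two full ID listings.
function diffItemIds(localSyncedIds, remoteIds) {
  const local = new Set(localSyncedIds);
  const remote = new Set(remoteIds);
  return {
    // Present on the remote but never synced locally: created remotely.
    createdRemotely: remoteIds.filter((id) => !local.has(id)),
    // Synced locally but missing from the remote: deleted remotely.
    deletedRemotely: localSyncedIds.filter((id) => !remote.has(id)),
  };
}

console.log(diffItemIds(['a', 'b', 'c'], ['b', 'c', 'd']));
// → { createdRemotely: [ 'd' ], deletedRemotely: [ 'a' ] }
```

Both listings have to be read in full on every sync, which is the inefficiency the comment mentions and means a lot of file IO for the filesystem sync target.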

Also MaxConnections = 1 was not helpful at all. I'm syncing to folder with autosync disabled and I'm consistently reproducing the issue by manually syncing.

mareksamec commented 4 years ago

I use filesystem sync and I can confirm I have the same issue. Max connections does not help:

Linux HP 5.6.7-arch1-1 #1 SMP PREEMPT Thu, 23 Apr 2020 09:13:56 +0000 x86_64 GNU/Linux

bedwardly-down commented 4 years ago

@danielsouzat , your findings match the findings of a couple other users in the old report. Are those comments your own or what’s in the code?

danisztls commented 4 years ago

@bedwardly-down what's in the code.

danisztls commented 4 years ago

Somewhat related is issue #2191. https://github.com/laurent22/joplin/issues/2191

A workaround would be to disable periodic sync, but the app will still sync every time the focus changes to a different notebook or note, regardless of the chosen setting.

The app is unusable right now in a production environment. Due to the WireGuard merge in Linux 5.6, there's a big incentive not to stay on Linux 5.4.

bedwardly-down commented 4 years ago

Wireguard could definitely be something that would go hand in hand with Joplin. I just perused its website and i can see the appeal. https://www.wireguard.com/

danisztls commented 4 years ago

As this is definitely related to an unknown upstream issue I updated almost all dependencies to their latest versions. Had to fix a couple of issues but I got it all working or at least I got rid of all errors and warnings. That alone did not solve this issue but made it more difficult to trigger. Later I found a workaround for the issue and I'm not reproducing it anymore in my development environment.

I commented out line 331 of synchronizer.js:

await this.checkSyncTargetVersion_();

Linux 5.6.8-1-MANJARO SMP PREEMPT GNU/Linux

laurent22 commented 4 years ago

Could someone confirm this fix? Does it work 100% when it's applied?

As mentioned in the PR we can't remove this line as it will be needed when the sync target structure gets upgraded, but maybe we can fix this some other way.

bedwardly-down commented 4 years ago

I’m heading to work but could try it out when i get home this evening. Also, @danielsouzat, have you attempted to see if maybe there’s a library out there that could be implemented only for the Linux client that might have this fixed?

Since this is a Linux only issue, there’s no point in taking a chance with breaking other platforms; your fix would break mobile along with Windows and Mac. Thoughts on this as an option, @laurent22 ?

danisztls commented 4 years ago

@laurent22 it worked 100% of the time while exhaustively syncing to a folder and navigating through notes and notebooks. It would still be great to know whether it also works 100% for other sync drivers.

A better thing would be to upgrade the code but that may take a while as we don't know why it isn't working. I may look into it at another time. The version.txt value can be stored in memory at startup or when the sync target changes, so all that IO can be avoided.
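The caching idea could look something like this minimal sketch, assuming the version can be read through some async function per sync target. createVersionCache and readVersion are hypothetical names and do not exist in Joplin.

```javascript
// Hypothetical cache: reads the sync target version once per target and
// reuses it until it is explicitly invalidated.
function createVersionCache(readVersion) {
  const cache = new Map();
  return {
    get(syncTargetId) {
      if (!cache.has(syncTargetId)) {
        // Store the promise so that concurrent callers share a single read.
        const pending = Promise.resolve(readVersion(syncTargetId)).catch((error) => {
          cache.delete(syncTargetId); // do not cache failures
          throw error;
        });
        cache.set(syncTargetId, pending);
      }
      return cache.get(syncTargetId);
    },
    // Call this when the sync target changes.
    invalidate(syncTargetId) {
      cache.delete(syncTargetId);
    },
  };
}

// Demo: count how often the (fake) version file is actually read.
let reads = 0;
const versions = createVersionCache(async () => {
  reads += 1;
  return 1;
});

const cacheDemo = (async () => {
  await versions.get(2); // first read
  await versions.get(2); // served from memory
  versions.invalidate(2);
  await versions.get(2); // read again after invalidation
  return reads;
})();
```

This only reduces how often the check touches the sync target; it does not explain why the original call hangs.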

@bedwardly-down I limited the workaround for linux platform with an if statement. And can limit it further with 'uname -r' so it will only target linux > 5.5.
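A platform and kernel gate of that kind might be sketched as below, using Node's os.release() in place of shelling out to uname -r. The affected range of 5.5 up to but not including 5.6.13 is an assumption taken from the opening post of this thread, and isAffectedKernel and parseKernelRelease are hypothetical names.

```javascript
const os = require('os');

// Parses "5.6.8-1-MANJARO" into [5, 6, 8]; a missing patch level counts as 0.
function parseKernelRelease(release) {
  const match = /^(\d+)\.(\d+)(?:\.(\d+))?/.exec(release);
  if (!match) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3] || 0)];
}

// Assumption: affected kernels are Linux 5.5.x and 5.6.x below 5.6.13.
function isAffectedKernel(platform, release) {
  if (platform !== 'linux') return false;
  const version = parseKernelRelease(release);
  if (!version) return false;
  const [major, minor, patch] = version;
  if (major !== 5) return false;
  if (minor === 5) return true;
  return minor === 6 && patch < 13;
}

console.log(isAffectedKernel(process.platform, os.release()));
```

A workaround guarded by this check would leave Windows, macOS and mobile untouched, which addresses the concern about breaking other platforms.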

bedwardly-down commented 4 years ago

@danielsouzat, I really should have checked the actual commit. I didn't read the changes made, just the comments. Again, maybe there's a way to work around it with another library that only loads when Linux is the platform and kind of acts as a band-aid.

rebelC0der commented 4 years ago

Same issue here on: 5.6.8-1-MANJARO and Debian 10 Testing

Joplin is unusable on both. After starting Joplin, the first few syncs work just fine, but on the 4th or 5th sync it gets stuck in a constant sync and shows that it is trying to sync one note. You can still navigate notes but can't edit them. This issue has been happening for the last 2-3 releases.

Web Clipper is disabled. Full reinstall (with cache and old files cleaning) does not help.

figue commented 4 years ago

> As this is definitely related to an unknown upstream issue I updated almost all dependencies to their latest versions. Had to fix a couple of issues but I got it all working or at least I got rid of all errors and warnings. That alone did not solve this issue but made it more difficult to trigger. Later I found a workaround for the issue and I'm not reproducing it anymore in my development environment.
>
> I commented out line 331 of synchronizer.js:
>
> await this.checkSyncTargetVersion_();
>
> Linux 5.6.8-1-MANJARO SMP PREEMPT GNU/Linux

The workaround doesn't work for me (Archlinux, fully upgraded and clean compilation with patch).

m-angelov commented 4 years ago

Commenting out this line doesn't work for me either. Kernel 5.6.8-arch1-1, Joplin 1.0.204

Some (probably) unrelated things:

bedwardly-down commented 4 years ago

On the 1.0.199 JEX issue, I believe that was a known bug across all desktop platforms.

I haven't tested the workaround listed by @danielsouzat yet but it sounds like it's not viable either.

bedwardly-down commented 4 years ago

I can verify it doesn't work with Nextcloud sync. In fact, the app doesn't sync at all or attempt to do anything useful on my end.

figue commented 4 years ago

On the Linux laptop where I was testing the workaround, I use filesystem synchronization, but the directory is in a Nextcloud folder. Is this related?

m-angelov commented 4 years ago

I looked over my notes from the previous tests, and a good place to check is: https://github.com/laurent22/joplin/issues/2518#issuecomment-590805717 and the following comments by me and others.

I commented out the whole DELTA part in the synchronizer.js, as this was solving the issue last time. I did ~40 syncs successfully, and then about 30 minutes later while editing a note, it got stuck. So I'm at a loss.

/rant - I started looking into Org-mode, the desperation is setting in :D