omemo / gajim-omemo

Gajim plugin for OMEMO Multi-End Message and Object Encryption

[Linux] Gajim freezes when sending an encrypted message #121

Closed: jocelynthode closed this issue 8 years ago

jocelynthode commented 8 years ago

General information

Sending a message should be instantaneous (as it is with Gajim on Windows), but strangely, every time I send a message, Gajim freezes for about 2 seconds before delivering it, which makes it unusable.

It might be related to disk I/O: when sending an unencrypted message, iotop shows the write to the db is really quick, at ~0.06% I/O. When sending an encrypted message, the I/O to the OMEMO db is at 45%+.

Debug output

OMEMO didn't output anything in the logs. Here are some exceptions from the log that aren't related to OMEMO:

/usr/lib/python2.7/site-packages/OpenSSL/crypto.py:1108: UserWarning: implicit cast from 'char *' to a different pointer type: will be forbidden in the future (check that the types are as you expect; use an explicit ffi.cast() if they are correct)
  self._x509, digest, result_buffer, result_length)
/usr/lib/python2.7/site-packages/OpenSSL/rand.py:58: UserWarning: implicit cast from 'char *' to a different pointer type: will be forbidden in the future (check that the types are as you expect; use an explicit ffi.cast() if they are correct)
  result_code = _lib.RAND_bytes(result_buffer, num_bytes)
Exception in thread Thread-12:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/site-packages/gajim/common/gnupg.py", line 772, in _read_response
    result.handle_status(keyword, value)
  File "/usr/lib/python2.7/site-packages/gajim/common/gnupg.py", line 292, in handle_status
    raise ValueError("Unknown status message: %r" % key)
ValueError: Unknown status message: u'KEY_CONSIDERED'

Exception in thread Thread-31:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/site-packages/gajim/common/gnupg.py", line 772, in _read_response
    result.handle_status(keyword, value)
  File "/usr/lib/python2.7/site-packages/gajim/common/gnupg.py", line 292, in handle_status
    raise ValueError("Unknown status message: %r" % key)
ValueError: Unknown status message: u'KEY_CONSIDERED'
simonbru commented 8 years ago

There are many writes to ~/.local/share/gajim/omemo_thehost.db for each message sent. In particular, I think the fdatasync syscall is called 2*(number of OMEMO keys) times on this file for every message sent. I also don't notice any lag when ~/.local/share/gajim/ is on a tmpfs.
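(One way to check this is to count sync syscalls while sending a message, e.g. strace -c -f -e trace=fdatasync,fsync -p <gajim pid>; the exact invocation here is illustrative.)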

lovetox commented 8 years ago

Hm, if you are sure it's because of I/O, I'm at a loss as to what can be done about it.

You can try installing python-cryptography, which gives you native (fast) encryption, so you can rule that out.

jocelynthode commented 8 years ago

I have python-cryptography installed.

Edit: I'll try testing it on a system with ext4 to see whether btrfs might be part of the problem.

lovetox commented 8 years ago

So maybe it's specific to btrfs?!

I wouldn't call that a global Linux issue; I think it's specific to your system. The plugin has been out for 6 months, and until now nobody has reported that behaviour.

All the plugin does is call standard SQLite operations.

Maybe update SQLite? Try installing Gajim on another hard disk.

simonbru commented 8 years ago

I tried Gajim + OMEMO on multiple machines and OSes, and I found that the lag was noticeable on both Arch Linux and Ubuntu 16.04. However, it felt significantly worse on Arch Linux + btrfs + LUKS than on Ubuntu + ext4. I also tried running Gajim on a portable Win10 installation on a slow USB 2 HDD, and there the lag was barely noticeable (less than ~0.5 s). My testing methodology wasn't very consistent, so don't count too much on these results. I worry that this issue is related to our XMPP server in some way.

I'll try to profile Gajim's disk and CPU usage in the hope of finding more informative results. I may also try using OMEMO with several clients on a public XMPP server that supports the relevant XEPs.

cippaciong commented 8 years ago

JFYI, I use Gajim + OMEMO on Arch Linux on several machines and have never noticed any lag. The setups range from a low-end Atom CPU to an i5, HDD or SSD, and I use both ext4 and btrfs (the latter only on SSD, though), with LUKS encryption.

lovetox commented 8 years ago

Your XMPP server has nothing to do with it; the process of sending a message does not depend on a working server.

Does the problem really only occur when you send an encrypted message?

jocelynthode commented 8 years ago

Yes, the problem only occurs when sending encrypted messages.

heythatsunfair commented 8 years ago

I am experiencing the same problem.

When debug is enabled, it takes about 2 seconds until "gajim.plugin_system.omemo Finished encrypting message" is shown in the log.

lovetox commented 8 years ago

If you know how to do this, you could run Gajim under cProfile; then you could see which method takes so long.

heythatsunfair commented 8 years ago

I don't know if I've done it correctly, but the output of python -m cProfile -s time /usr/share/gajim/src/gajim.py shows:

   83    7.381    0.089    7.381    0.089 {method 'commit' of 'sqlite3.Connection' objects}

All other methods seem to be fine.

Edit: Delay is now ~500ms, better than half an hour ago.

lovetox commented 8 years ago

  169    0.197    0.001    0.197    0.001 {method 'commit' of 'sqlite3.Connection' objects}

You see I have double your calls but need only a fraction of the time. Either something is wrong with your SQLite version, or maybe with your filesystem/hard disk.

I tested this in an Ubuntu 16.10 virtual machine with an SSD.

heythatsunfair commented 8 years ago

My version of sqlite3 is sqlite3.sqlite_version='3.11.0'. What version were you using?

My filesystem is ext4; I did a fresh install of Xubuntu just yesterday; the HDD's SMART report does not show any abnormalities; and a friend of mine had the same problem on another device with an SSD (same distribution, though). However, in his case, the problem disappeared after a while. I'll try it on Arch tomorrow to see if this is a distro-specific issue.
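For anyone wanting to compare their own setup, both versions are exposed by the sqlite3 module (a quick check, independent of Gajim):

import sqlite3
print(sqlite3.version)          # version of the sqlite3 bindings
print(sqlite3.sqlite_version)   # version of the linked SQLite library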

heythatsunfair commented 8 years ago

Sorry, I had no time yesterday; I tested it today. No problems so far on Arch. The issue seems to be distro-specific or an upstream one.

jocelynthode commented 8 years ago

@heythatsunfair Probably not distro-specific, as I'm using Arch too. My SQLite version: 3.15.0 2016-10-14 10:20:30 707875582fcba352b4906a595ad89198d84711d8

jocelynthode commented 8 years ago

@lovetox I get this:

   22    5.960    0.271    5.960    0.271 {method 'commit' of 'sqlite3.Connection' objects}

It is at the top of the list when sorted by internal time.

lovetox commented 8 years ago

Please test this with cProfile:

import sqlite3
import os
from os.path import expanduser

# create a test db in the home directory
home = expanduser("~")
db_path = os.path.join(home, 'test.db')
print(db_path)

conn = sqlite3.connect(db_path, check_same_thread=False)
conn.text_factory = bytes

create_tables = '''
                CREATE TABLE IF NOT EXISTS identities (
                    _id INTEGER PRIMARY KEY AUTOINCREMENT, recipient_id TEXT,
                    registration_id INTEGER, public_key BLOB, private_key BLOB,
                    next_prekey_id INTEGER, timestamp INTEGER, trust INTEGER,
                    shown INTEGER DEFAULT 0);
                '''

create_db_sql = """
    BEGIN TRANSACTION;
    %s
    PRAGMA user_version=0;
    END TRANSACTION;
    """ % (create_tables)
conn.executescript(create_db_sql)

def commit_row(string, int_):
    # one INSERT followed immediately by its own commit
    q = """INSERT OR REPLACE INTO identities (recipient_id, registration_id)
           VALUES (?, ?) """
    c = conn.cursor()
    c.execute(q, (string, int_))
    conn.commit()

# 100 insert-and-commit cycles
for x in range(100):
    commit_row('teststring' + str(x), x)

I get the following result

with Ubuntu 16.10 in a VM:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.067    0.001    0.067    0.001 {method 'commit' of 'sqlite3.Connection' objects}
jocelynthode commented 8 years ago

I ran python -m cProfile -s time test.py and python2.7 -m cProfile -s time test.py

With python 3.5.2 and sqlite 3.15.0:

100   39.543    0.395   39.543    0.395 {method 'commit' of 'sqlite3.Connection' objects}

With python 2.7.12 and sqlite 3.15.0:

100   36.258    0.363   36.258    0.363 {method 'commit' of 'sqlite3.Connection' objects}
lovetox commented 8 years ago

OK, so I'm at a loss here. You see, the Python script has nothing to do with encryption; it's just 100 commits of a string and an int into an SQLite db.

So there has to be something wrong; committing 100 rows to a db should never take 39 seconds. I guess report a bug with your distribution?
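For context: with the default synchronous=FULL, every commit() forces at least one sync to disk, so the loop in the test script pays that cost 100 times. For comparison, a variant that batches all inserts into a single transaction (a sketch reusing conn and the query from the script above; not what the plugin does) syncs only once:

# sketch: the same 100 inserts, but a single transaction and a single commit
q = """INSERT OR REPLACE INTO identities (recipient_id, registration_id)
       VALUES (?, ?) """
c = conn.cursor()
for x in range(100):
    c.execute(q, ('teststring' + str(x), x))
conn.commit()  # one sync instead of 100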

jocelynthode commented 8 years ago

Maybe it is due to some SQLite configuration difference between our systems? I tried running the script with write-ahead logging (WAL) enabled to see if it changed anything for me, and with that I'm down to:

100    8.372    0.084    8.372    0.084 {method 'commit' of 'sqlite3.Connection' objects}

This is a big improvement, although it is still significantly slower than your numbers.

http://www.sqlite.org/wal.html

Anyway, I'll try investigating further.
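For reference, enabling WAL in the test script only needs one extra line after the connect call (a sketch, assuming the same conn object as in the script above):

conn.execute("PRAGMA journal_mode=WAL;")  # switch the db to write-ahead logging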

jocelynthode commented 8 years ago

Update: I also set PRAGMA synchronous = NORMAL and I get the following results:

100    0.005    0.000    0.005    0.000 {method 'commit' of 'sqlite3.Connection' objects}

Edit: I wonder if some PRAGMAs are disabled when OMEMO writes to its db, because normal messages cause no trouble.

lovetox commented 8 years ago

I think you have to set that anew on every connection to the DB; the default should be FULL.

What is your default? On Ubuntu it's FULL.

You basically reduced the writes to your disk, which doesn't really answer the question of why it writes so slowly.
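If you want to check the default on your system, SQLite reports the current value (0 = OFF, 1 = NORMAL, 2 = FULL), e.g.:

import sqlite3
conn = sqlite3.connect('test.db')  # any database file works for this check
print(conn.execute("PRAGMA synchronous;").fetchone()[0])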

lovetox commented 8 years ago

OK, I found out: I just tried this with my non-SSD disk and it took over 10 seconds.

jocelynthode commented 8 years ago

By default it seems to be journal_mode=memory and synchronous=FULL.

And yes, I am not on an SSD.

lovetox commented 8 years ago

So I will add this to the plugin: "PRAGMA synchronous=NORMAL;" and "PRAGMA journal_mode=MEMORY;".

Could you benchmark this without WAL?

jocelynthode commented 8 years ago

With:

conn.cursor().execute("PRAGMA synchronous = NORMAL;")
conn.cursor().execute("PRAGMA journal_mode = MEMORY;")

I still get:

  100   12.043    0.120   12.043    0.120 {method 'commit' of 'sqlite3.Connection' objects}

With WAL it gets really fast.

lovetox commented 8 years ago

There are other considerations besides speed. WAL is fast because it doesn't write to the database file as often, but that also means we could lose data in a crash or corrupt the database.

I will add:

"PRAGMA synchronous=NORMAL;"
"PRAGMA journal_mode=MEMORY;"

That more than halves the writing time.

BTW, "PRAGMA journal_mode=WAL;" is a persistent setting, so you only have to execute it once on your DB and it will work in WAL mode from then on. That, combined with the synchronous=NORMAL I will add, should get you what you want :)

lovetox commented 8 years ago

Also, the initial 40 seconds is huge; is your hard disk old, or particularly slow? I ran this on my 3 TB WD hard disk, and with default settings it took 10 seconds.

jocelynthode commented 8 years ago

Thanks! It might be nice to have a warning somewhere in case other people also experience the slowness.

Yes, my hard disk is 3+ years old.

lovetox commented 8 years ago

https://github.com/omemo/gajim-omemo/commit/a083f40aa391b7ba9566a71f721ec181ba145ee7

Thanks for working this out!

jocelynthode commented 8 years ago

I just upgraded and everything's working fine!