processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Database failure for Compliance Suite Test XEP-0384 #2726

Closed arendtio closed 5 years ago

arendtio commented 5 years ago

While trying to get my server to pass the XMPP compliance suite I have come to a point where I am not sure anymore where the problem originates from (a bug in ejabberd, a bug in the compliance suite or a misconfiguration on my side). Therefore, I am posting this bug report, without being 100% sure it is a bug within this project. If you think this is the wrong place for such an issue please say so.

Let's start with the setup:

I tried running the compliance suite via the web service as well as via the command line, with a fresh user I created for that purpose. In both cases, the test for 'XEP-0384 OMEMO Encryption' fails (most other tests pass).

AFAIK, the relevant configuration part looks just fine:

  mod_caps: {}
...
  mod_pubsub:
    access_createnode: pubsub_createnode
    ignore_pep_from_offline: false
    last_item_cache: false
    max_items_node: 1000
    default_node_config:
      max_items: 1000
    plugins:
      - "pep"
      - "flat"
    force_node_config:
      ## Change from "whitelist" to "open" to enable OMEMO support
      ## See https://github.com/processone/ejabberd/issues/2425
      "eu.siacs.conversations.axolotl.*":
        access_model: open
      ## Avoid buggy clients to make their bookmarks public
      "storage:bookmarks":
        access_model: whitelist
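
As a rough illustration of which nodes the two force_node_config entries above would cover, here is a minimal Python sketch. It assumes the keys behave like shell-style glob patterns and uses Python's fnmatch as a stand-in; ejabberd's actual matching code is not reproduced here, and the helper name forced_config is hypothetical. The OMEMO node names in the comment are the ones used by the legacy eu.siacs.conversations.axolotl namespace.

```python
from fnmatch import fnmatchcase

# Mirrors the force_node_config section quoted above.
FORCE_NODE_CONFIG = {
    "eu.siacs.conversations.axolotl.*": {"access_model": "open"},
    "storage:bookmarks": {"access_model": "whitelist"},
}


def forced_config(node):
    """Return the forced options for a node name, or None if no pattern matches.

    Assumption: patterns are treated as case-sensitive shell-style globs.
    """
    for pattern, options in FORCE_NODE_CONFIG.items():
        if fnmatchcase(node, pattern):
            return options
    return None


# Typical node names: the OMEMO device list, one OMEMO bundle, bookmarks.
for name in (
    "eu.siacs.conversations.axolotl.devicelist",
    "eu.siacs.conversations.axolotl.bundles:12345",
    "storage:bookmarks",
):
    print(name, forced_config(name))
```

Under that assumption, both OMEMO nodes fall under the "open" entry, so the pattern itself does not look like the source of the failure.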

To dig a little deeper into the problem, I removed all other tests from the compliance suite and saved the ejabberd.log that was generated while I ran the test. From my perspective, it looks like the relevant error message is this:

2018-12-15 12:08:39.934 [error] <0.598.0>@mod_pubsub:do_transaction:3643 transaction return internal error: {aborted,badarg}
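
To pull entries like this one out of a large ejabberd.log, a small filter script can help. The following is a minimal Python sketch that assumes every line of interest has the same shape as the line above (timestamp, level, pid, then module:function:line). Lines with a different layout are skipped, and the function names parse_line and errors_from are made up for this example.

```python
import re

# Matches lines shaped like the mod_pubsub error shown above.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] "
    r"(?P<pid><[\d.]+>)@(?P<module>\w+):(?P<function>\w+):(?P<line>\d+) "
    r"(?P<message>.*)$"
)


def parse_line(text):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(text.rstrip("\n"))
    if match is None:
        return None
    entry = match.groupdict()
    entry["line"] = int(entry["line"])
    return entry


def errors_from(lines, module):
    """Yield parsed [error] entries that were logged by the given module."""
    for text in lines:
        entry = parse_line(text)
        if entry and entry["level"] == "error" and entry["module"] == module:
            yield entry


sample = (
    "2018-12-15 12:08:39.934 [error] <0.598.0>@mod_pubsub:do_transaction:3643 "
    "transaction return internal error: {aborted,badarg}"
)
for entry in errors_from([sample], "mod_pubsub"):
    print(entry["timestamp"], entry["function"], entry["message"])
```

For a real log, the list passed to errors_from can be replaced with an open file handle for ejabberd.log.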

The compliance suite, on the other end, throws an exception when it executes the node.delete() line of the test:

java.util.concurrent.ExecutionException: rocks.xmpp.core.stanza.model.StanzaErrorException: internal-server-error  -  (type 'wait': retry after waiting (the error is temporary))
        Database failure
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
        at rocks.xmpp.util.concurrent.AsyncResult.get(AsyncResult.java:286)
        at im.conversations.compliance.xmpp.tests.OMEMO.run(OMEMO.java:83)
        at im.conversations.compliance.xmpp.TestExecutor.executeTestsFor(TestExecutor.java:52)
        at im.conversations.compliance.CommandLineLauncher.main(CommandLineLauncher.java:73)

(The line number in OMEMO.java doesn't match because I added some System.out.println calls for debugging purposes.)

While looking for a solution I found #1531, which describes a similar issue but seems to be related to pgsql, which is not used in this setup.

Does someone know what might cause those database failures?


Server configuration and the server.log (log-level 5):

ejabberd.log ejabberd.yml

pixel1138 commented 5 years ago

My config file is different and doesn't mention switching the access model to "open". I did what the referenced issue #2425 says to do, which is to comment out these two lines:

      "eu.siacs.conversations.axolotl.*":
        access_model: whitelist

After that, I've not had any problems with OMEMO or the compliance tester.

arendtio commented 5 years ago

@pixel1138 I tried commenting out those lines too, without noticing any difference. I guess commenting out the lines was the old solution and setting it to 'open' is the new recommendation, according to two commits by @licaon-kter. But as I said, I am just guessing.

@pixel1138 so you are running 18.12 with Mnesia too?

pixel1138 commented 5 years ago

@arendtio Oh, well. It was worth a shot, at least. Also, thanks for pointing out the relevant places where it was changed to recommend "access_model: open" instead of commenting out the lines. I'll make that update to my config.

@arendtio I'm running 18.12, but I'm using MySQL.

licaon-kter commented 5 years ago

@arendtio Open an issue here: https://github.com/iNPUTmice/caas/issues

@pixel1138 "open" vs. commented out is a subtle difference; going forward, "open" is the way to go. Also, as my commits mention, many failed to understand what the correct setup looks like to "support OMEMO".

arendtio commented 5 years ago

@licaon-kter as you suggested, I opened an issue at the compliance tracker but @iNPUTmice responded that the issue is not caused by the tester.

So it seems I am back to asking what causes the issue. Is there any way to find out what kind of 'Database failure' this is?

zinid commented 5 years ago

2018-12-15 12:08:39.934 [error] <0.598.0>@mod_pubsub:do_transaction:3643 transaction return internal error: {aborted,badarg}

@cromain can you please improve the code to avoid such meaningless error reports?

arendtio commented 5 years ago

Just to keep the issue up-to-date: The issue remains after upgrading to 18.12.1

To explore the possibility of a configuration error, I set up a second server with a very similar configuration (same Linux distribution, same package, different domain, fresh installation) but that new server doesn't have the issue and reaches 100% in the compliance test suite.

So my guess is that indeed the database of the original server has some kind of error in it. Any suggestions on how to find such an error? I am quite fluent with SQL, but don't know any Erlang and have no idea how to access Mnesia databases (yet).

arendtio commented 5 years ago

Ultimately, I solved the issue by deleting the Mnesia pubsub file. This caused a bit of trouble with OMEMO, but a few days later everything looks fine. I still don't know what caused the problem, but since I can't reproduce it anymore, I will close the issue now.

One error message I came across when I tried to export with the broken pubsub file was the following:

[error] <0.29831.1>@ejd2sql:export:94 Failed export for module pubsub_db and table pubsub_node: function_clause

Maybe that is a hint about what caused the breakage.

cromain commented 5 years ago

Reopening this issue to avoid the meaningless error log in this case.

cromain commented 5 years ago

@zinid badarg was caused by inconsistency in ejabberd_sql. I guess this one is resolved by your commit d411e68a2

zinid commented 5 years ago

Yes, I also improved error handling/reporting in recent commits. I think this issue can be closed.