processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Ejabberd Network throughput increased, when no cache is used #3920

Closed av-uc closed 1 year ago

av-uc commented 2 years ago


Environment

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml


### The parameters used in this configuration file are explained at
###
###       https://docs.ejabberd.im/admin/configuration
###
### The configuration file is written in YAML.
### *******************************************************
### *******           !!! WARNING !!!               *******
### *******     YAML IS INDENTATION SENSITIVE       *******
### ******* MAKE SURE YOU INDENT SECTIONS CORRECTLY *******
### *******************************************************
### Refer to http://en.wikipedia.org/wiki/YAML for the brief description.
###
hosts:
  - dev-pigeon.domain.com
loglevel: debug
log_rotate_count: 10
use_cache: false
default_ram_db: sql

oom_watermark: 50
listen:
  -
    port: 5281
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /api: mod_http_api
  -
    port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /bosh: mod_bosh
      /websockets: ejabberd_http_ws
      /admin: ejabberd_web_admin

acl:
  admin:
    user:
      - "admin": "dev-pigeon.domain.com"
      - "bot_admin": "dev-pigeon.domain.com"
    ip: "0.0.0.0"
  local:
    user_regexp: ""
  loopback:
    ip:
      - "127.0.0.1/8"
      - "::1"
access:
  configure:
    admin: allow
access_rules:
  local:
    allow: local
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: loopback
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback
api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      - access:
          - allow:
            - acl: admin
      - oauth:
        - scope: "ejabberd:admin"
        - access:
          - allow:
              - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who: all
    what:
      - "status"
      - "register"
      - "connected_users_number"
      - "create_room"
      - "set_room_affiliation"
      - "subscribe_room"
      - "unregister"
      - "unsubscribe_room"
      - "create_room_with_opts"
      - "send_stanza"
      - "get_presence"
shaper:
  normal: 1000
  fast: 50000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    100: all
modules:
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce
  mod_avatar: {}
  mod_blocking: {}
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  mod_fail2ban: {}
  mod_http_api: {}
  mod_http_upload:
    put_url: https://@HOST@:5443/upload
  mod_last: {}
  mod_mam:
    ## Mnesia is limited to 2GB, better to use an SQL backend
    ## For small servers SQLite is a good fit and is very easy
    ## to configure. Uncomment this when you have SQL configured:
    db_type: sql
    assume_mam_usage: true
    default: always
  mod_muc:
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: muc_create
    access_mam:
      - allow
    default_room_options:
      mam: true
      persistent: true
      allow_subscription: true
    hibernation_timeout: 3600
    db_type: sql
    preload_rooms: false
  mod_muc_admin: {}
  mod_offline:
    store_empty_body: unless_chat_state
    store_groupchat: true
    access_max_user_messages: max_user_offline_messages
  mod_ping:
    send_pings: true
    ping_interval: 5
    ping_ack_timeout: 5
    timeout_action: kill
  mod_privacy: {}
  mod_private: {}
  mod_proxy65:
    access: local
    max_connections: 5
  mod_pubsub:
    access_createnode: pubsub_createnode
    plugins:
      - flat
      - pep
    force_node_config:
      ## Avoid buggy clients to make their bookmarks public
      storage:bookmarks:
        access_model: whitelist
  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    ## Only accept registration requests from the "trusted"
    ## network (see access_rules section above).
    ## Think twice before enabling registration from any
    ## address. See the Jabber SPAM Manifesto for details:
    ## https://github.com/ge0rg/jabber-spam-fighting-manifesto
    ip_access: trusted_network
  mod_roster:
    versioning: true
    db_type: sql
  mod_shared_roster: {}
  mod_stream_mgmt:
    resume_timeout: 60 sec
    resend_on_timeout: if_offline
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false
  #mod_cobrowser: {}
  mod_chatevents:
    auth_token: "secret"
    post_url: "url"
    confidential: true
  mod_recommendation:
    auth_token: "secret"
    post_url: "url"
    confidential: true
  mod_offline_http_post:
    auth_token: "secret"
    post_url: "url"
    confidential: true
  mod_read_markers:
    auth_token: "secret"
    post_url: "url"
    confidential: true
sql_type: mysql
sql_server: "dev-databases.domain.com"
sql_database: "ejabberd"
sql_username: "some"
sql_password: "some"
sql_port: 3306
default_db: sql
auth_method: sql
### Local Variables:
### mode: yaml
### End:

...

Errors from Monitoring

As traffic on the ejabberd instance increased, network throughput and database CPU utilisation increased as well.

[Four monitoring screenshots, taken 2022-10-21 between 1:52 PM and 1:55 PM; images not included]

Bug description

We recently went live with the above config, and it seems that setting use_cache: false has led to an increase in database CPU. Network throughput also seems to have increased.

nosnilmot commented 1 year ago

Of course if ejabberd is not caching data it is going to need to retrieve it from the DB more frequently, with a resulting increase in DB CPU and network utilization. Are you suggesting there should be no increase, that the increase is too large, or that the absolute CPU and network utilization is too high?

av-uc commented 1 year ago

@nosnilmot Yes, an increase in network throughput is expected; however, CPU utilisation has increased nearly 4-fold. I have a few questions regarding this.

Context: multiple dedicated EC2 instances, each running its own ejabberd instance with its own cache; they are not clustered together. The network is open between these servers, and they can communicate with each other.

Questions

  1. Can routing still take place for user sessions?
  2. If two users are each connected to a different server, can they chat with each other? Will the two ejabberd processes on these two EC2 instances facilitate this?

Apart from this, should ejabberd servers be run in cluster mode when they run as containers on Docker or AWS ECS? Is there a way to make them stateless and run these containers independently, so as to get fault tolerance and scalability at the infrastructure level?

It would be great if you could answer these or at least point me to some doc that can help me with these questions.

nosnilmot commented 1 year ago

> Yes, an increase in network throughput is expected; however, CPU utilisation has increased nearly 4-fold. I have a few questions regarding this.

If your users are mostly idle and the ejabberd traffic is dominated by new sessions, a 4-fold increase from disabling cache is not unexpected.
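For reference, a minimal sketch of the top-level cache options in ejabberd.yml that govern this trade-off. The values shown are illustrative assumptions, not tuned recommendations, and are not taken from the reporter's deployment; check them against the ejabberd configuration documentation for your version.

```yaml
## Illustrative values only (assumptions, not recommendations).
use_cache: true          # the posted config sets this to false
cache_size: 1000         # maximum number of items (not bytes) kept in a cache
cache_life_time: 1 hour  # how long an item may stay cached
cache_missed: true       # also cache lookups that found nothing
```

If several non-clustered instances share one database, each instance's cache is local to it, so a cached entry on one node may lag behind a change written by another. A shorter cache_life_time narrows that window at the cost of more database reads.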

>   1. Can routing still take place for user sessions?
>   2. If two users are each connected to a different server, can they chat with each other? Will the two ejabberd processes on these two EC2 instances facilitate this?

If all servers are configured as the same host(s), there will be no routing between them and users connected to different instances will not be able to chat. If the servers are configured with unique hosts then regular XMPP routing between domains will allow users of different servers to communicate (you may need the necessary s2s DNS SRV records in this case).
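A sketch of the second arrangement, assuming two hypothetical domains, a.example.com and b.example.com, one per instance. The listener block mirrors the s2s listener found in ejabberd's stock configuration; the host names and the SRV target are placeholders, not values from this thread.

```yaml
## Instance A (hypothetical host name). Instance B would use b.example.com.
hosts:
  - a.example.com

listen:
  -
    port: 5269
    ip: "::"
    module: ejabberd_s2s_in   # accepts incoming server-to-server connections

s2s_use_starttls: optional

## Matching DNS SRV record for instance A (zone-file syntax, shown as a comment):
##   _xmpp-server._tcp.a.example.com. 3600 IN SRV 0 5 5269 xmpp-a.example.com.
```

The config posted in this issue has no ejabberd_s2s_in listener, so as written its instances would not accept federated connections from each other.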

> Apart from this, should ejabberd servers be run in cluster mode when they run as containers on Docker or AWS ECS? Is there a way to make them stateless and run these containers independently, so as to get fault tolerance and scalability at the infrastructure level?

There is no one-size-fits-all correct configuration for an ejabberd deployment. Designing the architecture for your use-case is beyond the scope of this bug report, which does not appear to be a bug report anymore.

av-uc commented 1 year ago

Thank you for the input. I'll move further discussion of this to the discussion forum.