qibinghua commented 9 years ago

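Two DB servers connected over an internal network: the master uses 55 GB, while the slave uses 500 GB. The slave has started refusing connections.

The logs look completely normal. Logging is at info level, and the last entries are as follows: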

-----------------------
2014-12-03 04:28:54.837 [INFO ] slave.cpp(238): sync_count: 400465001, last_seq: 753993012, seq: 753993013
2014-12-03 04:29:23.063 [INFO ] slave.cpp(238): sync_count: 400466001, last_seq: 753994012, seq: 753994013
2014-12-03 04:29:48.592 [INFO ] slave.cpp(238): sync_count: 400467001, last_seq: 753995012, seq: 753995013
2014-12-03 04:30:14.119 [INFO ] slave.cpp(238): sync_count: 400468001, last_seq: 753996012, seq: 753996013
2014-12-03 04:30:33.338 [INFO ] slave.cpp(238): sync_count: 400469001, last_seq: 753997012, seq: 753997013
2014-12-03 04:31:02.166 [INFO ] slave.cpp(238): sync_count: 400470001, last_seq: 753998012, seq: 753998013
2014-12-03 04:31:40.642 [INFO ] slave.cpp(238): sync_count: 400471001, last_seq: 753999012, seq: 753999013
2014-12-03 04:32:11.840 [INFO ] slave.cpp(238): sync_count: 400472001, last_seq: 754000012, seq: 754000013
2014-12-03 04:32:26.860 [INFO ] slave.cpp(238): sync_count: 400473001, last_seq: 754001012, seq: 754001013
2014-12-03 04:32:49.484 [INFO ] slave.cpp(238): sync_count: 400474001, last_seq: 754002012, seq: 754002013
2014-12-03 04:33:16.416 [INFO ] slave.cpp(238): sync_count: 400475001, last_seq: 754003012, seq: 754003013
2014-12-03 04:33:30.921 [INFO ] ssdb-server.cpp(211): ssdb working, links: 0
2014-12-03 04:33:49.154 [INFO ] slave.cpp(238): sync_count: 400476001, last_seq: 754004012, seq: 754004013
2014-12-03 04:34:17.087 [INFO ] slave.cpp(238): sync_count: 400477001, last_seq: 754005012, seq: 754005013
2014-12-03 04:34:33.011 [INFO ] slave.cpp(238): sync_count: 400478001, last_seq: 754006012, seq: 754006013
2014-12-03 04:34:56.736 [INFO ] slave.cpp(238): sync_count: 400479001, last_seq: 754007012, seq: 754007013
2014-12-03 04:35:23.755 [INFO ] slave.cpp(238): sync_count: 400480001, last_seq: 754008012, seq: 754008013
2014-12-03 04:35:51.115 [INFO ] slave.cpp(238): sync_count: 400481001, last_seq: 754009012, seq: 754009013
2014-12-03 04:36:13.660 [INFO ] slave.cpp(238): sync_count: 400482001, last_seq: 754010012, seq: 754010013
2014-12-03 04:36:33.440 [INFO ] slave.cpp(238): sync_count: 400483001, last_seq: 754011012, seq: 754011013
2014-12-03 04:36:56.267 [INFO ] slave.cpp(238): sync_count: 400484001, last_seq: 754012012, seq: 754012013
--------------------

A colleague has run into the same situation, and also found no ERROR entries.
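The log excerpt above can be used to estimate how fast the slave was applying entries. Below is a minimal Python sketch that parses `slave.cpp` sync lines and computes an average rate. It assumes that `sync_count` counts entries applied by the slave, which the thread does not confirm; the helper names `parse_sync_lines` and `sync_rate` are hypothetical, and the sample uses the first and last sync lines from the excerpt.

```python
import re
from datetime import datetime

# Matches the slave.cpp sync lines shown in the log excerpt.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[INFO \] "
    r"slave\.cpp\(\d+\): sync_count: (\d+), last_seq: (\d+), seq: (\d+)$"
)

def parse_sync_lines(text):
    """Return (timestamp, sync_count, seq) tuples for each sync line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            rows.append((ts, int(m.group(2)), int(m.group(4))))
    return rows

def sync_rate(rows):
    """Average sync_count increase per second between the first and last row."""
    if len(rows) < 2:
        return 0.0
    elapsed = (rows[-1][0] - rows[0][0]).total_seconds()
    return (rows[-1][1] - rows[0][1]) / elapsed if elapsed > 0 else 0.0

SAMPLE = """\
2014-12-03 04:28:54.837 [INFO ] slave.cpp(238): sync_count: 400465001, last_seq: 753993012, seq: 753993013
2014-12-03 04:33:30.921 [INFO ] ssdb-server.cpp(211): ssdb working, links: 0
2014-12-03 04:36:56.267 [INFO ] slave.cpp(238): sync_count: 400484001, last_seq: 754012012, seq: 754012013
"""

rows = parse_sync_lines(SAMPLE)
print(f"{len(rows)} sync lines, {sync_rate(rows):.1f} entries/sec")
```

Comparing a rate like this against the master's expected write rate is one way to check whether the slave is applying more than the master is receiving.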

ideawu commented 9 years ago

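Please paste your config file, along with every piece of information you can think of: what you did, what you saw, what you suspect. Don't hold anything back.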

zhangshuao commented 9 years ago

The config file is as follows:

# cat ssdb.conf
# ssdb-server config
# MUST indent by TAB!

# relative to path of this file, directory must exists
work_dir = ./var
pidfile = ./var/ssdb.pid

server:
        ip: 1.1.1.1
        port: 8888
        # bind to public ip
        #ip: 0.0.0.0
        # format: allow|deny: all|ip_prefix
        # multiple allows or denys is supported
        #deny: all
        #allow: 127.0.0.1
        #allow: 192.168

replication:
        slaveof:
                # to identify a master even if it moved(ip, port changed)
                # if set to empty or not defined, ip:port will be used.
                id: svc_2
                # sync|mirror, default is sync
                type: mirror
                ip: 2.2.2.2
                port: 8888

logger:
        level: info
        output: log.txt
        rotate:
                size: 1000000000

leveldb:
        # in MB
        cache_size: 500
        # in KB
        block_size: 32
        # in MB
        write_buffer_size: 64
        # in MB
        compaction_speed: 1000
        # yes|no
        compression: no

The two nodes are master and slave of each other, in mirror mode.

ideawu commented 9 years ago

That is not enough information. Please provide more.

ideawu commented 9 years ago

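Please provide the following information:

The topology of the "system"
The master's ssdb.conf
The slave's ssdb.conf
The output of the info command, run against both the master and the slave
When did you notice the problem? What was the situation before that?
What did you do before noticing the problem?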

qibinghua commented 9 years ago

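Environment

The whole DB currently runs on Alibaba Cloud servers. The spec is fairly high: 8 cores and 8 GB of RAM, with ephemeral disks. The Alibaba Cloud network occasionally jitters during peak hours.

Architecture changes

The original setup was dual-master, with A1 and A2. Because A1 and A2 were low-spec and their cloud disks had poor I/O, we brought up two new machines, M1 and S1.

We first synced A1's data to M1, then decommissioned A1. Next we set up M1 and A2 as dual masters, with S1 as a slave. Later we decided A2 was unnecessary and decommissioned it as well. After that, M1 kept logging errors about failing to connect to A2, but the service stayed healthy.

M1 and S1 also ran normally the whole time, and S1 received synced data without problems.

After that, a scheduled script on a third machine began taking dump backups from this slave, starting at 3 a.m. every day.

Along the way, SSDB was upgraded from 1.6.6 to 1.6.8.

Current architecture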

    WEB-------write------>M1-----------sync------->S1
                                                   |
    WEB<------read---------------------------------

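Workload

Mainly chat message history (hset), contact lists (zset), and records of viewed posts (zset). The master sees an estimated 1 to 2 million writes per day in total.

Problem description

S1 had no monitoring, so it is unknown when the disk filled up. According to the log, entries stop shortly after 04:00 on 2014-12-03, and S1 began refusing connections. Writes to M1 were normal at that time.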
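Since the missing monitoring on S1 is what left the disk-full time unknown, here is a minimal Python sketch of a disk usage check that could run from cron. The function names and the 80% threshold are assumptions for illustration, not something from this thread; `/data1/ssdb` in the usage note is the work_dir from the configs posted here.

```python
import shutil

def disk_usage_percent(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path, threshold_percent=80.0):
    """Return (alert, percent_used); alert is True at or above the threshold."""
    pct = disk_usage_percent(path)
    return pct >= threshold_percent, pct

# Example run against the current directory; point it at the SSDB data
# directory on a real slave.
alert, pct = check_disk(".")
print(f"{'ALERT' if alert else 'ok'}: {pct:.1f}% used")
```

On S1 the call would be `check_disk("/data1/ssdb")`, with the alert wired to whatever notification channel is available.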

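Actions before the problem was discovered

After the earlier upgrade to 1.6.8, nothing was changed in SSDB itself or in its configuration, and the service kept running as steadily as before.

Actions after the problem occurred

We switched S1's read traffic to M1 via local hosts entries, copied S1's log file out, deleted the entire data directory, changed compaction_speed in S1's config file, and restarted S1. After the restart, data sync is normal.

Output of the info command on the slave (after restarting S1)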

version
    1.6.8.8
links
    1
total_calls
    300148
key_range.kv
    "test" - "test20140722"
key_range.hash
    "bl:u:1000030" - "wbsm:u"
key_range.zset
    "bl:m:1000007" - "sv:u:999"
key_range.list
    "" - ""
leveldb.stats
                                   Compactions
    Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
    --------------------------------------------------
      0        0        0       720        0     40269
      1        5      136      2159    93786     92960
      2       51     1578     10483   400010    391177
      3      491    15994      1309    50801     45835
      4      251     7937         0        0         0

17 result(s) (0.001 sec)

Slave config file (S1)

# ssdb-server config
# MUST indent by TAB!

# relative to path of this file, directory must exists
work_dir = /data1/ssdb
pidfile = /data1/ssdb/ssdb.pid

server:
        ip: 0.0.0.0
        port: 8001
        # bind to public ip
        #ip: 0.0.0.0
        # format: allow|deny: all|ip_prefix
        # multiple allows or denys is supported
        deny: all
        allow: 127.0.0.1
        allow: 10.

replication:
        binlog: yes
        # Limit sync speed to *MB/s, -1: no limit
        sync_speed: -1
        slaveof:
                # to identify a master even if it moved(ip, port changed)
                # if set to empty or not defined, ip:port will be used.
                #id: svc_2
                # sync|mirror, default is sync
                type: sync
                ip: 10.161.245.161
                port: 8001

logger:
        level: info
        output: /data1/logs/log.txt
        rotate:
                size: 1000000000

leveldb:
        # in MB
        cache_size: 2048
        # in KB
        block_size: 64
        # in MB
        write_buffer_size: 64
        # in MB
        compaction_speed: 1000
        # (I later changed this setting to 4096.)
        # yes|no
        compression: no

Output of the info command on the master (after restarting it)

version
    1.6.8.8
links
    131
total_calls
    9030140
key_range.kv
    "test" - "test20140722"
key_range.hash
    "bl:u:1000030" - "wbsm:u"
key_range.zset
    "bl:m:1000007" - "sv:u:999"
key_range.list
    "" - ""
leveldb.stats
                                   Compactions
    Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
    --------------------------------------------------
      0        0        0         2        0        35
      1        3      106        29       37       577
      2       62     1572        72     1060       996
      3      620    15984      4220    45044     44988
      4     1199    38110      8703    96820     96699

17 result(s) (0.001 sec)
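The two leveldb.stats tables can be compared by summing the Size(MB) column per node. The Python sketch below does that; the parser name `parse_leveldb_stats` is hypothetical, and it assumes the six-column layout shown in the info output above. The embedded numbers are copied from the two tables in this comment.

```python
def parse_leveldb_stats(text):
    """Parse rows of: Level Files Size(MB) Time(sec) Read(MB) Write(MB)."""
    levels = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly six integer columns; headers and rules do not.
        if len(fields) == 6 and all(f.isdigit() for f in fields):
            level, files, size_mb, time_sec, read_mb, write_mb = map(int, fields)
            levels.append({
                "level": level, "files": files, "size_mb": size_mb,
                "time_sec": time_sec, "read_mb": read_mb, "write_mb": write_mb,
            })
    return levels

def total_size_mb(levels):
    """Sum of the Size(MB) column across all levels."""
    return sum(row["size_mb"] for row in levels)

SLAVE_STATS = """\
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
--------------------------------------------------
  0        0        0       720        0     40269
  1        5      136      2159    93786     92960
  2       51     1578     10483   400010    391177
  3      491    15994      1309    50801     45835
  4      251     7937         0        0         0
"""

MASTER_STATS = """\
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
--------------------------------------------------
  0        0        0         2        0        35
  1        3      106        29       37       577
  2       62     1572        72     1060       996
  3      620    15984      4220    45044     44988
  4     1199    38110      8703    96820     96699
"""

slave = parse_leveldb_stats(SLAVE_STATS)
master = parse_leveldb_stats(MASTER_STATS)
print("slave total MB:", total_size_mb(slave))    # 25645
print("master total MB:", total_size_mb(master))  # 55772
```

The master total of 55772 MB is in line with the roughly 55 GB reported at the top of the thread. The slave figures were taken after its data directory was wiped and resynced, so they do not describe the 500 GB state.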

Master config file (M1)

# ssdb-server config
# MUST indent by TAB!

# relative to path of this file, directory must exists
work_dir = /data1/ssdb
pidfile = /data1/ssdb/ssdb.pid

server:
        ip: 0.0.0.0
        port: 8001
        # bind to public ip
        #ip: 0.0.0.0
        # format: allow|deny: all|ip_prefix
        # multiple allows or denys is supported
        deny: all
        allow: 127.0.0.1
        allow: 10.

replication:
        binlog: yes
        # Limit sync speed to *MB/s, -1: no limit
        sync_speed: -1
        slaveof:
                # to identify a master even if it moved(ip, port changed)
                # if set to empty or not defined, ip:port will be used.
                #id: svc_2
                # sync|mirror, default is sync
                #type: mirror
                #ip: 10.161.217.246
                #port: 8001

logger:
        level: info
        output: /data1/logs/log.txt
        rotate:
                size: 1000000000

leveldb:
        # in MB
        cache_size: 2048
        # in KB
        block_size: 64
        # in MB
        write_buffer_size: 64
        # in MB
        compaction_speed: 4196
        # yes|no
        compression: no

zhangshuao commented 9 years ago

The config files are as follows. Master:

# cat ssdb.conf
# ssdb-server config
# MUST indent by TAB!

# relative to path of this file, directory must exists
work_dir = ./var
pidfile = ./var/ssdb.pid

server:
        ip: 10.100.100.228
        port: 8888
        # bind to public ip
        #ip: 0.0.0.0
        # format: allow|deny: all|ip_prefix
        # multiple allows or denys is supported
        #deny: all
        #allow: 127.0.0.1
        #allow: 192.168

replication:
        slaveof:
                # to identify a master even if it moved(ip, port changed)
                # if set to empty or not defined, ip:port will be used.
                id: svc_2
                # sync|mirror, default is sync
                type: mirror
                ip: 10.100.100.229
                port: 8889

logger:
        level: info
        output: log.txt
        rotate:
                size: 1000000000

leveldb:
        # in MB
        cache_size: 500
        # in KB
        block_size: 32
        # in MB
        write_buffer_size: 64
        # in MB
        compaction_speed: 1000
        # yes|no
        compression: no

Slave:

# cat ssdb.conf
# ssdb-server config
# MUST indent by TAB!

# relative to path of this file, directory must exists
work_dir = ./var
pidfile = ./var/ssdb.pid

server:
        ip: 10.100.100.229
        port: 8889
        # bind to public ip
        #ip: 0.0.0.0
        # format: allow|deny: all|ip_prefix
        # multiple allows or denys is supported
        # deny: all
        # allow: 127.0.0.1
        # allow: 192.168

replication:
        slaveof:
                # to identify a master even if it moved(ip, port changed)
                # if set to empty or not defined, ip:port will be used.
                id: svc_1
                # sync|mirror, default is sync
                type: mirror
                ip: 10.100.100.228
                port: 8888

logger:
        level: info
        output: log.txt
        rotate:
                size: 1000000000

leveldb:
        # in MB
        cache_size: 500
        # in KB
        block_size: 32
        # in MB
        write_buffer_size: 64
        # in MB
        compaction_speed: 1000
        # yes|no
        compression: no

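The application writes to a single node only. All read and write requests go to the 100.228 machine (the master). In this setup, 100.229 is only responsible for real-time sync and does not accept inserts.

The logs contain no error messages.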

ideawu / ssdb

Slave's data directory uses far more disk space than the master's #558
