vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0
10.8k stars 1.2k forks source link

Occupy resources and nebula-storaged threads are killed #1302

Closed zdYng closed 4 years ago

zdYng commented 4 years ago

Occupy resources

When I started the docker version of nebula,

the resources occupied by docker increased directly from 4MB to 300MB.

The memory takes up 4MB before booting, and the CPU takes up 0%:

image

image

When I start nebula, start the command:

scripts/nebula.server start all

Docker takes up memory 159MB,CPU 120%:

image

Sometimes there are times when the port cannot be fully started: image

Start a few more times.

image

But sometimes it can't be started :

image

zdYng commented 4 years ago

image

image

image

image

As time goes on, memory consumption is also increasing!

Do nothing, memory consumption is increasing!

zdYng commented 4 years ago

image

zdYng commented 4 years ago

image

amber-moe commented 4 years ago

please try to check the log size and change the log level

whitewum commented 4 years ago

I think 300MB is not enough for the service. And I think the processes are OOM killed during starting. Can you provide more memory for the container? (About 4-8GB)

zdYng commented 4 years ago

However, I have not added data yet, just simply start the service.

zdYng commented 4 years ago

I will try again,

zdYng commented 4 years ago

Sorry, my cloud server only has 2G of memory and has been tested. It only takes a lot of resources (including cpu and memory) during the time when the docker is started. After that, it will also stabilize, and tested the python-side example, everything works fine. It’s just that the port will be killed after a while.

zdYng commented 4 years ago

image

image

image

image

For a while, nebula-storaged will be automatically killed, is the case?

I have given docker unlimited memory usage.

zdYng commented 4 years ago

When docker automatically kills the nebula-storage port, it needs to be restarted to continue normal use.

whitewum commented 4 years ago

For a while, nebula-storaged will be automatically killed, is the case?

I have given docker unlimited memory usage.


This is very strange. Can you provide storaged log (in /usr/local/nebula/log/).

BTW, is storaged killed by OOM, or crashed?

whitewum commented 4 years ago

When docker automatically kills the nebula-storage port, it needs to be restarted to continue normal use.

Right.

Sometimes, the port is occupied so storaged can't restart. And You need to wait a few seconds to retry

zdYng commented 4 years ago
[root@nebula logs]# ll
total 62428
lrwxrwxrwx 1 root root       63 Nov 21 08:12 nebula-graphd.ERROR -> nebula-graphd.nebula.invalid-user.log.ERROR.20191121-081220.128
lrwxrwxrwx 1 root root       62 Nov 21 08:09 nebula-graphd.INFO -> nebula-graphd.nebula.invalid-user.log.INFO.20191121-080939.128
lrwxrwxrwx 1 root root       65 Nov 21 08:12 nebula-graphd.WARNING -> nebula-graphd.nebula.invalid-user.log.WARNING.20191121-081220.128
-rw-r--r-- 1 root root     2461 Nov 21 09:05 nebula-graphd.nebula.invalid-user.log.ERROR.20191121-081220.128
-rw-r--r-- 1 root root 25818390 Nov 21 14:49 nebula-graphd.nebula.invalid-user.log.INFO.20191121-080939.128
-rw-r--r-- 1 root root     2461 Nov 21 09:05 nebula-graphd.nebula.invalid-user.log.WARNING.20191121-081220.128
lrwxrwxrwx 1 root root       61 Nov 21 08:09 nebula-metad.ERROR -> nebula-metad.nebula.invalid-user.log.ERROR.20191121-080939.56
lrwxrwxrwx 1 root root       60 Nov 21 08:09 nebula-metad.INFO -> nebula-metad.nebula.invalid-user.log.INFO.20191121-080939.56
lrwxrwxrwx 1 root root       63 Nov 21 08:09 nebula-metad.WARNING -> nebula-metad.nebula.invalid-user.log.WARNING.20191121-080939.56
-rw-r--r-- 1 root root     1195 Nov 21 09:04 nebula-metad.nebula.invalid-user.log.ERROR.20191121-080939.56
-rw-r--r-- 1 root root    43717 Nov 21 09:35 nebula-metad.nebula.invalid-user.log.INFO.20191121-080939.56
-rw-r--r-- 1 root root     1195 Nov 21 09:04 nebula-metad.nebula.invalid-user.log.WARNING.20191121-080939.56
lrwxrwxrwx 1 root root       65 Nov 21 09:08 nebula-storaged.ERROR -> nebula-storaged.nebula.invalid-user.log.ERROR.20191121-090803.711
lrwxrwxrwx 1 root root       64 Nov 21 09:07 nebula-storaged.INFO -> nebula-storaged.nebula.invalid-user.log.INFO.20191121-090759.711
lrwxrwxrwx 1 root root       67 Nov 21 09:08 nebula-storaged.WARNING -> nebula-storaged.nebula.invalid-user.log.WARNING.20191121-090803.711
-rw-r--r-- 1 root root   532745 Nov 21 08:12 nebula-storaged.nebula.invalid-user.log.ERROR.20191121-080940.159
-rw-r--r-- 1 root root   532616 Nov 21 09:04 nebula-storaged.nebula.invalid-user.log.ERROR.20191121-090304.455
-rw-r--r-- 1 root root   958600 Nov 21 09:08 nebula-storaged.nebula.invalid-user.log.ERROR.20191121-090803.711
-rw-r--r-- 1 root root  9586097 Nov 21 08:44 nebula-storaged.nebula.invalid-user.log.INFO.20191121-080939.159
-rw-r--r-- 1 root root  6772768 Nov 21 09:04 nebula-storaged.nebula.invalid-user.log.INFO.20191121-090303.455
-rw-r--r-- 1 root root 17519384 Nov 21 09:34 nebula-storaged.nebula.invalid-user.log.INFO.20191121-090759.711
-rw-r--r-- 1 root root   532842 Nov 21 08:12 nebula-storaged.nebula.invalid-user.log.WARNING.20191121-080939.159
-rw-r--r-- 1 root root   532616 Nov 21 09:04 nebula-storaged.nebula.invalid-user.log.WARNING.20191121-090304.455
-rw-r--r-- 1 root root   958600 Nov 21 09:08 nebula-storaged.nebula.invalid-user.log.WARNING.20191121-090803.711
-rw-r--r-- 1 root root     2325 Nov 21 09:05 stderr.log
-rw-r--r-- 1 root root        0 Nov 21 08:09 stdout.log
[root@nebula logs]# datetime
bash: datetime: command not found
[root@nebula logs]# date
Thu Nov 21 14:50:08 UTC 2019
zdYng commented 4 years ago

One of the ERROR files:

[root@nebula logs]# cat nebula-graphd.ERROR
Log file created at: 2019/11/21 08:12:20
Running on machine: nebula
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1121 08:12:20.891650   264 InsertVertexExecutor.cpp:50] No schema found for person
E1121 09:00:30.294275   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:01:18.951679   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:02:29.934098   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:03:21.914382   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:04:39.400250   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:04:57.863484   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:04:57.863958   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:05:04.963557   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:05:04.966099   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
zdYng commented 4 years ago

nebula-storaged.ERROR:

E1121 09:08:17.128510   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.128862   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.128975   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.129107   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.129220   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.129348   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.129559   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.129861   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.130076   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.130481   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.130692   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.130848   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.130954   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131075   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131217   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131393   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131539   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131651   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131775   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.131901   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
E1121 09:08:17.132025   769 Part.cpp:63] Cannot fetch the last committed log id from the storage engine
zdYng commented 4 years ago
[root@nebula logs]# cat stderr.log
E1121 08:12:20.891650   264 InsertVertexExecutor.cpp:50] No schema found for person
E1121 09:00:30.294275   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:01:18.951679   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:02:29.934098   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:03:21.914382   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:04:39.400250   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E1121 09:04:57.863484   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:04:57.863958   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:05:04.963557   202 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1121 09:05:04.966099   141 StorageClient.inl:102] Request to [0.0.0.0:44500] failed: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
zdYng commented 4 years ago

My current time zone is CST,the docker 's time zone is UTC.

zdYng commented 4 years ago
[root@nebula nebula]# scripts/nebula.service status all
[INFO] nebula-metad: Running as 56, Listening on 45500
[INFO] nebula-graphd: Running as 128, Listening on 3699
[INFO] nebula-storaged: Exited
[root@nebula nebula]# ps -aux |grep "nebula"
root        56  0.3  0.9 487748 16016 ?        Ssl  08:09   1:22 /usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
root       128  0.3  0.3 219140  5960 ?        Ssl  08:09   1:29 /usr/local/nebula/bin/nebula-graphd --flagfile /usr/local/nebula/etc/nebula-graphd.conf
root      1117  0.0  0.0   9092   668 pts/5    S+   14:59   0:00 grep --color=auto nebula
[root@nebula nebula]#
zdYng commented 4 years ago

image

image

My container does not have any restrictions on resources such as memory. I think the possibility of OOM is relatively large, but I also had a situation where the docker occupied more resources than 500M, and there was no similar problem. Can I optimize the storage layer to reduce the memory usage? I set the container crash to restart automatically.

whitewum commented 4 years ago

Thank you for your report.

Can we reach you by wechat (wexin)? Find and add friend NebulaGraphBot

whitewum commented 4 years ago

BTW, if your partition_num = 1024 ( when you create a new graph space), it takes about 4GB memory. You can change it to 10 to reduce memory usage (for a functional test)

zdYng commented 4 years ago

Yes, I have added NebulaGraphBot, I will modify partition_num to test it.

zdYng commented 4 years ago

My WeChat is qzd1460786009

whitewum commented 4 years ago

try partition_num = 1.