realm / realm-object-server

Tracking of issues related to the Realm Object Server and other general issues not related to the specific SDKs
https://realm.io

Realm Object Server was running, now won't restart #255

Closed: misterlib closed this issue 7 years ago

misterlib commented 7 years ago

Everything was working fine, with no spikes shown on the server. It has now failed and won't restart.

Goals

Start the realm object server.

Expected Results

Server starts

Actual Results

The status output says the server can't restart and reports this error: Error: Not a valid RSA public key

Steps to Reproduce

sudo systemctl start realm-object-server

Code Sample

Version of Realm and Tooling

Logs

systemctl status realm-object-server.service
● realm-object-server.service - Realm Object Server
Loaded: loaded (/etc/systemd/system/realm-object-server.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2017-08-21 23:53:59 UTC; 5min ago
Process: 31605 ExecStart=/usr/bin/realm-object-server -c /etc/realm/configuration.yml (code=dumped, signal=ABRT)
Main PID: 31605 (code=dumped, signal=ABRT)
Aug 21 23:53:58 li515-11.members.linode.com systemd[1]: Unit realm-object-server.service entered failed state.
Aug 21 23:53:58 li515-11.members.linode.com systemd[1]: realm-object-server.service failed.
Aug 21 23:53:59 li515-11.members.linode.com systemd[1]: realm-object-server.service holdoff time over, scheduling restart.
Aug 21 23:53:59 li515-11.members.linode.com systemd[1]: start request repeated too quickly for realm-object-server.service
Aug 21 23:53:59 li515-11.members.linode.com systemd[1]: Failed to start Realm Object Server.
Aug 21 23:53:59 li515-11.members.linode.com systemd[1]: Unit realm-object-server.service entered failed state.
Aug 21 23:53:59 li515-11.members.linode.com systemd[1]: realm-object-server.service failed.

sudo journalctl -u realm-object-server.service
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: <--- Last few GCs --->
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: 225163286 ms: Mark-sweep 1249.9 (1298.9) -> 1248.8 (1299.9) MB, 42.7 / 0.0 ms [allocation failure] [GC in old space requested].
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: 225163324 ms: Mark-sweep 1248.8 (1299.9) -> 1248.8 (1268.9) MB, 38.1 / 0.0 ms [last resort gc].
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: 225163366 ms: Mark-sweep 1248.8 (1268.9) -> 1248.6 (1268.9) MB, 41.1 / 0.0 ms [last resort gc].
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: <--- JS stacktrace --->
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: ==== JS stack trace =========================================
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: Security context: 0x15e70e4cfb39
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: 1: DoJoin(aka DoJoin) [native array.js:~129] [pc=0x985795c302d] (this=0x15e70e404381 ,w=0xab1ae756101 ,x=3,N
Aug 18 12:38:03 li515-11.members.linode.com realm-object-server[24602]: 2: Join(aka Join) [native array.js:180] [pc=0x98579557f92] (this=0x15e70e404381 ,w=0xab1ae756101

cat /var/log/realm-object-server.log [realmobjectserverlog.txt](https://github.com/realm/realm-mobile-platform/files/1240436/realmobjectserverlog.txt)

misterlib commented 7 years ago

Not sure if that actually uploaded the last log. Just in case, here it is:

realmobjectserverlog.txt

misterlib commented 7 years ago

Looks like the server was out of space.

I am looking into resizing it. But I have a question about the tmp files.

`df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 95G 90G 0 100% /
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 394M 3.6G 10% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
tmpfs 799M 0 799M 0% /run/user/0
du -hsx * | sort -rh | head -10
80G tmp
8.5G 0
2.7M internal_data
4.0K user_data
[root@li515-11 object-server]# pwd
/var/lib/realm/object-server`

Looks like the /var/lib/realm/object-server/tmp directory is 80G.

Does that need to remain or is that something that gets purged over time? Can I manually dump that? Just want to know how to maintain it in a safe way.

Thanks.

kvap commented 7 years ago

@misterlib This is just temporary data, which can be deleted. However, it would be better to keep a copy of it somewhere else for a while, in case the internal clients in the object server hadn't been able to upload all their changes to the sync server before the crash. If they are way ahead of the sync server, you can lose some user auth data.

misterlib commented 7 years ago

Thanks for your reply @kvap.

Can you explain a bit further what you mean by user auth data? Does that mean that they won't be able to login?

Also, when you say "way ahead" I'm assuming that you mean that they've made a bunch of changes on their device offline (due to server being down or just no connectivity). When they reconnect, those changes wouldn't be able to be synced?

Also, this seems to be an issue because the server has crashed, but can you offer any advice on when this could be cleaned in general? Or does that happen on a schedule?

Just trying to get some more clarification on what may happen after considering the use cases for my users.

kvap commented 7 years ago

@misterlib Parts of the object server also use the sync server internally as clients. They keep "local" synchronized realms in that tmp dir. It should be safe to delete those files; the ones that are needed will be re-created. There can be a rare case where a user has registered and the auth part of the object server has put the relevant info into tmp but has not yet synced it internally. If you delete tmp at that point, you lose that user account. The same can happen to permission updates.

Since the server won't start unless you delete something, delete the oldest stuff in that tmp directory; it is the most likely to have been synced and/or to have become unneeded. It is not cleared on a schedule.
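
A minimal sketch of the advice above, combining "start with the oldest entries" with "keep a copy somewhere else for a while". This is not an official Realm tool: `move_oldest` is a hypothetical helper, the backup path in the example call is an assumption, and the backup location should be on a disk that still has free space. It assumes bash and that entry names in tmp contain no newlines.

```shell
#!/usr/bin/env bash
# Sketch only: move the N oldest entries of a directory into a backup
# directory instead of deleting them. move_oldest is a hypothetical helper,
# not part of Realm Object Server. Assumes entry names contain no newlines.
move_oldest() {
  local src="$1" backup="$2" count="$3"
  mkdir -p "$backup" || return 1
  # ls -1tr lists entries sorted by modification time, oldest first.
  ls -1tr -- "$src" | head -n "$count" | while IFS= read -r name; do
    mv -- "$src/$name" "$backup/" && printf 'moved %s\n' "$name"
  done
}

# Example call (the backup path is an assumption; use a disk with free space):
# move_oldest /var/lib/realm/object-server/tmp /mnt/backup/ros-tmp 5
```

Moving rather than deleting means the entries can be put back if a user account or permission update turns out to be missing. It is probably safest to run this while the server is stopped.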

misterlib commented 7 years ago

Ok, I upgraded my server on Linode from:

Linode 8GB | 8 GB | 4 Cores | 96 GB SSD | 4 TB | 40 Gbps | 1000 Mbps

to

Linode 24GB | 24 GB | 8 Cores | 384 GB SSD | 16 TB | 40 Gbps | 2000 Mbps

The server is still consistently crashing whenever a large spike hits it, particularly when my users are first getting set up. They may have a 50-20 MB realm; when that upload hits the server, it spikes and the server starts shutting down over and over.

Or, instead of the initial upload, the initial download causes the same thing: a second device connects to the ROS and needs to download, which crashes the server a few times before the data starts transferring.

I've now cleared out my temp files at least twice because they are all almost 2 GB each.

But my overall transfer for the month is less than 30 GB. (We just launched our service last Saturday.)

Clearly, something is wrong. Do I need a ton more RAM? Do I just load up the tmp folder, delete files every day, and hope that the server reboots and starts transferring data again? There's got to be a better solution.

Any advice will help. Thanks!

kvap commented 7 years ago

Does it keep crashing for the same reason (no disk space)? If not, it would be interesting to know why it crashes; any log messages or core dumps would help.
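
One quick way to rule disk space in or out after a crash is to check how full the filesystem holding the Realm data directory is. The following is only a sketch under assumptions: `disk_use_percent` is a hypothetical helper, the 90 percent threshold in the example is arbitrary, and the parsing assumes the device name contains no spaces.

```shell
#!/usr/bin/env bash
# Sketch only: disk_use_percent is a hypothetical helper.
# Prints how full the filesystem holding the given path is, as a bare number.
disk_use_percent() {
  # df -P prints one POSIX-format line per filesystem; field 5 is the
  # capacity column, e.g. "100%". Assumes the device name has no spaces.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: warn when the Realm data directory's filesystem is nearly full.
# The 90 percent threshold is an arbitrary assumption.
# [ "$(disk_use_percent /var/lib/realm/object-server)" -ge 90 ] && echo "disk nearly full"
```

If the number is well below 100 when the crash happens, the journalctl output shown earlier in this thread is the next place to look.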

kvap commented 7 years ago

@misterlib It would be great if our team could get ssh access to the server. Please write to help@realm.io if you would like us to do that.

misterlib commented 7 years ago

We are redoing stuff. This shouldn't be an issue. I'm going to close this issue and discuss our plan for fixing this problem in a new issue and make sure that it gets handled correctly.