smogon / pokemon-showdown

Pokémon battle simulator.
https://pokemonshowdown.com
MIT License

PS Server crashes on start #8320

Closed lolanchen closed 3 years ago

lolanchen commented 3 years ago

Hi there!

I'm running a PS server on an Ubuntu 18.04 machine for a small reinforcement learning project.
Following advice from friends, I tweaked the config a little so that PS handles a large number of simultaneous battles better than it does with the default settings.
I put 32 bots on the ladder playing against each other to gather data, and it worked fine for a few days until it crashed today. Since the crash, the PS server cannot be started again: every time I try, the following error occurs.

c@cayenne2:~/workspace/dependecies/pokemon-showdown$ node --max-old-space-size=8192 pokemon-showdown
RESTORE CHATROOM: lobby
RESTORE CHATROOM: staff
events.js:292
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at afterWriteDispatched (internal/stream_base_commons.js:154:25)
    at writeGeneric (internal/stream_base_commons.js:145:3)
    at Socket._writeGeneric (net.js:786:11)
    at Socket._write (net.js:798:8)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Socket.Writable.write (_stream_writable.js:318:11)
    at REPLServer._writeToOutput (readline.js:337:17)
    at REPLServer.Interface.prompt (readline.js:303:10)
    at REPLServer.displayPrompt (repl.js:1035:8)
Emitted 'error' event on Socket instance at:
    at errorOrDestroy (internal/streams/destroy.js:108:12)
    at onwriteError (_stream_writable.js:418:5)
    at onwrite (_stream_writable.js:445:5)
    at internal/streams/destroy.js:50:7
    at Socket._destroy (net.js:679:5)
    at Socket.destroy (internal/streams/destroy.js:38:8)
    at afterWriteDispatched (internal/stream_base_commons.js:154:17)
    at writeGeneric (internal/stream_base_commons.js:145:3)
    at Socket._writeGeneric (net.js:786:11)
    at Socket._write (net.js:798:8) {
  errno: 'EPIPE',
  code: 'EPIPE',
  syscall: 'write'
}

CRASH: Error [ERR_IPC_DISCONNECTED]: IPC channel is already disconnected
    at ChildProcess.target.disconnect (internal/child_process.js:832:26)
    at QueryProcessWrapper.destroy (/home/chen/research/pfrl/dependecies/pokemon-showdown/.lib-dist/process-manager.js:185:16)
    at QueryProcessWrapper.release (/home/chen/research/pfrl/dependecies/pokemon-showdown/.lib-dist/process-manager.js:171:9)
    at QueryProcessManager.releaseCrashed (/home/chen/research/pfrl/dependecies/pokemon-showdown/.lib-dist/process-manager.js:438:16)
    at ChildProcess.<anonymous> (/home/chen/research/pfrl/dependecies/pokemon-showdown/.lib-dist/process-manager.js:495:47)
    at ChildProcess.emit (events.js:315:20)
    at ChildProcess.EventEmitter.emit (domain.js:483:12)
    at finish (internal/child_process.js:861:14)
    at processTicksAndRejections (internal/process/task_queues.js:79:11)

SUBCRASH: Error: Unknown system error -122: Unknown system error -122, close

Worker 2 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000
Worker 1 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000
Worker 3 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000
Worker 4 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000

CRASH: Error: Unknown system error -122: Unknown system error -122, close

SUBCRASH: Error: Unknown system error -122: Unknown system error -122, close

CRASH: Error: Unknown system error -122: Unknown system error -122, close

CRASH: Error: Unknown system error -122: Unknown system error -122, close

SUBCRASH: Error: Unknown system error -122: Unknown system error -122, close

The version of Showdown I'm using is commit 82006cf3089adce77f88b81740818ae2e4b779dd.
My changes to the config and source code are as follows.

// config/config.js
exports.workers = 4; // default 1
exports.simulatorprocesses = 32; // default 16
// server/rooms.ts
const TIMEOUT_EMPTY_DEALLOCATE = 30 * 1000; // default 10 * 60 * 1000
// server/room-battle.ts
const DISCONNECTION_TIME = 10; // default 60
const DISCONNECTION_BANK_TIME = 15; // default 300

How can I fix this?

Besides that, what else can I do to make the PS ladder better handle heavy load from bots? (Typically 100 battles/s, with each battle lasting 30 to 100+ turns.)

Thank you!

Zarel commented 3 years ago

I'm not very familiar with Linux, to be honest, but the error looks to me like PS is having trouble spawning processes. "EPIPE" and "ERR_IPC_DISCONNECTED" both mean "we tried to communicate with another process and failed", and "Unknown system error -122" means "quota exceeded" (on Linux, errno 122 is EDQUOT).
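
For reference, Node reports system errors it has no name for as the negated errno value, so -122 corresponds to errno 122. Here is a minimal Python sketch for looking such a code up on the machine that produced it. The helper name describe_system_error is made up for illustration, and the result depends on the platform's errno table, so treat it as a quick diagnostic rather than anything PS-specific.

```python
import errno
import os


def describe_system_error(code: int) -> str:
    """Translate a (possibly negated) errno value into 'NAME: message'.

    Hypothetical helper, not part of PS or Node. The name and message
    come from the local platform, so run it on the affected server.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


# The code from the crash log above.
print(describe_system_error(-122))
```

On a typical Linux system this should report EDQUOT, which matches the "quota exceeded" reading.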

My best guess is that you're using too much of some system resource. "After the crash, PS server cannot be started again" tells me that you might still have running processes fighting for resources. Something like killall node might allow you to restart PS.

In the future, might I suggest using the simulator API directly?

https://github.com/smogon/pokemon-showdown/blob/master/sim/README.md

You can communicate with it using standard IO, and this will skip over a lot of overhead relating to server hosting and matchmaking.
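
As a rough idea of what that could look like from Python, here is a sketch that builds the opening commands in the form shown in sim/README.md and writes them to a simulator process over stdin. The helper names (start_commands, open_simulator, send_lines) are hypothetical, it assumes node is on the PATH and ps_dir points at a PS checkout, and reading and parsing the simulator's output is left out.

```python
import json
import subprocess


def start_commands(format_id, p1_name, p2_name):
    """Build the lines that start a battle and register both players."""
    def compact(obj):
        # Compact JSON, matching the style of the README examples.
        return json.dumps(obj, separators=(",", ":"))

    return [
        ">start " + compact({"formatid": format_id}),
        ">player p1 " + compact({"name": p1_name}),
        ">player p2 " + compact({"name": p2_name}),
    ]


def open_simulator(ps_dir):
    """Spawn one simulator process with text-mode pipes (assumed layout)."""
    return subprocess.Popen(
        ["node", "pokemon-showdown", "simulate-battle"],
        cwd=ps_dir,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send_lines(proc, lines):
    """Write each command on its own line and flush so the sim sees it."""
    for line in lines:
        proc.stdin.write(line + "\n")
    proc.stdin.flush()


# Example usage (not run here, since it needs a PS checkout):
#   sim = open_simulator("/path/to/pokemon-showdown")
#   send_lines(sim, start_commands("gen7randombattle", "Alice", "Bob"))
#   first_line = sim.stdout.readline()
```

Each bot pair would get its own simulator process this way, so no server, login, or ladder is involved.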

lolanchen commented 3 years ago

Thanks for the quick answer! Killing all the node processes does solve the problem. It seems it really was just a resource issue.

Yeah, I thought about using the simulator API in the beginning, but implementing a random matching system for a pool of Python bots seemed like quite a bit of effort for someone who has never written a single line of JavaScript. The ladder already has that implemented, so I decided I might as well just use the ladder.
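
For anyone weighing the same trade-off, the random matching part on its own can be fairly small. A rough Python sketch, assuming each bot is identified by a name string; random_pairings is a hypothetical helper, not part of PS, and it covers neither ratings nor running the battles themselves.

```python
import random


def random_pairings(bots, rng=None):
    """Shuffle the pool and pair neighbours; return (pairs, leftover).

    leftover is the bot that sits out when the pool size is odd,
    otherwise None.
    """
    rng = rng or random.Random()
    shuffled = list(bots)
    rng.shuffle(shuffled)
    pairs = [
        (shuffled[i], shuffled[i + 1])
        for i in range(0, len(shuffled) - 1, 2)
    ]
    leftover = shuffled[-1] if len(shuffled) % 2 else None
    return pairs, leftover


# 32 bots, as in the setup described above, give 16 battles per round.
pool = [f"bot{i}" for i in range(32)]
pairs, leftover = random_pairings(pool)
print(len(pairs), leftover)
```

Calling it once per round gives every bot a fresh random opponent each time.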

Cheers!