mozilla / snakepit

Machine learning job scheduler
Mozilla Public License 2.0

Core Dumped: Memory Allocation Issue #143

Closed JRMeyer closed 5 years ago

JRMeyer commented 5 years ago

@tilmankamp I think this is happening when I run two or more CPU / RAM intensive jobs on the same node:

This is from job 4120, which was run on the same node as job 4119. Both jobs are CPU / RAM intensive, and only 4119 survived.

This is happening in the kenlm step, before any deepspeech process spins up.

[2019-03-01 22:05:19] [worker 0] /data/rw/home/kenlm/util/scoped.cc:20 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
[2019-03-01 22:05:19] [worker 0] Cannot allocate memory for 94530609152 bytes in malloc
[2019-03-01 22:05:19] [worker 0] .compute: line 59:  2543 Aborted                 (core dumped) /data/rw/home/kenlm/build/bin/lmplz --skip_symbols --order 2 --text "${TEXT}" --arpa lm.arpa
[2019-03-01 22:05:19] [worker 0] Worker 0 ended with exit code 134
[2019-03-01 22:05:19] [daemon] Worker 0 requested stop. Stopping pit...
kdavis-mozilla commented 5 years ago

This is not related to snakepit. You are allocating too much memory...

...
Cannot allocate memory for 94530609152 bytes in malloc
...

Each machine has "only" 128 GB of RAM.

Basically you should look at your kenlm call and change it so it does not allocate so much memory.

JRMeyer commented 5 years ago

@kdavis-mozilla - by each machine do you mean each node?

If that's the case, then before allocating memory for a given job, you'd need to know how much memory the other jobs on the node are already using, right? That seems like a snakepit issue.

This error only occurs when I have multiple jobs using kenlm on the same node.

If each job gets 128GB, then I'm obviously wrong.

kdavis-mozilla commented 5 years ago

Yes each node.

It's not a snakepit issue as snakepit does not, and can not (think halting problem), figure out how much memory your program will use.

kenlm allows you to limit the amount of memory it uses via the -S option, so you should use that option.
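For reference, a minimal sketch of what the adjusted call from the log above could look like; the 20G budget (and the percentage form in the comment) are illustrative values, not ones recommended in this thread:

/data/rw/home/kenlm/build/bin/lmplz --skip_symbols --order 2 -S 20G --text "${TEXT}" --arpa lm.arpa
# -S also accepts a fraction of physical RAM, e.g. -S 30%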

JRMeyer commented 5 years ago

I see what you mean... and I think you're right that the best solution is the -S option.

tilmankamp commented 5 years ago

Some comment on the memory: There is a (rather easy) way to limit memory and CPU consumption on a node on a per-job basis. It's not implemented yet, as there are some open questions: What should the default limits be? A fixed size? A fixed fraction based on the number of allocated GPUs? A "hard" limit or a "soft" limit? Can limits be exceeded? Should they apply on a per-rights or per-group basis?...
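For illustration only, here is one way such a per-job cap could be enforced on a Linux node with systemd; the MemoryMax/CPUQuota values are made-up placeholders, and this is not how snakepit currently launches workers:

# wrap the worker command in a transient scope with a hard memory cap and a CPU quota
systemd-run --scope -p MemoryMax=32G -p CPUQuota=400% -- \
  /data/rw/home/kenlm/build/bin/lmplz --skip_symbols --order 2 --text "${TEXT}" --arpa lm.arpa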

kdavis-mozilla commented 5 years ago

@JRMeyer could you limit the memory with the -S option? It is by far the most efficient solution to this problem.

JRMeyer commented 5 years ago

@kdavis-mozilla - at this point, it's not an issue for me. The -S option is fine.

I thought this issue might be more general than my specific kenlm use case. If it isn't that important to other snakepit users, I'll close the issue.