Closed: JRMeyer closed this issue 5 years ago
This is not related to snakepit. You are allocating too much memory....
...
Cannot allocate memory for 94530609152 bytes in malloc
...
That allocation is roughly 94.5 GB, and each machine has "only" 128GB.
Basically, you should look at your kenlm call and change it so it does not allocate so much memory.
@kdavis-mozilla - by each machine do you mean each node?
If that's the case, then before allocating memory for a given job, you'd need to know how much memory the other jobs are already taking up, yes? That seems like a snakepit issue.
This error only occurs when I have multiple jobs using kenlm on the same node.
If each job gets 128GB, then I'm obviously wrong.
Yes each node.
It's not a snakepit issue, as snakepit does not, and cannot (think: halting problem), figure out how much memory your program will use.
kenlm lets you limit the amount of memory it uses via the -S option, so you should use that option.
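To make that concrete, a capped lmplz call might look like the following sketch (the corpus path, model order, and 20G budget are illustrative, not from this issue):

```shell
# Cap kenlm's estimation memory with -S: it accepts absolute sizes
# (e.g. 20G) or a percentage of RAM. The default is 80% of RAM, which
# is what lets two concurrent jobs on a 128GB node collide.
lmplz -o 5 -S 20G -T /tmp < corpus.txt > lm.arpa
```

With an explicit -S per job, two kenlm jobs on the same node can each stay within their share of the 128GB.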
I see what you mean... and I think you're right that the best solution is the -S option.
Some comment on the memory: there is a (rather easy) way to limit memory and CPU consumption on a node on a per-job basis. It's not implemented yet, as there are some open questions: What should the default limits be? A fixed size? A fixed fraction based on the number of allocated GPUs? A "hard" or "soft" limit? Can limits be exceeded? Granted on a rights or groups basis?...
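As a sketch of the "easy way" mentioned above, a scheduler could wrap each job with a `ulimit`-style cap before launching it. `run_limited` below is a hypothetical helper, not part of snakepit's actual API, and assumes a Linux-like system where `ulimit -v` is enforced:

```shell
# Hypothetical per-job memory cap: run a command under a virtual
# memory limit, so an over-allocating job fails on its own instead
# of exhausting the node.
run_limited() {
  limit_kb=$1
  shift
  # The subshell keeps the limit from leaking into the caller;
  # ulimit -v caps virtual address space, in kilobytes.
  ( ulimit -v "$limit_kb" && exec "$@" )
}

# Illustrative use: give the job ~512 MB; a malloc beyond that fails
# inside the job, much like the error at the top of this issue.
# run_limited $((512 * 1024)) lmplz -o 5 -S 400M -T /tmp < corpus.txt > lm.arpa
```

Whether such a cap should be a hard kill or a soft warning is exactly the open question raised above.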
@JRMeyer could you limit the memory with the -S option? It is by far the most efficient solution to this problem.
@kdavis-mozilla - at this point, it's not an issue for me. The -S option is fine.
I thought this issue may be more general than my specific kenlm use-case. If this isn't that important to other users of snakepit, then I'd close the issue.
@tilmankamp I think this is happening when I run two or more CPU / RAM intensive jobs on the same node: this is from job 4120, which was run on the same node as job 4119. Both jobs are CPU / RAM intensive, and only 4119 survived. This is happening with kenlm, before any deepspeech spins off.