can't allocate memory RuntimeError exception in _calc_mention_logits

shon-otmazgin / fastcoref

MIT License


can't allocate memory RuntimeError exception in _calc_mention_logits #23

Closed: aryehgigi closed this issue 1 year ago

aryehgigi commented 1 year ago

I got the following exception: RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 368596058460 bytes. Error code 12 (Cannot allocate memory)

It happens in fastcoref/coref_models/modeling_fcoref.py, line 190, in _calc_mention_logits: mention_logits = joint_mention_logits + start_mention_logits.unsqueeze(-1) + end_mention_logits.unsqueeze(-2)

Unfortunately, it seems that the failed call consumes the memory and doesn't release it, so my thread hangs (even if I wrap the prediction in a try/except block).

Maybe we can add a validation step before computing the scores that checks the input lengths aren't too long, and return an error if they are?
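The kind of validation proposed here could look roughly like the sketch below. It is not fastcoref code: the function names, the `max_bytes` budget, and the assumption of 4-byte (float32) elements are all hypothetical. The only thing taken from the traceback is that adding `start_mention_logits.unsqueeze(-1)` to `end_mention_logits.unsqueeze(-2)` broadcasts to a result whose last two dimensions are both the sequence length, so memory grows quadratically with it. The estimate covers a single result tensor, so it is a lower bound on what the expression actually needs.

```python
def pairwise_logits_bytes(batch_size: int, seq_len: int, bytes_per_element: int = 4) -> int:
    """Bytes for one (batch_size, seq_len, seq_len) tensor.

    bytes_per_element=4 assumes float32; this is an assumption, not
    something stated in the issue.
    """
    return batch_size * seq_len * seq_len * bytes_per_element


def check_pairwise_logits_fit(batch_size: int, seq_len: int, max_bytes: int,
                              bytes_per_element: int = 4) -> int:
    """Raise ValueError before any allocation if the estimate exceeds max_bytes.

    max_bytes is a caller-chosen budget (hypothetical parameter).
    Returns the estimated size when it fits.
    """
    needed = pairwise_logits_bytes(batch_size, seq_len, bytes_per_element)
    if needed > max_bytes:
        raise ValueError(
            f"mention logits would need at least {needed} bytes "
            f"(batch_size={batch_size}, seq_len={seq_len}), "
            f"which exceeds the budget of {max_bytes} bytes"
        )
    return needed
```

Because the check runs on plain integers before any tensor is created, a caller can catch the `ValueError` normally, unlike the allocator failure described above.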

Thanks @shon-otmazgin , @ariecattan

shon-otmazgin commented 1 year ago

Hello @aryehgigi, did you try lowering the batch size? The memory you're using may effectively be #threads * batch_size, since each thread holds its own batch.

aryehgigi commented 1 year ago

Do you mean in each process? I can decrease the batch size, but:

- It will complicate my code, as each batch is a paragraph and my code relies on that. (Think of the generic user: if the batch size is not just a number they set, it would force them to change the logic of how they batch.)
- The user doesn't know in advance whether they are going to hit a limit. If you give them more careful protection, they could use the returned error to decide whether to drop the batch or to split only that one into smaller pieces.
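The second point, splitting only the batch that is too large, could be sketched as follows. Everything here is hypothetical: `BatchTooLargeError` stands in for whatever error a size check might raise, and `predict_batch` stands in for the caller's own prediction function. It is not the fastcoref API. The approach only works if the error is raised before the allocation is attempted, since the allocator failure reported in this issue could not be recovered from with try/except.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class BatchTooLargeError(ValueError):
    """Hypothetical error raised by a size check before any allocation."""


def predict_with_split(items: Sequence[T],
                       predict_batch: Callable[[Sequence[T]], List[R]]) -> List[R]:
    """Run predict_batch, halving only the batches that are rejected.

    Results are returned in the same order as the input items.
    """
    try:
        return list(predict_batch(items))
    except BatchTooLargeError:
        if len(items) <= 1:
            # A single item that is still too large cannot be split here;
            # let the caller decide whether to drop or truncate it.
            raise
        mid = len(items) // 2
        left = predict_with_split(items[:mid], predict_batch)
        right = predict_with_split(items[mid:], predict_batch)
        return left + right
```

Batches that fit are processed unchanged, so the caller's paragraph-per-batch logic stays intact in the common case.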

wdyt?

shon-otmazgin commented 1 year ago

This is how batches usually behave. For instance, when you train a new model, you set a batch size, sometimes you get an OOM error, and then you adjust the code to handle it.

For your use case, the batch size should be a few paragraphs. But you are probably running several threads on the same resource, and we can't control that kind of usage from the package's side.

aryehgigi commented 1 year ago

got it, thanks :)