Closed by lfoppiano 3 years ago
It's not ideal, but can you change your ulimit?
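For what it's worth, that usually means raising the per-process open-file limit (ulimit -n in a shell) before running cooccur. The same limit can also be inspected and raised from C with getrlimit/setrlimit; a minimal sketch, not something GloVe currently does:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
    printf("open files: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    rl.rlim_cur = rl.rlim_max;  /* raise the soft limit as far as the hard limit allows */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }
    return 0;
}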
It looks like, based on reading the code in cooccur.c, it is splitting the files up based on how much memory you have:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
while (fabs(rlimit - n * (log(n) + 0.1544313298)) > 1e-3) n = rlimit / (log(n) + 0.1544313298);
max_product = (long long) n;
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
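For a rough sense of scale, taking -memory 4.0 as an example and assuming CREC is GloVe's two-int-plus-double record (16 bytes), that works out as:

#include <stdio.h>

typedef struct { int word1, word2; double val; } CREC;  /* assumed layout, 16 bytes */

int main(void) {
    double memory_limit = 4.0;  /* the -memory setting, in GB */
    long long rlimit = 0.85 * memory_limit * 1073741824 / sizeof(CREC);
    long long overflow_length = rlimit / 6;  /* records buffered per temp file */
    printf("records per temp file: %lld\n", overflow_length);
    printf("approx temp file size: %.2f GB\n",
           overflow_length * (double)sizeof(CREC) / 1073741824);
    return 0;
}

With those numbers each temp file comes out around 0.57 GB, in the same ballpark as the ~512 MB files reported below, and the size scales linearly with -memory (so a much larger -memory means far fewer, much larger temp files).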
There is a flag -overflow-length to change the size of the files. However, presumably you'll run out of memory if you try to use that, unless it's off by a factor of 30. You could try adding more memory or using less data.
Another option would be to change merge_files to do two steps if there are more than 1000 (or ulimit) temp files, but I personally won't make that change unless this turns into a frequent issue, and I can guarantee no one else here will make such a change. If you make a PR with such a change, though, we'd be happy to integrate it.
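For the record, the two-step idea would look roughly like the sketch below. This is not GloVe's actual code: merge_batch is a hypothetical stand-in for a k-way merge that opens at most a fixed number of files at once (the role merge_files plays in cooccur.c), and the file names are made up.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a k-way merge that opens at most `count` inputs at once. */
static void merge_batch(char **inputs, int count, const char *output) {
    (void)inputs;
    printf("merging %d files into %s\n", count, output);
    /* ... the real k-way merge of sorted CREC records would go here ... */
}

/* If there are more temp files than the process may open at once,
 * merge them in batches into intermediate files, then merge those. */
static void two_pass_merge(char **temp_files, int num_temp,
                           int max_open, const char *final_output) {
    if (num_temp <= max_open) {
        merge_batch(temp_files, num_temp, final_output);
        return;
    }
    int num_batches = (num_temp + max_open - 1) / max_open;
    char **intermediates = malloc(num_batches * sizeof(char *));
    for (int b = 0; b < num_batches; b++) {
        int start = b * max_open;
        int count = (num_temp - start < max_open) ? num_temp - start : max_open;
        intermediates[b] = malloc(32);
        snprintf(intermediates[b], 32, "intermediate_%04d.bin", b);
        merge_batch(temp_files + start, count, intermediates[b]);
    }
    merge_batch(intermediates, num_batches, final_output);  /* second pass */
    for (int b = 0; b < num_batches; b++) free(intermediates[b]);
    free(intermediates);
}

int main(void) {
    char *files[] = {"overflow_0000.bin", "overflow_0001.bin", "overflow_0002.bin"};
    two_pass_merge(files, 3, 2, "cooccurrence.bin");  /* toy limit of 2 open files */
    return 0;
}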
@AngledLuffa thanks!
I have changed the --memory option and increased it to 90.0 (as the machine has a lot of RAM), and the files are now 9.5 GB each. To be on the safe side I also increased the maximum number of open files.
Let's see in a couple of weeks if it runs fine.
Oh, I see, it's not trying to detect how much memory there is but rather lets the user say how much there is. Now that I look for it, I see there isn't really any portable way of looking for free RAM in C.
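(The closest common options, such as sysconf(_SC_PHYS_PAGES), are glibc/BSD extensions rather than portable C; a quick sketch of that non-portable route, purely for illustration:)

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_PHYS_PAGES (total RAM) and, on Linux, _SC_AVPHYS_PAGES (available RAM)
     * are glibc/BSD extensions, not guaranteed by the C standard or POSIX. */
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    if (pages > 0 && page_size > 0)
        printf("total physical RAM: %.1f GB\n", pages * (double)page_size / 1073741824);
    return 0;
}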
Thanks!
I'm trying to re-train the GloVe embeddings on a large corpus.
The cooccur command generates a lot of new files (28745 to be precise), but then it crashes when recombining them (see below). I was wondering whether it would be possible to increase the size of these files (it seems they are 512 MB) to, let's say, 1 or 2 GB?
The option -overflow-length is not very clear to me in terms of how it should be used. The alternative, I guess, is that I might need to increase the maximum number of open files.
Any suggestions?