stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

overflow files: too many open files #193

Closed lfoppiano closed 3 years ago

lfoppiano commented 3 years ago

I'm trying to re-train the GloVe embeddings on a large corpus.

The cooccur command generates a lot of new files (28745 to be precise 😄), but it then crashes when recombining them; see below:

    Writing cooccurrences to disk..............28745 files in total.
    Merging cooccurrence files: processed 0 lines.Unable to open file overflow_1021.bin.
    Errno: 24
    Error description: Too many open files

I was wondering whether it would be possible to increase the size of these files (it seems they are 512 MB) to, let's say, 1 or 2 GB?

It's not clear to me how the option -overflow-length is supposed to be used:

    -overflow-length <int>
        Limit to length <int> the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array, before writing to disk. 
        This value overrides that which is automatically produced by '-memory'. Typically only needs adjustment for use with very large corpora.

The alternative, I guess, would be to increase the maximum number of open files.

Any suggestions?

AngledLuffa commented 3 years ago

It's not ideal, but can you change your ulimit?
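For reference, the limit the merge step is hitting is the per-process open-file limit, which ulimit -n adjusts in the shell you launch cooccur from. As a rough sketch (not GloVe code), the same soft limit can be inspected and raised from C with getrlimit/setrlimit; the 65536 target below is just an arbitrary example value:

    /* Minimal sketch (not GloVe code): check and raise the per-process
     * open-file limit, which is what `ulimit -n` adjusts from the shell.
     * The 65536 target is an arbitrary example value. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long) rl.rlim_cur, (unsigned long long) rl.rlim_max);
        /* Raise the soft limit toward the hard limit; an unprivileged
         * process cannot go above the hard limit. */
        rl.rlim_cur = rl.rlim_max < 65536 ? rl.rlim_max : 65536;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }
        return 0;
    }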

AngledLuffa commented 3 years ago

It looks like, based on reading the code in cooccur.c, it is splitting the files up based on how much memory you have:

    rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
    while (fabs(rlimit - n * (log(n) + 0.1544313298)) > 1e-3) n = rlimit / (log(n) + 0.1544313298);
    max_product = (long long) n;
    overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1

There is a flag -overflow-length to change the size of the files. However, presumably you'll run out of memory if you try to use that, unless it's off by a factor of 30. You could try adding more memory or using less data.
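For a rough sense of the numbers, here is a back-of-the-envelope sketch (not GloVe code) that assumes CREC is the 16-byte record from cooccur.c (two ints plus a double) and the default -memory 4.0:

    /* Back-of-the-envelope sketch (not GloVe code): how -memory maps to
     * the overflow-file size.  Assumes CREC is the 16-byte record from
     * cooccur.c (two ints plus a double) and the default -memory 4.0. */
    #include <stdio.h>

    int main(void) {
        double memory_limit = 4.0;   /* value of -memory, in GB (assumed default) */
        long long crec_size = 16;    /* assumed sizeof(CREC): 2 * int + 1 * double */
        double rlimit = 0.85 * memory_limit * 1073741824.0 / crec_size;
        long long overflow_length = (long long) rlimit / 6;
        printf("overflow_length    = %lld records\n", overflow_length);
        printf("overflow file size = %.2f GB\n",
               overflow_length * crec_size / 1073741824.0);
        return 0;
    }

With those assumptions it prints about 38 million records and ~0.57 GB per file, which is in the same ballpark as the ~512 MB files above; asking for 2 GB files would mean holding a ~2 GB overflow buffer in RAM on top of the dense array, which is why I'd expect -overflow-length alone to push you into memory trouble.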

Another option would be to change merge_files to do two steps if there are more than 1000 (or ulimit) temp files, but I personally won't make that change unless this turns into a frequent issue, and I can guarantee no one else here will make such a change. If you make a PR with such a change, though, we'd be happy to integrate it.
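If someone does want to try it, here is a very rough sketch of the batching idea (not the project's merge_files: it assumes the overflow files are sorted streams of 16-byte CREC records, replaces the priority-queue merge with a simple linear min-scan, and leaves out error handling and the progress output):

    /* Sketch of a two-level merge for when the number of overflow files
     * exceeds the open-file limit.  NOT the project's merge_files(): it
     * assumes the inputs are sorted streams of CREC records and uses a
     * linear min-scan instead of GloVe's priority queue. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef double real;
    typedef struct { int word1; int word2; real val; } CREC;

    #define BATCH 1000   /* keep safely below the open-file limit */

    /* Merge n sorted CREC files into out, summing duplicate (word1, word2) pairs. */
    static void merge_batch(char **names, int n, FILE *out) {
        FILE **in   = malloc(n * sizeof(FILE *));
        CREC  *cur  = malloc(n * sizeof(CREC));
        int   *live = malloc(n * sizeof(int));
        int i, have = 0;
        CREC acc = {0, 0, 0.0};
        for (i = 0; i < n; i++) {
            in[i]   = fopen(names[i], "rb");
            live[i] = in[i] && fread(&cur[i], sizeof(CREC), 1, in[i]) == 1;
        }
        for (;;) {
            int best = -1;
            for (i = 0; i < n; i++) {        /* pick the smallest current record */
                if (!live[i]) continue;
                if (best < 0 || cur[i].word1 < cur[best].word1 ||
                    (cur[i].word1 == cur[best].word1 && cur[i].word2 < cur[best].word2))
                    best = i;
            }
            if (best < 0) break;             /* every input is exhausted */
            if (have && acc.word1 == cur[best].word1 && acc.word2 == cur[best].word2)
                acc.val += cur[best].val;    /* same pair seen again: accumulate */
            else {
                if (have) fwrite(&acc, sizeof(CREC), 1, out);
                acc = cur[best];
                have = 1;
            }
            live[best] = fread(&cur[best], sizeof(CREC), 1, in[best]) == 1;
        }
        if (have) fwrite(&acc, sizeof(CREC), 1, out);
        for (i = 0; i < n; i++) if (in[i]) fclose(in[i]);
        free(in); free(cur); free(live);
    }

    /* First pass: merge the inputs in batches of BATCH into intermediate
     * files; second pass: merge the intermediates into the final file. */
    static void merge_two_level(char **names, int total, const char *final_name) {
        int n_mid = (total + BATCH - 1) / BATCH, b;
        char **mid = malloc(n_mid * sizeof(char *));
        for (b = 0; b < n_mid; b++) {
            int count = (b + 1) * BATCH <= total ? BATCH : total - b * BATCH;
            mid[b] = malloc(64);
            sprintf(mid[b], "intermediate_%04d.bin", b);
            FILE *out = fopen(mid[b], "wb");
            merge_batch(names + b * BATCH, count, out);
            fclose(out);
        }
        FILE *final_out = fopen(final_name, "wb");
        merge_batch(mid, n_mid, final_out);
        fclose(final_out);
        for (b = 0; b < n_mid; b++) free(mid[b]);
        free(mid);
    }

    int main(int argc, char **argv) {
        /* usage sketch: ./two_level_merge out.bin overflow_0000.bin ... */
        if (argc < 3) { fprintf(stderr, "usage: %s <out> <in...>\n", argv[0]); return 1; }
        merge_two_level(argv + 2, argc - 2, argv[1]);
        return 0;
    }

A real patch would presumably reuse the existing priority-queue merge for each batch rather than this simplified version.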

lfoppiano commented 3 years ago

@AngledLuffa thanks!

I changed --memory and increased it to 90.0 (as the machine has a lot of RAM), and the files are now 9.5 GB each. To be on the safe side I also increased the maximum number of open files.

Let's see in a couple of weeks whether it runs fine 😉

AngledLuffa commented 3 years ago

Oh, I see, it's not trying to detect how much memory there is but rather lets the user say how much there is. Now that I look for it, I see there isn't really any portable way of looking for free RAM in C.
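The closest thing I'm aware of is sysconf, and even that isn't portable; a quick sketch for illustration (the _SC_PHYS_PAGES / _SC_AVPHYS_PAGES constants are glibc extensions rather than POSIX, which is exactly the problem):

    /* Sketch: query physical RAM via sysconf().  _SC_PHYS_PAGES and
     * _SC_AVPHYS_PAGES are glibc extensions, not POSIX, so this is
     * not a portable solution. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long page  = sysconf(_SC_PAGESIZE);
        long phys  = sysconf(_SC_PHYS_PAGES);    /* total physical pages */
        long avail = sysconf(_SC_AVPHYS_PAGES);  /* "available" pages (glibc) */
        if (page < 0 || phys < 0 || avail < 0) { perror("sysconf"); return 1; }
        printf("total RAM: %.2f GB, available: %.2f GB\n",
               phys * (double) page / 1073741824.0,
               avail * (double) page / 1073741824.0);
        return 0;
    }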

lfoppiano commented 3 years ago

Thanks!