ousou closed 4 years ago
Thank you for sending this. I will merge the change which refactors common code into a single file later today. If you can rebase off that, that would be excellent. If not, I can redo this change to use the common code files myself.
I refactored this as desired and merged it, giving credit to you. Thanks!
Great, thanks for taking care of the refactoring!
After this change, whenever a file fails to open, the process logs the errno set by fopen together with the corresponding error description.
As an example, instead of the error message:
Writing cooccurrences to disk............1523 files in total. Merging cooccurrence files: processed 0 lines.Unable to open file overflow_1021.bin.
the process now outputs the following:
Merging cooccurrence files: processed 0 lines.Unable to open file overflow_1021.bin. Errno: 24 Error description: Too many open files
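For readers who want to see the general pattern, here is a minimal sketch of this kind of error reporting. It is not the code from this PR: `open_or_report` is a hypothetical helper name, and the message format only mirrors the example output above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not taken from the GloVe sources): open a file and,
 * on failure, log errno and its textual description to stderr. */
FILE *open_or_report(const char *path, const char *mode) {
    errno = 0;
    FILE *f = fopen(path, mode);
    if (f == NULL) {
        int err = errno;  /* save it before another call can overwrite it */
        fprintf(stderr,
                "Unable to open file %s. Errno: %d Error description: %s\n",
                path, err, strerror(err));
        errno = err;      /* restore so the caller can still inspect it */
    }
    return f;
}
```

A caller would use it like `FILE *fid = open_or_report("overflow_0000.bin", "rb");` and treat a NULL result as a failure, as with plain fopen.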
Some background to this PR: We actually encountered the error above when creating vectors for a large corpus (about 80 billion tokens). The issue was that the cooccur process tried to open too many files during the merge_files phase, and thus the process crashed. We solved that issue by increasing the number of allowed open files with the following command:
ulimit -n 2048
This solved the issue for us since we had about 1500 overflow files. The default limit for open files for a single process in Ubuntu seems to be 1024.
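As an aside, the same limit can be inspected and raised from inside a program on POSIX systems with getrlimit and setrlimit. The sketch below is only an illustration of that approach and is not part of this PR; `ensure_open_file_limit` is a hypothetical helper name, and an unprivileged process can only raise its soft limit up to the hard limit.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: make sure the soft limit on open file descriptors
 * is at least `needed`. Returns 0 on success, -1 on failure. */
int ensure_open_file_limit(rlim_t needed) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }
    /* Nothing to do if the current soft limit is already high enough. */
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur >= needed) {
        return 0;
    }
    /* The soft limit cannot be raised past the hard limit without privileges. */
    if (lim.rlim_max != RLIM_INFINITY && needed > lim.rlim_max) {
        fprintf(stderr, "Requested %llu open files, but the hard limit is %llu\n",
                (unsigned long long)needed, (unsigned long long)lim.rlim_max);
        return -1;
    }
    lim.rlim_cur = needed;
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling `ensure_open_file_limit(2048)` before the merge would correspond roughly to the ulimit command above. The requested value should leave some headroom, since stdin, stdout, stderr and any other open descriptors count against the same limit.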