srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

ReadToken errors in multi-gpu training #134

Open mixcoder opened 7 years ago

mixcoder commented 7 years ago

Hi all, in multi-GPU training I encounter strange errors, sometimes in "tr.iter*.1.log" and sometimes in "cv.iter*.1.log" (where "*" is a random iteration number). The error info is as follows:

cv.iter3.1.log:ERROR (train-ctc-parallel:ReadToken():io-funcs.cc:155) ReadToken, failed to read token at file position -1
cv.iter3.1.log:ERROR (train-ctc-parallel:ReadToken():io-funcs.cc:155) ReadToken, failed to read token at file position -1

or sometimes:

tr.iter2.1.log:ERROR (train-ctc-parallel:ReadToken():io-funcs.cc:155) ReadToken, failed to read token at file position -1
tr.iter2.1.log:ERROR (train-ctc-parallel:ReadToken():io-funcs.cc:155) ReadToken, failed to read token at file position -1

What are the possible reasons for this?

fmetze commented 7 years ago

Some stability problem with file access? Are these local files, or being accessed via NFS or some other shared file system? Always for process 1, i.e. the master thread?

mixcoder commented 7 years ago

The files are accessed via NFS; I am running CTC training on a cluster with 1 master node and 9 other worker nodes. By reading the source code and printing extra debug info, I found that sometimes the master thread (more precisely, job 1) begins to read a file (e.g. "nnet/nnet.iter1.cv.done.job20") that has not been completely written by another job (job 20) yet. This causes the error.

mixcoder commented 7 years ago

Hi @fmetze, I added a check in the code: if the file to be read has not been completely written yet, the master job waits for it. This seems to work for me, so I will close this issue. Thank you for your reply!
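
The exact patch is not shown in this thread; as a minimal sketch (the helper name and the one-second polling interval are assumptions, not the actual code), such a wait could look like:

#include <fstream>
#include <string>
#include <unistd.h>

// Sketch only: block until the given done-file exists and is non-empty,
// i.e. the job writing it has (very likely) finished flushing it.
static void WaitForDoneFile(const std::string &filename) {
  while (true) {
    std::ifstream infile(filename.c_str(),
                         std::ifstream::ate | std::ifstream::binary);
    if (infile.good() && infile.tellg() > 0) break;  // file exists and has content
    sleep(1);  // assumed polling interval: one second
  }
}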

fmetze commented 7 years ago

if the fix worked, would you mind submitting it? thanks!

mixcoder commented 7 years ago

ok, I will submit it in a few days!

ericbolo commented 7 years ago

@mixcoder, I am now encountering the same error in a multi-GPU setting, probably for the same reason, i.e. the master attempting to read a file that has not been written to yet.

Would you mind telling me which file you made your fix in, or, even better, sharing your fix code?

I could make the pull request for you if that's a hassle.

ericbolo commented 7 years ago

I wrote a quick hack to pinpoint the problem.

In net/communicator.h, adding a sleep(1) (one second) before the file is read fixes the issue, but adds a one-second latency before every such read (l. 150, just before ReadToken()).
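
For reference, the hack amounts to something like this (a sketch only; the surrounding code in communicator.h is omitted):

// src/net/communicator.h, around l. 150 (sketch; surrounding code omitted)
sleep(1);        // quick hack: give the writing job one second to finish
                 // flushing its done-file before we try to parse it
ReadToken(...);  // the existing read of the done-file, unchanged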

I'll work on a cleaner fix and submit it as a pull request in the next few days.

ericbolo commented 7 years ago

A colleague and I wrote a cleaner fix. The master job attempts to read from the subjob output file only if the file exists AND is not empty.

In src/base/kaldi-utils.cc, add the following function:

bool FileNotEmpty(const char *file_name) {
  std::ifstream infile(file_name, std::ifstream::ate | std::ifstream::binary);
  return (infile.tellg() == 0);
}

(add the corresponding function signature in src/base/kaldi-utils.h)

In src/net/communicator.h, read from the subjob output file only if the file exists AND is not empty:

if (FileExist(subjob_done_filename.c_str())) {
  if (FileNotEmpty(subjob_done_filename.c_str())) {
    // read the file
  }
}

@fmetze @riebling , does this seem like a good fix? Any other place where that race condition might occur in a multi-GPU setting? If satisfactory I will make a pull request.

fmetze commented 7 years ago

Has all of this been tried over NFS file systems, or local disks? I have not worked with this code in a long time, but I think the best way to make these errors go away is to run the training on a single node with a local file system, rather than a shared networked file system. Of course that may limit the number of GPUs you can access - not sure if this is a factor?

ericbolo commented 7 years ago

I do use NFS; I am storing the data in the cloud on AWS EBS (https://aws.amazon.com/fr/ebs/pricing/).

Storing the data locally would imply uploading all the data to my cloud instance every time (unless I'm missing something).


fmetze commented 7 years ago

It is typically a good strategy to copy all the data you are going to need onto the VM at the beginning of the training process anyway, because access to local disks will be much faster. Traffic into the VM is free, and you won't need to retrieve it afterwards.


ericbolo commented 6 years ago

Since this issue is NFS-specific, I won't submit the pull request.

However, if someone is on an NFS system (for instance, an AWS EC2 instance): I noticed a mistake in the code I wrote in my previous comment. To check that the file is not empty, the return expression should be "infile.tellg() > 0", NOT "infile.tellg() == 0".
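
For anyone applying this on NFS, the corrected helper from the earlier comment would then read (sketch):

#include <fstream>  // for std::ifstream

// src/base/kaldi-utils.cc (sketch, with the comparison corrected as noted above):
// returns true if the file can be opened and contains at least one byte.
bool FileNotEmpty(const char *file_name) {
  std::ifstream infile(file_name, std::ifstream::ate | std::ifstream::binary);
  return (infile.tellg() > 0);
}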