tensorflow / models

Models and examples built with TensorFlow

textsum: After a week of training, no loss reported, no files in train directory #716

Closed: theis188 closed this issue 7 years ago

theis188 commented 7 years ago

This is about the textsum model.

I ran the model in train mode on about 80k articles (vocabulary ~40k), but after about a week no training loss has been reported and no files have been written to the train directory. It appears that no training took place. A couple of notes:

Is my system just too slow? Too little memory to train this much? Thoughts?

I am running in decode mode right now to see if anything pops out somehow.

PS: I have run training + decoding 'successfully' with about 1k articles in the training set using a similar setup (though on a different machine).

PS Edit: I've been having some permission issues on the machine. Is it possible a permission issue prevented the train folder from being written to?
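One way to rule the permission question in or out is to try creating a file in the train directory directly. The snippet below is only a minimal sketch: `can_write_to` is a hypothetical helper name, not part of textsum, and the directory you pass in is whatever path your training run is configured to write to.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if a file can actually be created inside `directory`.

    Hypothetical helper for illustration; not part of the textsum code.
    """
    if not os.path.isdir(directory):
        return False
    try:
        # Creating a real temporary file exercises the same permissions a
        # checkpoint write would need. The file is removed when closed.
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True
```

If this returns False for the train directory, a permission problem could plausibly explain the missing files; if it returns True, the cause is more likely elsewhere.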

PS PS Edit: In a possibly related issue, I am seeing this error: pthread_cond_wait: Resource busy

theis188 commented 7 years ago

I discovered that the error came from the changes I had made to data.py. The model could not load any examples and therefore could not train.

I reverted to the original version of data.py, converted the input files, and the model is now training as expected.
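For anyone hitting the same symptom, a quick sanity check on the converted input file before starting a long run may save time. The sketch below assumes the file is a sequence of length-prefixed records (an 8-byte native-endian integer length followed by that many payload bytes); that layout is an assumption about the converted format and should be checked against your copy of data.py. `count_records` is a hypothetical helper, not a textsum function.

```python
import struct


def count_records(path, limit=None):
    """Count length-prefixed records in a binary file.

    Assumed layout (verify against your data.py): each record is an 8-byte
    native-endian signed integer length, followed by that many payload bytes.
    """
    count = 0
    with open(path, "rb") as reader:
        while limit is None or count < limit:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # clean end of file
            if len(len_bytes) < 8:
                raise ValueError("truncated length header after %d records" % count)
            (str_len,) = struct.unpack("q", len_bytes)
            if str_len < 0:
                raise ValueError("negative record length in record %d" % count)
            payload = reader.read(str_len)
            if len(payload) != str_len:
                raise ValueError("truncated payload in record %d" % count)
            count += 1
    return count
```

A result of 0, or a `ValueError`, would suggest the reader cannot get any examples out of the file, which matches the situation described above where the model sat idle without training.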