google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ERROR:tensorflow: Failed to close session after error.Other threads may hang. #102

Closed etetteh closed 3 years ago

etetteh commented 3 years ago

I am trying to pretrain my own ELECTRA-base model, but I keep getting this output:

Running training
================================================================================
2020-11-13 08:00:18.044763: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Model is built!
2020-11-13 08:00:48.956655: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
     [[{{node ParseSingleExample/ParseSingleExample}}]]
     [[input_pipeline_task0/while/IteratorGetNext]]
ERROR:tensorflow:Closing session due to error From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
     [[{{node ParseSingleExample/ParseSingleExample}}]]
     [[input_pipeline_task0/while/IteratorGetNext]]
2020-11-13 08:01:08.642776: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642525410","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-11-13 08:01:08.642779: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642549072","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
ERROR:tensorflow:

Failed to close session after error.Other threads may hang.

2020-11-13 08:01:50.857700: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
     [[{{node ParseSingleExample/ParseSingleExample}}]]
     [[input_pipeline_task0/while/IteratorGetNext]]
briverse17 commented 3 years ago

I keep getting these errors, too.

I have tried installing and using other Python versions. I also changed the TensorFlow version to 1.15.dev20190909. None of the above solved the problem.

Waiting for a possible solution.

etetteh commented 3 years ago

I have tried every possible solution, and none of them works. I'm just wondering how the authors trained their model, because I really need this working for a project with a hard deadline.

briverse17 commented 3 years ago

Hi, @etetteh

I think I figured out the problem: the max_seq_length used for pretraining must not exceed the max_seq_length used when building the tfrecords.

I built my tfrecords with max_seq_length = 128 (the default), so I cannot pretrain with max_seq_length = 256 or 512.

I tried setting max_seq_length = 128 and trained a small model. Things went smoothly!
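
For anyone hitting the same thing: the pretraining input pipeline parses every serialized Example with fixed-length features of size max_seq_length, so records written with 128 tokens cannot be parsed once the trainer expects 256 or 512. Below is a minimal, self-contained Python sketch (not code from this repo; the feature values are made up) that reproduces the same infeed error seen in the log:

# Minimal sketch of the "Key: segment_ids. Can't parse serialized Example."
# failure: a record serialized with 128 segment_ids cannot be parsed with a
# 512-length FixedLenFeature.
import tensorflow as tf

tf.compat.v1.enable_eager_execution()  # so parse errors surface immediately on TF 1.15

# A fake pretraining record whose segment_ids were written with length 128.
example = tf.train.Example(features=tf.train.Features(feature={
    "segment_ids": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[0] * 128)),
}))
serialized = example.SerializeToString()

# Parsing with the length the record was built with works fine.
tf.io.parse_single_example(serialized, {
    "segment_ids": tf.io.FixedLenFeature([128], tf.int64)})

# Parsing with a larger max_seq_length (e.g. 512) fails exactly like the
# error recorded from the TPU infeed above.
try:
    tf.io.parse_single_example(serialized, {
        "segment_ids": tf.io.FixedLenFeature([512], tf.int64)})
except tf.errors.InvalidArgumentError as e:
    print(e.message)  # Key: segment_ids.  Can't parse serialized Example.

In practice (if I recall the flags correctly) this means the max_seq_length hparam passed to run_pretraining.py has to match the --max-seq-length used with build_pretraining_dataset.py, which is 128 by default.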

Regards,

etetteh commented 3 years ago

Great. I was about to comment that I fixed mine too. I had to change the same thing, plus sort out some environment issues.

etetteh commented 3 years ago

@briverse17 Pretraining was successful, but I am getting this error during fine-tuning. Did you have a similar problem?