uta-smile / TCL

code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
MIT License

loss is nan when pretraining on my own dataset #4

Closed liangzimei closed 2 years ago

liangzimei commented 2 years ago

Hi, thanks for your excellent work. When I pre-train on my own Chinese dataset (so I changed bert-base-uncased to bert-base-chinese), the loss becomes nan after several iterations. I have tried decreasing the lr and adding grad_clip, but the problem still exists. [screenshot: training log showing nan loss] Here is my training config: [screenshot: training config]

Can you give me some suggestions? Thanks in advance.
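
For anyone debugging the same symptom, gradient clipping of the kind mentioned above usually goes right before the optimizer step. A minimal runnable sketch with a toy model and data, not TCL's actual training loop:

import torch

# toy stand-in for the real model and data, just to show where clipping goes
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for step in range(10):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    # cap the gradient norm before the optimizer step ("grad_clip")
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    # fail fast as soon as the loss stops being finite
    if not torch.isfinite(loss):
        raise RuntimeError(f'non-finite loss at step {step}: {loss.item()}')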

viyjy commented 2 years ago

Hi, thanks for your interest in our paper. Could you please use the following command line for pre-training?

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain \
--text_encoder bert-base-chinese

More details can be found in this line. Please let me know if this helps. Thanks.
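
For context, --text_encoder is the HuggingFace model name the tokenizer is built from; a small sketch, assuming the script resolves it through transformers as ALBEF-style code typically does:

from transformers import BertTokenizer

# --text_encoder selects which pretrained tokenizer/vocabulary is used
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
print(tokenizer.vocab_size)  # 21128 for bert-base-chinese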

viyjy commented 2 years ago

Sorry, I just realized that you are using bert-base-chinese. Can you show me config_bert_chinese.json? Thanks.

liangzimei commented 2 years ago

My config_bert_chinese.json is like this, just a copy of https://huggingface.co/ckiplab/bert-base-chinese/blob/main/config.json with fusion_layer and encoder_width added:

{
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128,
  "fusion_layer": 6,
  "encoder_width": 768
}
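
A quick way to sanity-check such a config (a sketch; the file path is an assumption) is to confirm it parses and that its vocab_size matches the tokenizer actually used for pre-training:

import json

from transformers import BertTokenizer

with open('configs/config_bert_chinese.json') as f:  # assumed path
    bert_config = json.load(f)

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

# vocabulary sizes must agree, and the two keys added for the fusion encoder must be present
assert bert_config['vocab_size'] == tokenizer.vocab_size
assert 'fusion_layer' in bert_config and 'encoder_width' in bert_config
print('config looks consistent')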

viyjy commented 2 years ago

How many GPUs are you using?

liangzimei commented 2 years ago

I also use 8 GPUs. And today I ran ALBEF with the same configs and dataset; the loss looks normal. [screenshot: ALBEF training log]
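
When one codebase diverges and another does not on the same data, PyTorch's anomaly detection can help locate the first backward operation that produces a nan; a toy runnable sketch, not TCL's actual losses:

import torch

# anomaly mode raises at the backward op that first returns nan,
# which narrows down where a nan loss comes from
with torch.autograd.detect_anomaly():
    x = torch.tensor(0.0, requires_grad=True)
    z = torch.sqrt(x) * 0.0   # gradient is 0 * inf = nan
    z.backward()              # raises a RuntimeError naming the offending backward function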

viyjy commented 2 years ago

Thanks. May I know which dataset you are using? I need to test it on my machine to reproduce the problem.

liangzimei commented 2 years ago

Sorry, the data was collected by myself, mainly from the '抖音' (Douyin) app. The image-text pairs consist of the videos' titles and cover images.

viyjy commented 2 years ago

Thanks. I will try to collect some data with Chinese text to figure out the problem. Will let you know ASAP.

liangzimei commented 2 years ago

Or I can upload a portion of the data to Google Drive? Would that be convenient for you?

viyjy commented 2 years ago

Sure, I appreciate it. You can email the data link to jinyu.yang@mavs.uta.edu

liangzimei commented 2 years ago

OK, I have already sent the data to you ~

viyjy commented 2 years ago

Hi, the reason for this nan loss is that your dataset contains empty captions; for example, one sample in your dataset is {'caption': '', 'image': '6912609964733271309_00001.jpg'}. Removing such invalid samples solves the problem. You can use the following code to remove them from the json file.

import json

json_path = 'data.json'
new_json_path = 'data_new.json'

# load the original annotation file
with open(json_path) as f:
    data = json.load(f)

# keep only samples whose caption is non-empty after stripping whitespace
new_data = []
for sample in data:
    if len(sample['caption'].strip()) != 0:
        new_data.append(sample)

# write the filtered annotations back out, keeping non-ASCII (Chinese) text readable
with open(new_json_path, 'w') as jsonfile:
    json.dump(new_data, jsonfile, ensure_ascii=False)

Feel free to let me know if you need additional information. Thanks.
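
A plausible reason empty captions end up as nan (an assumption, not confirmed in this thread): with HuggingFace-style masked-language-modelling, unmasked positions get the label -100, and an empty caption leaves no valid targets at all, so the mean cross-entropy is 0/0 = nan:

import torch
import torch.nn.functional as F

vocab_size = 21128                                  # bert-base-chinese vocabulary
logits = torch.randn(4, vocab_size)                 # 4 token positions
labels = torch.full((4,), -100, dtype=torch.long)   # every position ignored (nothing was masked)

# mean reduction divides by the number of non-ignored targets, here zero -> nan
print(F.cross_entropy(logits, labels, ignore_index=-100))  # tensor(nan)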

liangzimei commented 2 years ago

Thank you very much, it was my mistake.

caisarl76 commented 8 months ago

I was stuck at the same error and this helped me out! Thanks to @viyjy and @liangzimei too 😄