texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Problem with contrastive loss in pretrain stage #94

Open · tien-ngnvan opened this issue 8 months ago

tien-ngnvan commented 8 months ago

Thanks for your great work. I'm running into a problem when reusing the hyperparameters from the NQ example for second-stage pre-training, similar to coCondenser (we call this the uptrain stage, trained with a contrastive loss). Each training example contains 1 query, 1 positive passage, and 10 negative passages, loaded through our custom dataloader from a streaming dataset (two languages, 25M triplets). Our model is based on bert-base-multilingual-cased and has already gone through continued pre-training with the MLM loss. However, pre-training with the contrastive loss does not seem to converge. Here is the training script:

python -m torch.distributed.launch --nproc_per_node=8 -m asymmetric.train \
    --model_name_or_path 'asymmetric/checkpoint-10000' \
    --streaming \
    --output $saved_path \
    --do_train \
    --train_dir 'data/train' \
    --max_steps 10000 \
    --per_device_train_batch_size 32 \
    --dataset_num_proc 2 \
    --train_n_passages 8 \
    --gc_q_chunk_size 8 \
    --gc_p_chunk_size 64 \
    --untie_encoder \
    --negatives_x_device \
    --learning_rate 5e-4 \
    --weight_decay 1e-2 \
    --warmup_ratio 0.1 \
    --save_steps 1000 \
    --save_total_limit 20 \
    --logging_steps 50 \
    --q_max_len 128 \
    --p_max_len 384 \
    --fp16 \
    --report_to 'wandb' \
    --overwrite_output_dir
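
For context, the quantity being optimized here is (roughly) the in-batch softmax contrastive loss sketched below; this is a minimal illustration, not the toolkit's or asymmetric.train's actual code. Note also that, if the stock Tevatron data pipeline is used, --train_n_passages counts the positive plus its negatives, so a value of 8 would draw only 7 of the 10 mined negatives per query.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_reps, p_reps, n_passages):
    # q_reps: (B, d) query embeddings
    # p_reps: (B * n_passages, d) passage embeddings, grouped per query as
    #         [positive, negative_1, ..., negative_{n_passages - 1}]
    scores = q_reps @ p_reps.T                                        # (B, B * n_passages)
    target = torch.arange(q_reps.size(0), device=scores.device) * n_passages
    return F.cross_entropy(scores, target)                            # label = index of the positive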

[image: training loss curve]
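
As a side note on two of the flags above: --negatives_x_device typically means the query/passage embeddings are gathered from all 8 GPUs so each query is contrasted against the full cross-device in-batch pool, while the GradCache chunk sizes (--gc_q_chunk_size / --gc_p_chunk_size) only split the forward/backward computation to fit memory and leave the loss unchanged. A minimal sketch of such a gather, assuming a standard torch.distributed process group (the function name is illustrative, not Tevatron's):

import torch
import torch.distributed as dist

def gather_across_devices(local_reps):
    # all_gather returns tensors with no gradient history, so the local shard
    # is substituted back in to keep this rank's autograd path intact.
    gathered = [torch.empty_like(local_reps) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_reps.contiguous())
    gathered[dist.get_rank()] = local_reps
    return torch.cat(gathered, dim=0)   # (world_size * B, d) pooled representations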

luyug commented 6 months ago

To understand this better, can you elaborate on what hardware is used?