Open sunxiaojie99 opened 5 months ago
Hi Xiaojie, I trained repllama (passage) on 16 V100 32GB GPUs, which took around 1 day. I think 80 hours on a single A800 GPU is a reasonable time. On msmarco-doc, if the max input length is set to 2048, it will take 3 days on 16 GPUs.
Hi Xueguang, @MXueguang
Thank you very much for sharing your code. However, when I tested it on a small test MSMARCO passage corpus (the first 100 passages), I encountered an issue: after encoding, the embeddings of some passages turned out to be NaN. Have you experienced this problem?
The part of your code that I modified is located here: https://github.com/texttron/tevatron/blob/2e5d00ee21d5a7db0bd2ea1463c9150a572106d4/examples/repllama/utils.py#L41. I made these changes for two reasons: 1) xformers was not functioning correctly in my environment. If possible, I would like to know why you replace the forward function. Is this step necessary? 2) The attention_mask input to the custom_forward function does not seem to be used in the subsequent code. Does this mean that the padding positions will still receive attention?
Please forgive my limited experience in this area. Your insights would be greatly appreciated.
Here are the changes I made:
# Original code
attn_weights = None
attn_output = xops.memory_efficient_attention(
    query_states.transpose(1, 2), key_states.transpose(1, 2), value_states.transpose(1, 2),
    attn_bias=xops.LowerTriangularMask()
).reshape(bsz, q_len, self.hidden_size)
Modified to:
# Scale queries for dot-product attention
query_states = query_states / (self.head_dim ** 0.5)
# Dot-product attention: [bsz, num_heads, q_len, head_dim] x [bsz, num_heads, head_dim, kv_len]
attn_scores = torch.matmul(query_states, key_states.transpose(-2, -1))
# Apply lower triangular (causal) mask over the last two dims (q_len, kv_len)
if attn_scores.size(-2) == attn_scores.size(-1):
    # Only square score matrices require masking
    mask = torch.tril(torch.ones_like(attn_scores.float())).type_as(attn_scores)
    attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
# Apply attention mask (additive, so padding positions get large negative scores)
if attention_mask is not None:
    attn_scores = attn_scores + attention_mask
attn_probs = torch.softmax(attn_scores, dim=-1)
attn_output = torch.matmul(attn_probs, value_states)
attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)
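On the question of whether padding positions still receive attention, here is a minimal pure-Python sketch. It assumes right padding, which is an assumption about the tokenizer setup and is not confirmed in this thread. Under that assumption a lower-triangular mask alone already keeps every real token from attending to padding, while with left padding it does not. The helper names are hypothetical and not part of tevatron.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def real_tokens_see_padding(is_real, mask):
    """Return True if any real (non-padding) query can attend to a padding key."""
    n = len(is_real)
    return any(
        is_real[i] and mask[i][j] and not is_real[j]
        for i in range(n)
        for j in range(n)
    )


mask = causal_mask(5)
right_padded = [True, True, True, False, False]  # 3 real tokens, then 2 pads
left_padded = [False, False, True, True, True]   # 2 pads, then 3 real tokens

print(real_tokens_see_padding(right_padded, mask))  # False
print(real_tokens_see_padding(left_padded, mask))   # True
```

So with right padding the outputs at real token positions are unaffected by the padding, as long as the pooled representation is read from a real token position.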
My transformers version is 4.31.0. I think later versions have some issue here. It is OK to remove the flash attention replacement and use the default llama class. I'll update the code to fit the latest transformers, and I am working on a refactor here: https://github.com/texttron/tevatron/tree/refactor
By the way, the repllama code in tevatron is a re-implementation, and due to limited resources I didn't get a chance to do very detailed tests. Feel free to let me know about any issues there.
ok~ so I only need to comment out this line of code, replace_with_xformers_attention(), in train.py? I will run it again to check if everything is normal. Thank you!
yes, in train.py and encode.py
Hi Xueguang, I think I've found the issue with the NaN embedding. I've noticed that when we use fp16 during encoding, this problem occurs. However, when we switch to fp32, everything seems fine. By the way, could I ask you to provide the training data (or the co-condenser hard negative) for MSMARCO-passage/doc used in your paper 'Fine-Tuning LLaMA for Multi-Stage Text Retrieval'?
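A minimal sketch related to the observation above, using only the Python standard library. The first helper shows that fp16 cannot hold finite values above 65504, whereas bf16 keeps roughly the fp32 exponent range; whether overflow is what actually produces the NaN embeddings here is an unconfirmed assumption. The second helper is a hypothetical way to locate affected rows in a list of embeddings.

```python
import math
import struct


def fits_in_fp16(value):
    """Return True if `value` can be stored as a finite IEEE half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def nan_rows(embeddings):
    """Return indices of embedding rows that contain at least one NaN."""
    return [i for i, row in enumerate(embeddings) if any(math.isnan(x) for x in row)]


print(fits_in_fp16(65504.0))  # True: largest finite fp16 value
print(fits_in_fp16(70000.0))  # False: would overflow in fp16

toy_embeddings = [[0.1, 0.2], [float("nan"), 0.3], [0.5, 0.6]]
print(nan_rows(toy_embeddings))  # [1]
```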
It's a bit weird that fp16 doesn't work, since the model was fine-tuned with fp16. I'll take a look.
I created training data for repllama in tevatron format; it can be downloaded here: https://www.dropbox.com/scl/fi/pkm1mtgfobae9kuesp7dr/train-tevatron.jsonl?rlkey=2thutc4zkozr9jp4zbbrz5rvi&dl=0
Hi @sunxiaojie99, are you getting a training log similar to the one in https://github.com/texttron/tevatron/issues/104?
I just completed the test on the small corpus. I will run the entire process later and then confirm this.
Thanks for sharing! Does this JSON file contain both the MSMARCO passage and document datasets? By the way, bf16 is actually used during fine-tuning. When I test using bf16 during encoding, the NaN issue doesn't appear either. So, I guess the fine-tuning process will run smoothly.
I trained repllama on V100 GPUs, which only support fp16. When I added the implementation to tevatron I worked on an A6000, so bf16 also works. But the released model was trained with fp16. I'll take a look at the NaN issue next week.
The data in the above link is the training data for passage ranking. The document data is bigger; I'll upload it later.
Okay, I sincerely appreciate your help! Please remind me when the document data is ready.
Hi Xueguang,
Sorry to bother you again. I have completed the training process for RepLLaMA. However, it seems that encoding the MSMARCO passage corpus requires at least 300 hours. I've noticed that Tevatron doesn't support multi-GPU encoding. Could you tell me how long the encoding process took for you? Also, is the document data ready? Haha.
Hi Xiaojie,
300 hours on a single GPU is reasonable. Tevatron doesn't support multi-GPU encoding, but an efficient way is to encode the corpus by shard and run the shards in parallel, one per GPU. An example is below.
mkdir beir_embedding_scifact
for s in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$s python encode.py \
    --output_dir=temp \
    --model_name_or_path castorini/repllama-v1-7b-lora-passage \
    --tokenizer_name meta-llama/Llama-2-7b-hf \
    --fp16 \
    --per_device_eval_batch_size 16 \
    --p_max_len 512 \
    --dataset_name Tevatron/beir-corpus:scifact \
    --encoded_save_path beir_embedding_scifact/corpus_scifact.${s}.pkl \
    --encode_num_shard 4 \
    --encode_shard_index ${s} &
done
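Once all shards finish, the per-shard files need to be combined before search. Below is a minimal sketch of that step. It assumes each shard pickle holds a pair (embeddings, ids); that layout is an assumption here, so check what encode.py actually writes. `merge_shards` is a hypothetical helper, and the toy shards use plain lists in place of real embedding arrays.

```python
import glob
import os
import pickle
import tempfile


def merge_shards(pattern):
    """Concatenate (embeddings, ids) pairs from all shard files matching `pattern`."""
    all_embeddings, all_ids = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            embeddings, ids = pickle.load(f)
        all_embeddings.extend(embeddings)
        all_ids.extend(ids)
    return all_embeddings, all_ids


# Toy demonstration with two fake shards written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_shards = [([[0.1, 0.2]], ["d0"]), ([[0.3, 0.4]], ["d1"])]
    for s, (emb, ids) in enumerate(fake_shards):
        with open(os.path.join(tmp, f"corpus_scifact.{s}.pkl"), "wb") as f:
            pickle.dump((emb, ids), f)
    merged_embeddings, merged_ids = merge_shards(
        os.path.join(tmp, "corpus_scifact.*.pkl")
    )

print(merged_ids)  # ['d0', 'd1']
```

Because ids are appended in the same order as their embeddings, the order in which shard files are read does not affect the id-to-embedding alignment.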
oops.. thanks for the reminder...uploading the document data now.
Hi Xiaojie, the processed training data for document ranking is big and hard to upload. Below is a slim version, with the processed corpus and training data, but it needs a conversion step to get it into tevatron format. https://www.dropbox.com/scl/fi/rbxa9u0dusa4g3fh8sn9j/repllama-doc-slim-corpus.jsonl?rlkey=8ddybs8xt8lq723hks0y2uhku&dl=0 https://www.dropbox.com/scl/fi/sz3oqve6tln2hird03cxv/repllama-doc-slim-train.jsonl?rlkey=t1kjx1wdxky4zjo3zglo6yxzq&dl=0
Ok, thanks! Actually, I think I only need the CoCondenser-MaxP hard negatives for the document ranking data to reliably reproduce the results of the paper. By the way, is the slim version obtained by sampling a smaller proportion?
The hard negatives should be the top 100 from BM25 and the top 100 from CoCondenser, but document contents are not saved in the training data, to save space.
Okay~ Could you also tell me the other parameters, such as the size of p?
Hi @sunxiaojie99, sorry I missed your latest comment. What do you mean by the size of p? The truncation size? For MSMARCO document, we truncate the document by 10 sentences, with a sliding window of 5 sentences.
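A minimal sketch of one reading of the setting above: segments of 10 sentences, with the window moving 5 sentences at a time. Interpreting "sliding window of 5" as the stride is an assumption, sentence splitting is taken as already done, and `sliding_windows` is a hypothetical helper rather than the code used for the paper.

```python
def sliding_windows(sentences, window=10, stride=5):
    """Group sentences into overlapping segments of `window` sentences, moving `stride` at a time."""
    segments = []
    for start in range(0, len(sentences), stride):
        segments.append(sentences[start:start + window])
        if start + window >= len(sentences):
            break  # the last window already reaches the end of the document
    return segments


sentences = [f"s{i}" for i in range(12)]
for segment in sliding_windows(sentences):
    # Prints "s0 ... s9 10" and then "s5 ... s11 7"
    print(segment[0], "...", segment[-1], len(segment))
```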
ValueError: Unsupported model class DenseModel(
I am getting this error when saving a checkpoint.
Hi~ I am trying to reproduce the results of RepLLaMA. I have an A800 GPU. If I train RepLLaMA from scratch with your code, it may take about 80 hours. Is this normal? If possible, I would like to know the time cost of training RepLLaMA (LoRA) on the MSMARCO passage and doc datasets. Thank you very much. @MXueguang