texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

About RepLLaMA #103

Open sunxiaojie99 opened 5 months ago

sunxiaojie99 commented 5 months ago

Hi~ I am trying to reproduce the results of RepLLaMA. I have one A800 GPU. If I start training RepLLaMA from scratch with your code, it looks like it will take about 80 hours; is this normal? If possible, I would also like to know the time cost of training RepLLaMA (LoRA) on the MS MARCO passage and document datasets. Thank you very much. @MXueguang

MXueguang commented 5 months ago

Hi Xiaojie, I trained RepLLaMA (passage) on 16 V100 32GB GPUs, which took me around 1 day. I think 80 hours on a single A800 GPU is a reasonable time. On MS MARCO document, if the max input length is set to 2048, it will take 3 days on 16 GPUs.

sunxiaojie99 commented 5 months ago

Hi Xueguang, @MXueguang

Thank you very much for sharing your code. However, when I tested it on a small MS MARCO passage test corpus (the first 100 passages), I encountered an issue: after encoding, the embeddings of some passages turned out to be NaN. Have you encountered this problem?

The part of your code that I modified is located here: https://github.com/texttron/tevatron/blob/2e5d00ee21d5a7db0bd2ea1463c9150a572106d4/examples/repllama/utils.py#L41. I made these changes for two reasons: 1) xformers was not functioning correctly in my environment; if possible, I would like to know why you replace the forward function, and whether this step is necessary. 2) The attention_mask input to the custom forward function did not seem to be used in the subsequent code; does this mean that the padding positions still receive attention?

Please forgive my limited experience in this area. Your insights would be greatly appreciated.

Here are the changes I made:

# Original code
        attn_weights = None
        attn_output = xops.memory_efficient_attention(
            query_states.transpose(1, 2), key_states.transpose(1, 2), value_states.transpose(1, 2),
            attn_bias=xops.LowerTriangularMask()
        ).reshape(bsz, q_len, self.hidden_size)

Modified to:

        # Scale queries for dot-product attention
        query_states = query_states / (self.head_dim ** 0.5)

        # Dot-product attention: [bsz, num_heads, q_len, head_dim] x [bsz, num_heads, head_dim, kv_len] -> [bsz, num_heads, q_len, kv_len]
        attn_scores = torch.matmul(query_states, key_states.transpose(-2, -1))

        # Apply the lower triangular (causal) mask
        if attn_scores.size(-2) == attn_scores.size(-1):
            # Only apply the causal mask when the score matrix is square (q_len == kv_len)
            mask = torch.tril(torch.ones_like(attn_scores.float())).type_as(attn_scores)
            attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))

        # Apply the additive attention (padding) mask
        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask

        attn_probs = torch.nn.functional.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_probs, value_states)

        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)
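
As an aside, one thing that can produce NaN rows with this kind of manual masking, especially in half precision, is a query position whose scores are all -inf (for example a padded position under a causal plus padding mask): softmax over an all -inf row returns NaN. A possible guard, just a sketch and not part of the original code, is to fill with the dtype's minimum value instead of -inf and clamp after adding the padding mask, so fully masked rows degrade to a uniform distribution instead of NaN:

        # Sketch: use the dtype minimum instead of -inf so fully masked rows don't become NaN.
        min_value = torch.finfo(attn_scores.dtype).min
        attn_scores = attn_scores.masked_fill(mask == 0, min_value)
        if attention_mask is not None:
            # Clamp so the additive padding mask cannot push scores to -inf in fp16.
            attn_scores = torch.clamp(attn_scores + attention_mask, min=min_value)
        attn_probs = torch.nn.functional.softmax(attn_scores, dim=-1, dtype=torch.float32).to(query_states.dtype)
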
MXueguang commented 5 months ago

My transformers version is 4.31.0; I think later versions have some issue here. It is OK to remove the flash attention replacement and use the default Llama class. I'll update the code to make it fit the latest transformers, and I am working on a refactor here: https://github.com/texttron/tevatron/tree/refactor

BTW, the RepLLaMA code in Tevatron is a re-implementation, and due to limited resources I didn't get a chance to do very detailed tests. Feel free to let me know about any issues.

sunxiaojie99 commented 5 months ago

OK~ So I only need to comment out the call to replace_with_xformers_attention() in train.py? I will run it again to check that everything is normal. Thank you!

MXueguang commented 5 months ago

So I only need to comment out the call to replace_with_xformers_attention() in train.py?

Yes, in both train.py and encode.py.

sunxiaojie99 commented 5 months ago

Hi Xueguang, I think I've found the cause of the NaN embeddings. I've noticed that the problem occurs when we use fp16 during encoding; when we switch to fp32, everything seems fine. By the way, could I ask you to provide the training data (or the CoCondenser hard negatives) for MS MARCO passage/doc used in your paper 'Fine-Tuning LLaMA for Multi-Stage Text Retrieval'?
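
For reference, this is roughly how to check the saved embeddings for NaN. It assumes each shard pickle produced by encode.py holds an (embeddings, lookup_ids) tuple, and the file name below is just a placeholder; both are assumptions worth verifying against the actual files:

import pickle
import numpy as np

# Assumed layout: each shard pickle stores (embeddings, lookup_ids).
with open("corpus_emb.0.pkl", "rb") as f:
    reps, lookup = pickle.load(f)

reps = np.asarray(reps, dtype=np.float32)
bad = np.isnan(reps).any(axis=1)
print(f"{int(bad.sum())} / {len(lookup)} passages contain NaN embeddings")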

MXueguang commented 5 months ago

It's a bit weird that fp16 doesn't work... the model was fine-tuned with fp16... I'll take a look.

I created training data for RepLLaMA in Tevatron format; it can be downloaded here: https://www.dropbox.com/scl/fi/pkm1mtgfobae9kuesp7dr/train-tevatron.jsonl?rlkey=2thutc4zkozr9jp4zbbrz5rvi&dl=0
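
For anyone picking this up later, a quick way to inspect the file is below. The field names (query, positive_passages, negative_passages) are what Tevatron-style training data usually looks like, so treat them as an assumption and check against the actual file:

import json

# Peek at the first training example from the shared file.
with open("train-tevatron.jsonl") as f:
    example = json.loads(f.readline())

print(example.keys())
print(example.get("query"))
print(len(example.get("positive_passages", [])), len(example.get("negative_passages", [])))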

MXueguang commented 5 months ago

Hi @sunxiaojie99, are you getting a training log similar to https://github.com/texttron/tevatron/issues/104?

sunxiaojie99 commented 5 months ago

Hi @sunxiaojie99, are you getting a training log similar to #104?

I just completed the test on the small corpus. I will run the entire process later and then confirm this.

sunxiaojie99 commented 5 months ago

It's a bit weird that fp16 doesn't work... the model was fine-tuned with fp16... I'll take a look.

I created training data for RepLLaMA in Tevatron format; it can be downloaded here: https://www.dropbox.com/scl/fi/pkm1mtgfobae9kuesp7dr/train-tevatron.jsonl?rlkey=2thutc4zkozr9jp4zbbrz5rvi&dl=0

Thanks for sharing! Does this JSONL file contain both the MS MARCO passage and document datasets? By the way, bf16 is actually used during fine-tuning. When I test using bf16 during encoding, the NaN issue doesn't appear either. So I guess the fine-tuning process will run smoothly.

MXueguang commented 5 months ago

I trained RepLLaMA on V100 GPUs, which only support fp16. When I added the implementation to Tevatron I worked on an A6000, so bf16 also works. But the released model was trained with fp16. I'll take a look at the NaN issue next week.

The data in the above link is the training data for passage ranking.
The document data is bigger; I'll upload it later.
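
On the fp16/bf16 point, loading the released adapter in bf16 for encoding can be done roughly like this; a sketch with transformers and peft rather than the exact Tevatron code path:

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Sketch: load the base model in bf16 and merge the released LoRA adapter for encoding.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "castorini/repllama-v1-7b-lora-passage")
model = model.merge_and_unload().eval().cuda()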

sunxiaojie99 commented 5 months ago

I trained RepLLaMA on V100 GPUs, which only support fp16. When I added the implementation to Tevatron I worked on an A6000, so bf16 also works. But the released model was trained with fp16. I'll take a look at the NaN issue next week.

The data in the above link is the training data for passage ranking. The document data is bigger; I'll upload it later.

Okay, I sincerely appreciate your help! Please let me know when the document data is ready.

sunxiaojie99 commented 5 months ago

Hi Xueguang,

Sorry to bother you again. I have completed the training process for RepLLaMA. However, it seems that encoding the MS MARCO passage corpus requires at least 300 hours. I've noticed that Tevatron doesn't support multi-GPU encoding. Could you tell me how long the encoding process took for you? Also, is the document data ready? Haha.

MXueguang commented 5 months ago

Hi Xiaojie,

300 hours on a single GPU is reasonable. Tevatron doesn't support multi-GPU encoding, but an efficient way is to encode the corpus in shards and run the shards in parallel. An example is below.

mkdir beir_embedding_scifact
for s in 0 1 2 3;
do
CUDA_VISIBLE_DEVICES=$s python encode.py \
  --output_dir=temp \
  --model_name_or_path castorini/repllama-v1-7b-lora-passage \
  --tokenizer_name meta-llama/Llama-2-7b-hf \
  --fp16 \
  --per_device_eval_batch_size 16 \
  --p_max_len 512 \
  --dataset_name Tevatron/beir-corpus:scifact \
  --encoded_save_path beir_embedding_scifact/corpus_scifact.${s}.pkl \
  --encode_num_shard 4 \
  --encode_shard_index ${s} &
done
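
Once the shards finish, the per-shard pickles can be merged before indexing or searching. A rough sketch, again assuming each pickle stores an (embeddings, lookup_ids) tuple, which is worth double-checking:

import glob
import pickle
import numpy as np

# Sketch: concatenate per-shard corpus embeddings into one matrix.
all_reps, all_ids = [], []
for path in sorted(glob.glob("beir_embedding_scifact/corpus_scifact.*.pkl")):
    with open(path, "rb") as f:
        reps, ids = pickle.load(f)
    all_reps.append(np.asarray(reps))
    all_ids.extend(ids)

corpus_reps = np.concatenate(all_reps, axis=0)
print(corpus_reps.shape, len(all_ids))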

Oops... thanks for the reminder... uploading the document data now.

MXueguang commented 5 months ago

Hi Xiaojie, the processed training data for document ranking is big and hard to upload. Below is a slim version, with a processed corpus and training data, but it needs a conversion step to get to the Tevatron format. https://www.dropbox.com/scl/fi/rbxa9u0dusa4g3fh8sn9j/repllama-doc-slim-corpus.jsonl?rlkey=8ddybs8xt8lq723hks0y2uhku&dl=0 https://www.dropbox.com/scl/fi/sz3oqve6tln2hird03cxv/repllama-doc-slim-train.jsonl?rlkey=t1kjx1wdxky4zjo3zglo6yxzq&dl=0
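
A rough sketch of the kind of conversion meant here, joining the slim training file against the slim corpus to rebuild full Tevatron-format examples. Every field name below (docid, query_id, query, positives, negatives) is a guess about the slim files' schema, so inspect the actual files and adjust:

import json

# Hypothetical schema: adjust the field names after inspecting the slim files.
corpus = {}
with open("repllama-doc-slim-corpus.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        corpus[doc["docid"]] = doc

with open("repllama-doc-slim-train.jsonl") as fin, open("msmarco-doc-train-tevatron.jsonl", "w") as fout:
    for line in fin:
        ex = json.loads(line)
        record = {
            "query_id": ex["query_id"],
            "query": ex["query"],
            "positive_passages": [corpus[d] for d in ex["positives"]],
            "negative_passages": [corpus[d] for d in ex["negatives"]],
        }
        fout.write(json.dumps(record) + "\n")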

sunxiaojie99 commented 5 months ago

Hi Xiaojie, the processed training data for document ranking is big and hard to upload. Below is a slim version, with a processed corpus and training data, but it needs a conversion step to get to the Tevatron format. https://www.dropbox.com/scl/fi/rbxa9u0dusa4g3fh8sn9j/repllama-doc-slim-corpus.jsonl?rlkey=8ddybs8xt8lq723hks0y2uhku&dl=0 https://www.dropbox.com/scl/fi/sz3oqve6tln2hird03cxv/repllama-doc-slim-train.jsonl?rlkey=t1kjx1wdxky4zjo3zglo6yxzq&dl=0

Ok, thanks! Actually, I think I only need the CoCondenser-MaxP hard negatives for the document ranking data to reliably reproduce the results of the paper. By the way, is the slim version obtained by sampling a smaller proportion?

MXueguang commented 5 months ago

The hard negatives should be the top-100 BM25 and top-100 CoCondenser results, but document contents are not saved in the training data, to save space.

sunxiaojie99 commented 5 months ago

The hard negatives should be the top-100 BM25 and top-100 CoCondenser results, but document contents are not saved in the training data, to save space.

Okay~ Would it be convenient to tell me the other parameters, such as the size of p?

MXueguang commented 5 months ago

Hi @sunxiaojie99, sorry I missed your latest comment. What do you mean by the size of p? The truncation size? For MS MARCO document, we truncate the document to 10 sentences, using a sliding window of 5 sentences.
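
For concreteness, a 10-sentence window with a stride of 5 sentences can be implemented roughly as below; a sketch using NLTK's sentence splitter, not the exact preprocessing script:

import nltk

# nltk.download("punkt") may be needed once before sent_tokenize works.

def doc_to_windows(text, window=10, stride=5):
    """Split a document into overlapping windows of `window` sentences, sliding by `stride`."""
    sents = nltk.sent_tokenize(text)
    starts = range(0, max(len(sents) - window + stride, 1), stride)
    return [" ".join(sents[s:s + window]) for s in starts]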

riyajatar37003 commented 2 weeks ago

ValueError: Unsupported model class DenseModel(

I am getting this error while saving a checkpoint.