microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[UPRISE] When training UPRISE, a problem happened. #238

Open zhouchang123 opened 3 days ago

zhouchang123 commented 3 days ago

Here is the config and the log info. image Here is the error message. image Everything seems correct until the first epoch ends. What is the reason for this problem?

cdxeve commented 3 days ago

The data amount is too small for validation; try increasing the data amount.

zhouchang123 commented 3 days ago

Do you mean increasing the batch size?

cdxeve commented 3 days ago
image

I mean the number of validation samples. From the figure above, your total number of data samples is only 249; try adding more samples to the scored files.
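
To check how many samples you actually have, here is a quick sketch (the file name is a placeholder for your own scored validation file, which is assumed to be a top-level JSON list):

    import json

    # Count the scored samples available for validation before launching training.
    with open("rte_scored_valid.json") as f:      # placeholder file name
        samples = json.load(f)                    # assumed to be a top-level JSON list
    print(len(samples))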

zhouchang123 commented 3 days ago

batch_size: 1
model_name: "EleutherAI/gpt-neo-2.7B"
output_train_file: ???
output_valid_file: ???
example_file: ???  # files containing the task inputs and its corresponding sampled prompt ids
task_name: ???
prompt_pool_path:
cache_dir: ???
max_length: 2048  # max seq length
generate_max_len: 100  # max length to be generated

dataset_reader:
  _target_: src.dataset_readers.scorer_dsr.ScorerDatasetReader
  example_file: ${example_file}
  model_name: ${model_name}
  task_name: ${task_name}
  prompt_pool_path: ${prompt_pool_path}
  cache_dir: ${cache_dir}
  max_length: ${max_length}

Above is the config in score.yaml; it seems there is no setting for the data size. Should I modify random_finder.yaml like this? image

cdxeve commented 3 days ago

[UPDATE] A more elegant way is to decrease the validation batch size to 16 in uprise/DPR/conf/train/biencoder_uprise.yaml.

image

The number of data samples for the task rte is too small. One way is to switch to another task or include data from more tasks in the JSON files; another easy way is simply to repeat the data samples in the validation JSON file.
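
For illustration, here is a minimal sketch of the "repeat the validation samples" workaround (the file names are placeholders, and the scored file is assumed to be a top-level JSON list of samples):

    import json

    SRC = "rte_valid_scored.json"      # placeholder: your scored validation file
    DST = "rte_valid_scored_x4.json"   # placeholder: the enlarged copy
    REPEAT = 4                         # repeat the sample list four times

    with open(SRC) as f:
        samples = json.load(f)         # assumed to be a top-level JSON list

    with open(DST, "w") as f:
        json.dump(samples * REPEAT, f)

    print(f"{len(samples)} samples -> {len(samples) * REPEAT} samples")

Then point the validation file path in your config at the enlarged copy.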

zhouchang123 commented 3 days ago

image If I just repeat the data, it still goes wrong.

cdxeve commented 3 days ago
image
zhouchang123 commented 3 days ago

image The problem still happens. Why are ctx_vectors and q_vectors both 0?

cdxeve commented 2 days ago

Can you show me the full log? It seems that the file is empty?

zhouchang123 commented 2 days ago

train.log Please download the log file and help me look into the error.

    with torch.no_grad():
        q_dense, ctx_dense = self.biencoder(
            q_ids,
            q_segments,
            q_attn_mask,
            ctx_ids_batch,
            ctx_seg_batch,
            ctx_attn_mask,
            encoder_type=encoder_type,
            representation_token_pos=rep_positions,
        )

    if q_dense is not None:
        q_represenations.extend(q_dense.cpu().split(1, dim=0))

    ctx_represenations.extend(ctx_dense.cpu().split(1, dim=0))

batch_positive_idxs = biencoder_input.is_positive
positive_idx_per_question.extend(
    [total_ctxs + v for v in batch_positive_idxs]
)

if (i + 1) % log_result_step == 0:
    logger.info(
        "Av.rank validation: step %d, computed ctx_vectors %d, q_vectors %d",
        i,
        len(ctx_represenations),
        len(q_represenations),
    )

This code is in train_dense_encoder.py, lines 395-423. It seems ctx_dense is None?

cdxeve commented 2 days ago

I wonder if there is an error with your parallel settings again. Did you add CUDA_VISIBLE_DEVICES='xxx' before your running command? It seems there are 3 GPUs, but the world size is set to 1. Possibly, some GPUs are not receiving any data input for validation, so the vector count is 0. I suggest using only one GPU by setting CUDA_VISIBLE_DEVICES='0' throughout the process to avoid such errors.

image
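
If you prefer, the same single-GPU pinning can also be done from Python, as long as it happens before torch initializes CUDA (a minimal sketch; the usual way is simply to prefix the training command, e.g. CUDA_VISIBLE_DEVICES=0 python DPR/train_dense_encoder.py ...):

    import os

    # Expose only GPU 0 to this process; must be set before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import torch

    print(torch.cuda.device_count())  # expected to print 1 on a multi-GPU machine
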
zhouchang123 commented 2 days ago

After setting CUDA_VISIBLE_DEVICES='0' and modifying gpu_ids: 0, a new error occurred. newtrain.log Maybe the error happened here:

File "DPR/train_dense_encoder.py", line 396, in validate_average_rank
    q_dense, ctx_dense = self.biencoder(
zhouchang123 commented 2 days ago

After adding the code under,I found that all the tensor is on cpu and the model is on cuda:0. What's the reason of this phenomenon? In train_dense_encoder.py When init an object of class BiEncoderTrainer,the variables do not move to cuda. It seems only model and optimizer moved to cuda. image
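
Roughly, the kind of check I added looks like this (a simplified, self-contained sketch; check_devices and the stand-in tensors are placeholders, not the exact code in the screenshot):

    import torch

    def check_devices(model: torch.nn.Module, **tensors: torch.Tensor) -> None:
        # Print where the model parameters and each input tensor live.
        print("model:", next(model.parameters()).device)
        for name, t in tensors.items():
            print(f"{name}:", t.device)

    if __name__ == "__main__":
        model = torch.nn.Linear(4, 4)   # stand-in for the biencoder
        q_ids = torch.zeros(2, 4)       # stand-in for the question batch
        ctx_ids = torch.zeros(2, 4)     # stand-in for the context batch
        check_devices(model, q_ids=q_ids, ctx_ids=ctx_ids)

In my run, the model reports cuda:0 while the batch tensors report cpu, so the batch does not seem to be moved to the GPU before the forward pass.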

@cdxeve