zhouchang123 opened this issue 3 days ago
The amount of data is too small for validation; try increasing it.
Do you mean increasing the batch size?
I mean the number of validation samples. From the figure above, your total number of data samples is only 249; try adding more samples to the scored files.
```yaml
batch_size: 1
model_name: "EleutherAI/gpt-neo-2.7B"
output_train_file: ???
output_valid_file: ???
example_file: ???  # files containing the task inputs and their corresponding sampled prompt ids
task_name: ???
prompt_pool_path:
cache_dir: ???
max_length: 2048       # max seq length
generate_max_len: 100  # max length to be generated

dataset_reader:
  target: src.dataset_readers.scorer_dsr.ScorerDatasetReader
  example_file: ${example_file}
  model_name: ${model_name}
  task_name: ${task_name}
  prompt_pool_path: ${prompt_pool_path}
  cache_dir: ${cache_dir}
  max_length: ${max_length}
```
Above is the config in score.yaml; it seems there is no setting related to the data size.
Should I modify random_finder.yaml like this?
[UPDATE] A more elegant way is to decrease the validation batch size to 16 at uprise/DPR/conf/train/biencoder_uprise.yaml
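For reference, this is roughly what that change could look like; the key name `val_av_rank_bsz` is an assumption here, taken from the upstream DPR training configs rather than from this repository:

```yaml
# uprise/DPR/conf/train/biencoder_uprise.yaml (sketch; key name assumed
# from the upstream DPR training config)
val_av_rank_bsz: 16  # batch size used during average-rank validation
```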
The number of data samples for the task rte is too small. One way is to switch to another task or to include more tasks' data in the json files; another easy way is simply to repeat the data samples in the validation json file.
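Repeating the samples can be scripted. Below is a minimal sketch that assumes the validation file is a top-level JSON list of samples; the helper name and the file layout are hypothetical, not taken from the repository:

```python
def repeat_samples(samples: list, min_count: int) -> list:
    """Repeat a small sample list until it has at least min_count entries.

    Hypothetical helper for padding a too-small validation set; the
    top-level-list layout of the json file is an assumption.
    """
    if not samples:
        return samples
    # Tile the list enough times, then trim to the requested size.
    repeated = samples * (min_count // len(samples) + 1)
    return repeated[:min_count]

# Example: pad 3 samples up to 249 to mimic a larger validation set.
padded = repeat_samples([{"id": i} for i in range(3)], 249)
print(len(padded))  # 249
```

The padded list can then be dumped back to the validation json file with `json.dump`.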
If I just repeat the data, it still goes wrong.
The problem still happens.
Why are ctx_vectors 0 and q_vectors 0?
Can you show me the full log? It seems the file is empty.
train.log
Please download the log file and help look into the error.
```python
with torch.no_grad():
    q_dense, ctx_dense = self.biencoder(
        q_ids,
        q_segments,
        q_attn_mask,
        ctx_ids_batch,
        ctx_seg_batch,
        ctx_attn_mask,
        encoder_type=encoder_type,
        representation_token_pos=rep_positions,
    )

if q_dense is not None:
    q_represenations.extend(q_dense.cpu().split(1, dim=0))
ctx_represenations.extend(ctx_dense.cpu().split(1, dim=0))

batch_positive_idxs = biencoder_input.is_positive
positive_idx_per_question.extend(
    [total_ctxs + v for v in batch_positive_idxs]
)

if (i + 1) % log_result_step == 0:
    logger.info(
        "Av.rank validation: step %d, computed ctx_vectors %d, q_vectors %d",
        i,
        len(ctx_represenations),
        len(q_represenations),
    )
```
This code is in train_dense_encoder.py, lines 395-423. It seems ctx_dense is None?
I wonder if there is an error with your parallel settings again. Did you add CUDA_VISIBLE_DEVICES='xxx' before your running command? It seems there are 3 GPUs, but the world size is set to 1. Possibly, some GPUs are not receiving any data input for validation, so the vector count is 0. I suggest using only one GPU by setting CUDA_VISIBLE_DEVICES='0' throughout the process to avoid such errors.
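The single-GPU suggestion can also be applied from inside a launcher script rather than on the command line; this is a sketch of the same idea, and it must run before torch initializes CUDA:

```python
import os

# Sketch: restrict the process to one GPU before any CUDA initialization,
# mirroring the suggested CUDA_VISIBLE_DEVICES='0' prefix on the command.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

Setting the variable after CUDA has initialized has no effect, which is why prefixing the command itself is the safer habit.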
After setting CUDA_VISIBLE_DEVICES='0' and modifying gpu_ids: 0, a new error occurred.
newtrain.log
Maybe the error happened here?
```
File "DPR/train_dense_encoder.py", line 396, in validate_average_rank
    q_dense, ctx_dense = self.biencoder(
```
After adding the code below, I found that all the tensors are on the CPU while the model is on cuda:0.
What is the reason for this?
In train_dense_encoder.py, when an object of class BiEncoderTrainer is initialized, the input variables are not moved to CUDA.
It seems only the model and the optimizer are moved to CUDA.
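The usual fix for "tensors on cpu, model on cuda:0" is to move each input tensor onto the model's device before the forward pass. A minimal sketch follows; the helper name and the batch layout are illustrative, not taken from the DPR codebase:

```python
import torch
import torch.nn as nn

def move_batch_to_model_device(model: nn.Module, batch: dict) -> dict:
    """Move every tensor in a batch dict onto the model's device.

    Hypothetical helper: reads the device from the model's parameters and
    calls .to(device) on each tensor, leaving non-tensor values untouched.
    """
    device = next(model.parameters()).device
    return {k: v.to(device) if torch.is_tensor(v) else v
            for k, v in batch.items()}

model = nn.Linear(4, 2)  # stays on CPU here; on a GPU box it could be .cuda()
batch = {"q_ids": torch.zeros(1, 4), "label": 0}
moved = move_batch_to_model_device(model, batch)
print(moved["q_ids"].device.type)
```

Calling such a helper on each validation batch before `self.biencoder(...)` would keep inputs and model on the same device.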
@cdxeve
Here is the config and the log info.
Here is the error message.
Everything seems correct until the end of the first epoch.
What is the reason for this problem?