BDHU opened this issue 2 years ago
@BDHU can you try to run a simple MPI example before starting the Triton server, to make sure MPI works as expected? It could be the case that pmi2 doesn't work with your Slurm setup, or you could try pmix instead.
You can check this by running:
srun --mpi=list
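A concrete sanity check might look like the following (a sketch; the ring_c.c path and the task counts are placeholders for your cluster):
# compile any trivial MPI program, e.g. the ring_c.c example that ships with Open MPI (path is a placeholder)
mpicc examples/ring_c.c -o ring_c
# run it across both nodes; if this hangs or aborts, the problem is in Slurm/PMI rather than in Triton
srun --mpi=pmix -N 2 --ntasks-per-node=2 ./ring_c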
Moreover, your Triton server seems to be based on an older version of ft_triton_backend (Invalid configuration argument 'is_half': stoi). You may need to update the Triton backend .so by setting CMD = "cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer; /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
@PerkzZheng Thanks for the reply. I did try using pmix, and I used the ring_c.c example from the ompi repo. I am able to run that program successfully on two nodes with: srun --mpi=pmix -N 2 ./a.out
Running srun --mpi=list shows:
MPI plugin types are...
pmix
cray_shasta
pmi2
none
specific pmix plugin versions available: pmix_v4
However, even with pmix the following error persists:
[node5:45239] *** An error occurred in MPI_Bcast
[node5:45239] *** reported by process [570632255,1]
[node5:45239] *** on communicator MPI_COMM_WORLD
[node5:45239] *** MPI_ERR_TRUNCATE: message truncated
[node5:45239] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:45239] *** and potentially your MPI job)
slurmstepd: error: *** STEP 173.0 ON node4 CANCELLED AT 2022-10-13T23:00:52 ***
@BDHU can you share the config.pbtxt and the config.ini (generated when converting the checkpoint)? Both files are in the /${workspace}/all_models/gptj/fastertransformer directory.
Here is the config.pbtxt:
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "2"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "GPT-J"
}
}
parameters {
key: "model_checkpoint_path"
value: {
#string_value: "/data/models/GPT-J/EleutherAI/gptj-model/c-model/4-gpu/"
string_value: "/workspace/all_models/gptj/fastertransformer/1/4-gpu/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
And here is the config.ini:
[gptj]
model_name = gptj-6B
head_num = 16
size_per_head = 256
inter_size = 16384
num_layer = 28
rotary_embedding = 64
vocab_size = 50400
start_id = 50256
end_id = 50256
weight_data_type = fp32
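As a sanity note on these configs: tensor_para_size = 4 times pipeline_para_size = 2 means the model expects 8 GPUs in total (two nodes with four GPUs each), and head_num = 16 with size_per_head = 256 gives the 4096 hidden size of GPT-J 6B, so the converted checkpoint itself looks consistent. A quick, purely illustrative way to confirm the Slurm allocation really exposes four GPUs on each node:
# placeholder check: list the GPUs visible on each of the two allocated nodes
srun -N 2 --ntasks-per-node=1 nvidia-smi -L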
so you have 4 GPUs for each node?
That's correct, 4 V100s on each node.
Can you share the full logs (as attachments)? I don't see any noticeable clues in the log above.
Here's the log from running srun: inference_server.log
I've also attached the slurmd.log files from both nodes just in case:
Per your instruction on rebuilding fastertransformer_backend, it seems like after rebuilding the error is now related to NCCL:
I suspect it has something to do with the way the two nodes are connected? Since I only use TCP between these two nodes, maybe NCCL is not compatible with TCP?
I also tried to change the network interface for NCCL using export NCCL_SOCKET_IFNAME=eno4 (the TCP interface), which creates a new error:
I guess the problem has something to do with the cross-node communication? Perhaps there is a way to specify that in config.pbtxt?
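(For what it's worth, NCCL does support plain TCP sockets, and as far as I know its transport and interface selection is controlled through environment variables rather than anything in config.pbtxt. A sketch of settings sometimes used on TCP-only clusters, with the interface name as a placeholder:)
export NCCL_DEBUG=INFO          # print which transport and interface NCCL selects at init time
export NCCL_IB_DISABLE=1        # skip InfiniBand and fall back to the socket (TCP) transport
export NCCL_SOCKET_IFNAME=eno4  # restrict NCCL to the interface both nodes actually share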
Per your instruction on rebuilding fastertransformer_backend, it seems like after rebuilding the error is now related to NCCL: I suspect it has something to do with the way the two nodes are connected? Since I only use TCP between these two nodes, maybe NCCL is not compatible with TCP?
You can try to add NCCL_DEBUG=INFO, which will give further information, and run nccl-tests to make sure NCCL works as expected. It could be a problem when NCCL tries to create the communicator among the nodes.
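A sketch of such a check, assuming nccl-tests (https://github.com/NVIDIA/nccl-tests) has been built with MPI support and one rank is launched per GPU:
# run an all-reduce across all 8 GPUs on the two nodes; the size-sweep flags are just reasonable defaults
NCCL_DEBUG=INFO srun --mpi=pmix -N 2 --ntasks-per-node=4 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1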
I followed the tutorial provided here. I am able to run GPT-J 6B on a single node. However, when I try the multi-node inference example with the following command on two nodes:
It shows the following error in the log file:
Is there any hint on how to resolve this issue? Thanks!