lucidrains / mixture-of-experts

A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
MIT License

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #5

Open · mxs30443 opened this issue 3 years ago

mxs30443 commented 3 years ago

/moe.py", line 247, in noisy_top_k_gating load = (self._prob_in_top_k(clean_logits, noisy_logits, noise_stddev, top_logits)).sum(0) RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Lewington-pitsos commented 2 months ago

You need to move the MoE module to CUDA so its parameters are on the same device as your input tensors (see the forward-pass sketch after the snippet below):

import torch
from torch import nn
from mixture_of_experts import MoE

# init the MoE (parameters start out on the CPU)
moe = MoE(
    dim = 768,
    num_experts = 32,               # increase the experts (# parameters) of your model without increasing computation
    hidden_dim = 768,               # size of hidden dimension in each expert, defaults to 4 * dimension
    activation = nn.ReLU,           # use your preferred activation, will default to GELU
    second_policy_train = 'random', # in top_2 gating, policy for whether to use a second-place expert
    second_policy_eval = 'random',  # all (always) | none (never) | threshold (if gate value > the given threshold) | random (if gate value > threshold * random_uniform(0, 1))
    second_threshold_train = 0.2,
    second_threshold_eval = 0.2,
    capacity_factor_train = 1.25,   # experts have fixed capacity per batch. we need some extra capacity in case gating is not perfectly balanced.
    capacity_factor_eval = 2.,      # capacity_factor_* should be set to a value >= 1
    loss_coef = 1e-2                # multiplier on the auxiliary expert balancing loss
)

# move the module (gating weights and all experts) to the GPU
moe.to('cuda')
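For completeness, the inputs you pass to the module also have to be on the same device, otherwise the same error comes back. A minimal forward-pass sketch, assuming the (batch, seq, dim) input shape and the (output, aux_loss) return signature from this repository's README:

# the input tensor must live on the same device as the module
inputs = torch.randn(4, 1024, 768).cuda()

out, aux_loss = moe(inputs)        # out: (4, 1024, 768), aux_loss: scalar load-balancing term
loss = out.sum() + aux_loss        # placeholder loss; add aux_loss to your real training loss
loss.backward()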