huggingface / nanotron

Minimalistic large language model 3D-parallelism training

llama tests #157

Open · zzhhjjj opened this issue 5 months ago

zzhhjjj commented 5 months ago

Add end-to-end test for llama.

  1. Train a 190M-parameter llama model for 200 steps with a 1M-token batch size (tp=2, dp=4), and assert that the final loss is below the target (see the sketch after this list).
  2. Assert that examples/train_tiny_llama.sh runs successfully.
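A rough sketch of what such a test could look like, assuming a pytest harness that launches training with torchrun and parses the loss from the log output. The config path, the "lm_loss: <value>" log pattern, and the 4.0 loss threshold below are hypothetical placeholders, not values taken from the repo:

import re
import subprocess

import pytest

TARGET_LOSS = 4.0  # hypothetical threshold; the real value would come from a reference run


def last_logged_loss(log_text):
    """Return the loss of the last logged step, assuming lines like 'lm_loss: 3.21'."""
    matches = re.findall(r"lm_loss:\s*([0-9.]+)", log_text)
    assert matches, "no loss values found in the training logs"
    return float(matches[-1])


@pytest.mark.slow
def test_tiny_llama_convergence():
    # tp=2 * dp=4 -> 8 processes; 200 steps and the 1M-token batch size are set in the config file
    result = subprocess.run(
        ["torchrun", "--nproc_per_node=8", "run_train.py",
         "--config-file", "examples/config_tiny_llama.yaml"],  # hypothetical config path
        capture_output=True, text=True, timeout=3600,
    )
    assert result.returncode == 0, result.stderr
    assert last_logged_loss(result.stdout) < TARGET_LOSS


@pytest.mark.slow
def test_train_tiny_llama_script():
    # The example script itself should exit cleanly.
    result = subprocess.run(["bash", "examples/train_tiny_llama.sh"],
                            capture_output=True, text=True, timeout=3600)
    assert result.returncode == 0, result.stderr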
NouamaneTazi commented 4 months ago

We can automatically disable FlashAttention on older hardware:

import torch
def supports_flash_attention(device_id):
    """Check if a GPU supports FlashAttention."""
    major, minor = torch.cuda.get_device_capability(device_id)

    # FlashAttention requires Ampere/Ada (SM 8.x) or Hopper (SM 9.0) GPUs
    is_sm8x = major == 8
    is_sm90 = major == 9 and minor == 0

    return is_sm8x or is_sm90
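
Such a check could then be used to pick the attention implementation at model-construction time; the backend names below are illustrative placeholders rather than nanotron's actual config keys:

# Hypothetical usage: fall back to a non-flash backend on pre-Ampere GPUs
device_id = torch.cuda.current_device()
attn_backend = "flash_attention_2" if supports_flash_attention(device_id) else "eager"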