tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

CUDA out of memory when Train llama-7b-hf single gpu #218

Open EnzoDeg40 opened 1 year ago

EnzoDeg40 commented 1 year ago

Hi, I am trying to train the llama-7b-hf model on a single GPU. I tried reducing some parameters, but I don't know whether my settings are any better.

Components of my PC:

Command executed:

torchrun --nproc_per_node=1 --master_port=8888 train.py \
    --model_name_or_path /var/llama/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir out/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 774.00 MiB (GPU 0; 11.76 GiB total capacity; 10.58 GiB already allocated; 697.94 MiB free; 10.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 496499) of binary: /usr/bin/python3
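As the error message itself suggests, setting `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` can reduce allocator fragmentation. A minimal sketch (the value `128` is an assumption; this only helps when reserved memory is much larger than allocated memory, and on its own it will not make a full 7B fine-tune fit in 12 GiB):

```shell
# Cap the CUDA caching allocator's block size to reduce fragmentation,
# then re-run the same training command as above.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
torchrun --nproc_per_node=1 --master_port=8888 train.py \
    --model_name_or_path /var/llama/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --per_device_train_batch_size 1 \
    ...  # remaining arguments unchanged
```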

I only started taking an interest in AI recently, so thanks in advance to anyone who can help.

Edit: If there is also a way to train on the CPU only, I am interested in that as well.
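For context, a rough back-of-the-envelope estimate (a sketch, assuming bf16 weights and gradients plus fp32 Adam optimizer moments, and ignoring activations entirely) shows why full fine-tuning of a 7B-parameter model cannot fit in an 11.76 GiB card even at batch size 1:

```python
# Rough memory estimate for full fine-tuning a 7B-parameter model.
# Assumptions: bf16 weights (2 B/param), bf16 gradients (2 B/param),
# and two fp32 Adam moments (8 B/param). Activations are not counted.
params = 7e9
weights_gib = params * 2 / 2**30   # ~13 GiB
grads_gib   = params * 2 / 2**30   # ~13 GiB
adam_gib    = params * 8 / 2**30   # ~52 GiB
total_gib = weights_gib + grads_gib + adam_gib
print(f"~{total_gib:.0f} GiB needed before activations")
```

With roughly 78 GiB required before any activation memory, techniques such as parameter-efficient fine-tuning or CPU/NVMe offloading are needed rather than allocator tuning alone.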

icordoba commented 10 months ago

Hi, did you manage to solve this issue? I'm having the same one with the same GPU card.