Closed dlwh closed 2 months ago
It would be great if a sample configuration for mistral-7b were made available for loading from a local checkpoint. I still note that HF is referred to for loading the checkpoint, but that's probably something to do with my configuration. What did I try? In the mistral-7b configuration:

1. `initialize_from: /path/to/local/checkpoint`, which results in the following error:

   ```
   draccus.utils.DecodingError: The fields 'initialize_from' are not valid for TrainLmConfig
   ```

2. `initialize_from_hf: /path/to/local/checkpoint`, which tries to fetch the checkpoint from HF.

Any advice, please?
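For anyone hitting the same `DecodingError`: `initialize_from` is not a field on `TrainLmConfig`; the key used in Levanter's sample configs is `initialize_from_hf`, which can also point at a local directory. A minimal sketch based on this thread (the path is a placeholder, and the exact layout of your config may differ):

```yaml
model:
  type: mistral
# A local directory works here too, provided it contains the weight files
# the loader recognizes (e.g. model.safetensors.index.json) and a config.json.
initialize_from_hf: "/path/to/local/checkpoint"
use_hf_model_config: true
```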
Does `/path/to/local/checkpoint/model.safetensors.index.json` not exist?
Ahhh, the checkpoint I have doesn't contain `model.safetensors.index.json`; it's a plain PyTorch checkpoint. I assumed there might be a conversion procedure in place. Could you please point me to the checkpoint format Mistral expects locally, if possible?
PyTorch checkpoints should work too, as long as PyTorch is installed and `pytorch_model.bin.index.json` or `pytorch_model.bin` exists.
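To make the lookup order concrete, here is a small sketch of the kind of file resolution described above: prefer safetensors, then fall back to plain PyTorch weights. This is an illustration inferred from this thread, not Levanter's actual loader code, and the candidate list is an assumption:

```python
import os
from typing import Optional

# Assumed lookup order, based on the file names mentioned in this thread.
CANDIDATES = [
    "model.safetensors.index.json",  # sharded safetensors checkpoint
    "model.safetensors",             # single-file safetensors checkpoint
    "pytorch_model.bin.index.json",  # sharded PyTorch checkpoint (needs torch)
    "pytorch_model.bin",             # single-file PyTorch checkpoint (needs torch)
]

def find_checkpoint_file(ckpt_dir: str) -> Optional[str]:
    """Return the first recognized weight file in ckpt_dir, or None."""
    for name in CANDIDATES:
        path = os.path.join(ckpt_dir, name)
        if os.path.exists(path):
            return path
    return None
```

Running this against your local checkpoint directory is a quick way to see whether the loader has anything it can pick up before falling back to the HF Hub.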
Yeah, I have PyTorch installed and tried again; the following files are available in the checkpoint.
It still tries to refer to the HF website to get the checkpoint. This is my configuration:
```yaml
data:
  train_urls: ["/pth/to/train/json"]
  validation_urls: ["/pth/to/val/json/"]
  tokenizer: "/local/mistral-7b/checkpoint"
model:
  type: mistral
initialize_from_hf: "/local/mistral-7b/checkpoint"
use_hf_model_config: true
trainer:
  wandb:
    project: "levanter"
    tags: ["openwebtext", "mistral"]
  mp: p=f32,c=bfloat16
  train_batch_size: 256  # set for v4-64 TPU
  num_train_steps: 1000
  steps_per_eval: 50
  tensor_parallel_axes: ["mlp", "heads"]
  fsdp_axis: "embed"
  batch_axis: "batch"
optimizer:
  learning_rate: 1.2E-5  # set low for fine-tuning
  weight_decay: 0.1
  min_lr_ratio: 0.1
```
So we should actually respect the config in the HF checkpoint directory, since that's usually right. A bunch of tests were broken, but we didn't notice because they all need torch, and we don't install torch in CI... (I guess we should.)
Fixes #681 (I think).