Closed yanghou2000 closed 1 month ago
my_checkpoint_path = "/repo/timesfm_model/checkpoints"
tfm.load_from_checkpoint(checkpoint_path=my_checkpoint_path)
is the correct way to call. One "checkpoint" is the set of everything under checkpoint_1100000, and here we point to the parent directory.
Quick question, can you run model inference despite the error message in (1), or is it interrupting?
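For anyone unsure which directory to pass, here is a minimal sketch of the rule described above: start from any path inside the download and walk up until the `checkpoint_<step>` folder is found, then return its parent. The helper name `resolve_checkpoint_root` is hypothetical and not part of timesfm; it only manipulates the path string and does not touch the filesystem.

```python
from pathlib import PurePosixPath


def resolve_checkpoint_root(path: str) -> PurePosixPath:
    """Return the directory that contains a `checkpoint_<step>` folder.

    Hypothetical helper (not a timesfm API): pure path manipulation only.
    """
    p = PurePosixPath(path)
    # Check the path itself, then each ancestor, nearest first.
    for candidate in (p, *p.parents):
        if candidate.name.startswith("checkpoint_"):
            # The parent of `checkpoint_<step>` is what should be passed
            # as checkpoint_path, per the reply above.
            return candidate.parent
    # No `checkpoint_<step>` component: assume the path is already the root.
    return p


root = resolve_checkpoint_root(
    "/repo/timesfm_model/checkpoints/checkpoint_1100000/state/checkpoint"
)
print(root)  # /repo/timesfm_model/checkpoints
```

The result can then be passed as `checkpoint_path=str(root)` to `tfm.load_from_checkpoint`.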
Thank you for your swift reply! After switching to the parent directory, model inference runs once I request more memory via SBATCH when submitting the Slurm job. In other words, the earlier error in (1) was caused by running out of memory, not by a bug in the code.
Let me summarize and close this issue for now.
My working example code is below:
import timesfm

# Load timesfm model
tfm = timesfm.TimesFm(
    context_len=480,
    horizon_len=14,
    input_patch_len=32,  # fixed
    output_patch_len=128,  # fixed
    num_layers=20,  # fixed
    model_dims=1280,  # fixed
    backend="cpu",
)
# Point to the parent directory of checkpoint_1100000, not to a file inside it
tfm.load_from_checkpoint(checkpoint_path="/repo/timesfm_model/checkpoints")
The checkpoint path should point to the parent directory /repo/timesfm_model/checkpoints instead of pointing to the checkpoint model file /repo/timesfm_model/checkpoints/checkpoint_1100000/state/checkpoint
ERROR:absl:For checkpoint version > 1.0, we require users to provide
`train_state_unpadded_shape_dtype_struct` during checkpoint
saving/restoring, to avoid potential silent bugs when loading
checkpoints to incompatible unpadded shapes of TrainState.
I also encountered the same problem, and I also made sure that the directory was correct, but it still prompted this error:
[*********************100%%**********************] 1 of 1 completed
Constructing model weights.
WARNING:absl:No registered CheckpointArgs found for handler type: <class 'paxml.checkpoints.FlaxCheckpointHandler'>
WARNING:absl:Configured `CheckpointManager` using deprecated legacy API. Please follow the instructions at https://orbax.readthedocs.io/en/latest/api_refactor.html to migrate by May 1st, 2024.
WARNING:absl:train_state_unpadded_shape_dtype_struct is not provided. We assume `train_state` is unpadded.
Constructed model weights in 2.20 seconds.
Restoring checkpoint from /home/wd/PycharmProjects/timesfm-1.0-200m/checkpoints/.
Restored checkpoint in 4.69 seconds.
Jitting decoding.
ERROR:absl:For checkpoint version > 1.0, we require users to provide
`train_state_unpadded_shape_dtype_struct` during checkpoint
saving/restoring, to avoid potential silent bugs when loading
checkpoints to incompatible unpadded shapes of TrainState.
Process finished with exit code 137 (interrupted by signal 9:SIGKILL)
WARNING:absl:No registered CheckpointArgs found for handler type: <class 'paxml.checkpoints.FlaxCheckpointHandler'>
WARNING:absl:Configured CheckpointManager using deprecated legacy API. Please follow the instructions at https://orbax.readthedocs.io/en/latest/api_refactor.html to migrate by May 1st, 2024.
WARNING:absl:train_state_unpadded_shape_dtype_struct is not provided. We assume train_state is unpadded.
ERROR:absl:For checkpoint version > 1.0, we require users to provide train_state_unpadded_shape_dtype_struct during checkpoint saving/restoring, to avoid potential silent bugs when loading checkpoints to incompatible unpadded shapes of TrainState.
Restored checkpoint in 0.75 seconds.
Jitting decoding.
Killed
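What is causing this error message? My orbax-checkpoint version is 0.5.9.
Everyone, I found the cause. WSL did not have enough memory; WSL's memory needs to be raised to 16 GB or more.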
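A quick way to see how much memory the environment actually has before loading the model is sketched below. The 16 GiB threshold is only the figure reported in this thread for WSL, not an official timesfm requirement, and the function names are illustrative. Note that `os.sysconf` reports the machine's (or WSL VM's) total physical memory; under Slurm the job's allocation may be smaller than this number.

```python
import os


def total_ram_gib() -> float:
    """Total physical memory in GiB, as seen by the OS (Linux, incl. WSL)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / 2**30


def has_enough_ram(total_gib: float, required_gib: float = 16.0) -> bool:
    """Compare against a threshold; 16 GiB is an assumption from this thread."""
    return total_gib >= required_gib


ram = total_ram_gib()
if not has_enough_ram(ram):
    print(f"Only {ram:.1f} GiB of RAM visible; loading may be killed.")
else:
    print(f"{ram:.1f} GiB of RAM visible.")
```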
I think this is caused by a lack of memory. Try again after giving your program more memory.
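The exit code in the log supports the memory explanation: shells report a process terminated by signal N as exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends. SIGKILL can come from other sources too, so treat this as a hint rather than proof. A small sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status; values above 128 encode a signal."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```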
Background
Linux x86, timesfm CPU version, with jobs submitted through Slurm. I have already confirmed that the conda env is activated after the SBATCH directives and before the Python code runs.
Code that ran into error
Description
I downloaded the model checkpoint from a Hugging Face mirror website and stored it at this path:
/repo/timesfm_model/checkpoints/checkpoint_1100000/state/checkpoint
I'm not sure what the right path is to pass as checkpoint_path in tfm.load_from_checkpoint(checkpoint_path=my_checkpoint_path).
The question I want to ask is: what should my_checkpoint_path be in my case? I tried all the possible choices and none of them worked; the error messages are shown below.
Error message
When I use
the corresponding error message is:
When I use
the error message is:
When I use
the error message is: