ianberman opened this issue 7 months ago (status: Open)
I've usually set up a separate conda environment for running Levanter. That said, if you want to run both codebases in the same environment, I would start by installing Levanter & then try to get the anticipation repo working with Levanter's versions of requirements. In particular, I think it should be safe to switch to transformers >= 4.29.2 in the anticipation repo's requirements.txt. Let me know if there are any surprises running anticipation with the newer versions of libraries required by Levanter & I'll be happy to take a look.
Re: Issue #7, I tried running Levanter just now & realize that things have changed a bit since I made those comments: see my latest comment on Issue #7 for instructions on getting a music model running with the latest version of Levanter. I'm working with the Levanter team & hopeful that we'll be able to avoid needing any of these hacks in the near future.
Thank you for the quick and helpful reply! It didn't occur to me I could just use a separate environment for levanter 😅
Now I am largely up and running using your suggested finetune.yaml file, but it seems that levanter expects a safetensors file and stanford-crfm/music-medium-800k is a pytorch_model.bin file?
edit: I tried converting it to safetensors myself but am getting the error: AttributeError: 'NoneType' object has no attribute 'data_ptr'
edit 2: Related issue -- I'm following along with the colab example in my local environment, and noticed that the large model being in safetensors format becomes an issue:
LARGE_MODEL = 'stanford-crfm/music-large-800k'
model = AutoModelForCausalLM.from_pretrained(LARGE_MODEL).cuda()
error:
OSError: stanford-crfm/music-large-800k does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
Frustrating. This change in checkpoint format happened between when I trained the smaller models and the new large model I just released. In my own testing, it looked like the old .bin and new .safetensors checkpoints were largely substitutable for one another. But clearly there can be some issues.
it seems that levanter expects a safetensors file and stanford-crfm/music-medium-800k is a pytorch_model.bin file?
Oh: you just need to install torch, e.g., pip install torch (it's not one of the default requirements for Levanter, but it's used to load checkpoints).
Re: the error loading stanford-crfm/music-large-800k. This is indeed stored in the new .safetensors format (the older models are stored using the old .bin format). My first thought is that the pegged version transformers 4.29.2 may be too old to support safetensors checkpoints. It seemed to be working in colab, but possibly that environment is ignoring the version request & installing something newer? Upgrading your local transformers library might fix this?
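A quick way to see which transformers the environment actually resolved, using only the standard library (so it works even if importing transformers itself fails); the 4.30.0 threshold for reliable safetensors loading is a guess, not a verified boundary:

```python
# Check the resolved transformers version without importing the package
# (stdlib only). The 4.30.0 threshold for safetensors support is an
# assumption, not a verified boundary.
from importlib.metadata import PackageNotFoundError, version

def has_min_version(pkg: str, minimum: str) -> bool:
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return False
    # Compare dotted release numbers numerically, not lexically,
    # so that "4.9" < "4.30".
    def parse(s):
        return tuple(int(p) for p in s.split(".")[:3] if p.isdigit())
    return parse(installed) >= parse(minimum)

print(has_min_version("transformers", "4.30.0"))
```

If this prints False in the local environment but the colab notebook works, that would support the theory that colab quietly installed something newer.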
Cool, thank you. I didn't realize it was as easy as installing torch!
If it helps anyone else, I had to install everything (jax, torch, etc) for cuda 11.8; the jax for cuda 12 is too recent for torch.
I now have fine-tuning the medium model up and running on my 3090. However, it's quite slow - about 40s/it with 512 batch size; half that with 256.
I'm getting the warning /mnt/c/Users/Ian/GitHub/levanter/src/levanter/models/attention.py:89: UserWarning: transformer_engine is not installed. Please install it to use NVIDIA's optimized fused attention. Falling back to the reference implementation.
However, I can enter Python and import transformer_engine:
(levanter) ian@DESKTOP-MDE3NSV:~$ python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformer_engine
/home/ian/miniconda3/envs/levanter/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
>>>
Maybe this is a Levanter issue; let me know if I should file an issue over there, or if I am overlooking something.
I now have fine-tuning the medium model up and running on my 3090. However, it's quite slow - about 40s/it with 512 batch size; half that with 256.
You could definitely try running with a smaller batch size! The batch size of 512 was copied from my pre-training configuration and is likely larger (maybe much larger) than it needs to be for effective finetuning. Lots of people have success fine-tuning language and image models with small batch sizes & I would expect similar results for fine-tuning these music models.
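Concretely, that likely just means lowering one value in the config - a sketch, assuming the trainer section follows Levanter's usual field names (an assumption; check your finetune.yaml for the actual key):

```yaml
# Hypothetical excerpt of finetune.yaml: a smaller batch for finetuning.
trainer:
  train_batch_size: 64
```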
If it helps anyone else, I had to install everything (jax, torch, etc) for cuda 11.8; the jax for cuda 12 is too recent for torch.
It should be possible to get things running on cuda 12 (I'm running everything on cuda 12 on linux), but maybe there are some additional complexities to running things in wsl.
Maybe this is a Levanter issue; let me know if I should file an issue over there, or if I am overlooking something.
Yeah, I don't know about the transformer_engine details; that's all internal to Levanter.
Thanks so much for your help with all of this!
It should be possible to get things running on cuda 12 (I'm running everything on cuda 12 on linux), but maybe there are some additional complexities to running things in wsl.
Just to follow up on this: my issue was that installing torch with its bundled cuda 12.1 conflicted with jax (which wanted, I think, cuda 12.3), and then it couldn't train with cuda at all, so I just installed the cuda 11.8 builds of both jax and torch. I tried using an older version of jax built against cuda 12.1 (jax==0.4.16) but ran into issues doing this: ImportError: cannot import name 'DTypeLike' from 'jax.typing' (/usr/local/lib/python3.10/dist-packages/jax/typing.py)
Anyway, I've been trying fine-tuning runs with a batch size of 64 or 32 just to test it out, and I get as far as saving the second checkpoint. However, I then get a crash with the following output. I tried in a wsl environment and in a docker environment (with the wsl backend) installed per Levanter's instructions - same issue.
I tried two different runs, and both crash when saving the 2nd checkpoint. It is able to write a bunch of files to the target directory the first time around, including the .zarray file, but I don't see a lock file. The .zarray file contains the following text:
{"chunks":[],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"<i4","fill_value":null,"filters":null,"order":"C","shape":[],"zarr_format":2}
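For what it's worth, that metadata describes a zero-dimensional array. A sketch decoding it with only the standard library; the guess that it holds a single counter such as the training step is an assumption:

```python
# Decode the zarr v2 array metadata quoted above (stdlib only).
import json

zarray = json.loads(
    '{"chunks":[],"compressor":{"id":"zstd","level":1},'
    '"dimension_separator":".","dtype":"<i4","fill_value":null,'
    '"filters":null,"order":"C","shape":[],"zarr_format":2}'
)

# shape [] means a zero-dimensional (scalar) array: most likely a single
# little-endian int32 ("<i4") value, compressed with zstd level 1.
print(zarray["shape"], zarray["dtype"], zarray["compressor"]["id"])
```

So the file itself looks well-formed; the crash is probably not caused by this particular metadata.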
Lastly, with regards to loading the large model: I'm using a separate levanter conda environment now, which I believe uses a recent transformers, so it's no trouble opening a safetensors file. However, I'm getting the following error related to configuration - maybe I just need to change something in the finetune.yaml for this to work?
Lastly, with regards to loading the large model [...] maybe I just need to change something in the finetune.yaml for this to work?
This I can help with: yes, you need to change the finetune.yaml, which is currently configured to describe the architecture of the medium model. For the large model, you'll need to update the following values in the config:
hidden_dim: 1280
num_heads: 20
num_layers: 36
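A sketch of how that might look in finetune.yaml, assuming the model section follows Levanter's gpt2 config conventions (the type field and surrounding layout are assumptions; the three values are from this thread):

```yaml
model:
  type: gpt2
  hidden_dim: 1280
  num_heads: 20
  num_layers: 36
```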
I get as far as saving the second checkpoint. However then I get a crash with the following output.
We are getting into questions here that might be better addressed on the Levanter issues board. I think you might be the first person to ever try running Levanter on wsl!
This error looks like it might be downstream of an out-of-memory issue:
(raylet) [2024-03-24 16:16:13,409 E 10547 10547] (raylet) node_manager.cc:2967: 14 Workers (tasks / actors) killed due to memory pressure (OOM)
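A rough back-of-envelope for why memory pressure is plausible here, using the large-model dimensions given earlier in the thread and the standard 12·L·h² rule of thumb; this ignores embeddings and layer norms, and activation memory (which grows with batch size) comes on top:

```python
# Back-of-envelope parameter/memory estimate. Dimensions are the
# large-model values quoted earlier in this thread; the 12*L*h^2 rule of
# thumb ignores embeddings and layer norms, and activation memory
# (which grows with batch size) comes on top of this.
hidden_dim, num_layers = 1280, 36

params = 12 * num_layers * hidden_dim ** 2     # ~708M parameters
bytes_fp32 = 4 * params                        # fp32 weights
bytes_adam = 8 * params                        # Adam first/second moments

print(f"~{params / 1e6:.0f}M params, "
      f"~{(bytes_fp32 + bytes_adam) / 2**30:.1f} GiB before activations")
```

The medium model is smaller than this, but on a 24 GB card plus wsl's host-memory limits, large batches can still push the ray workers over the edge.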
Joining a bit late, but what would be the appropriate torch version to use?
It was running fine for me until yesterday just by running pip install torch, but suddenly that causes a CuDNN incompatibility error:
Loaded runtime CuDNN library: 9.1.0 but source was compiled with: 9.2.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
EDIT: As of the day of this comment it works fine with torch==2.3.0
Hello,
I created a new conda environment in wsl and proceeded to install levanter and anticipation from source. I was able to adapt and run the tokenization scripts for my needs and produced what seem to be non-empty tokenized files to use for fine-tuning.
However, when I install from requirements.txt after I install levanter, I get the following warning: levanter 1.1 requires transformers>=4.32.0, but you have transformers 4.29.2 which is incompatible.
If I proceed with training anyway, I get the following error:
This seems to indicate MistralConfig is not in the transformers package 4.29.2, so I tried to update the transformers package to several newer versions such as 4.34 and 4.35, but this began to create dependency conflicts between tokenizers, transformers and huggingface-hub, so I then just used the latest transformers 4.39, which seems to be OK dependency-wise between those 3 packages. When I then try training with
python -m levanter.main.train_lm --config_path ./config/finetune.yaml
I get the following error:

Lastly, I wanted to mention that in your reply here https://github.com/jthickstun/anticipation/issues/7#issuecomment-1837565701 the code for the commenting-out "hack" in the latest levanter main is a bit different from the one you linked to - not sure why. The latest version seems to be here: https://github.com/stanford-crfm/levanter/blob/43712f12bddf8e2827783b6276d8d5373563d866/src/levanter/main/train_lm.py#L63
Should I check out a past version of levanter? I wasn't exactly sure from the instructions and dates you wrote - your comment seems to postdate the latest commit to levanter.
Anyway, my hunch is that it's related to the levanter version or the transformers version incompatibility, but maybe something broke in the tokenization step? Here is my finetune.yaml also. Any thoughts appreciated :)