The primary change here is to update the version of the `axolotl` container to correspond to the v0.4.0 release. There are also some changes directly downstream of that:
- We no longer install an older checkout of `transformers`
- Mistral no longer hangs on evaluation with `flash_attention` enabled
- We've updated the `deepspeed` config paths
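For illustration, here is roughly what the relevant keys look like in an axolotl config after these changes. This is a sketch, not an excerpt from our configs: the DeepSpeed path assumes the v0.4.0 layout where the bundled JSON files live under `deepspeed_configs/`, and the specific ZeRO stage is just an example.

```yaml
# Illustrative excerpt only -- values are assumptions, not copied from this repo's configs.
flash_attention: true                     # safe to enable again; Mistral eval no longer hangs
deepspeed: deepspeed_configs/zero2.json   # updated path to a bundled DeepSpeed config
```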
Additionally, I've made some updates to the configs that aren't strictly related to the axolotl version, but arose from the testing that I was doing:
- I've disabled `sample_packing`, which seems to be harmful on net for the medium-sized finetuning dataset we use in our demonstration (see the config sketch after this list).
- (Mostly as a result of the above) I downgraded the base GPU request to two 40 GB A100s, which are easier to get.
- I aligned the configs across the three models (mainly, this means removing quantization from Llama-2). I suspect that it's confusing to use different configs for different base models; new users could interpret that as "you train Mistral at half native precision but have to use quantization for Llama", or something similar.
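As a sketch of the aligned, non-quantized setup: the key names below are standard axolotl options, but the values are illustrative rather than copied from the three configs.

```yaml
# Shared across all three base-model configs -- illustrative values.
sample_packing: false   # disabled: net harmful on our medium-sized demo dataset
load_in_8bit: false     # no quantization anywhere, including Llama-2
load_in_4bit: false
bf16: true              # every base model trains at native half precision
```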
Finally, I updated some of the CI that I added in a previous PR:
- I removed some of the configuration changes that made the CI training "lighter weight"; now the only change is running on a truncated dataset for a single epoch, with just one evaluation at the end of the epoch (see the sketch after this list).
- I added an assertion on the validation loss. This involves some pretty hacky stuff, as I don't see any obvious way to get structured results from the axolotl outputs (without going through mlflow or wandb, which maybe would have been better).
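For reference, the CI-only overrides amount to something like the sketch below. The option names are standard axolotl keys, but the dataset path and format are hypothetical, and the truncated dataset itself is produced outside this config:

```yaml
# CI-only overrides -- a sketch; the dataset path and type are hypothetical.
datasets:
  - path: data/ci_truncated.jsonl   # small slice of the full finetuning dataset
    type: alpaca                    # assumed format, for illustration only
num_epochs: 1                       # single epoch
evals_per_epoch: 1                  # one evaluation, at the end of the epoch
```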
Despite the CI training being fairly lightweight and taking just a couple of minutes, the models it produces seem pretty good! (evaluation loss of ≈0.06 for Mistral).