Closed hiqsociety closed 11 months ago
We don't know exactly what makes Mistral better, but likely it was just trained for longer - which is exactly what is being done with tinyllama.
This issue doesn't really make sense. Are you talking about the Mistral architecture, like sliding-window attention? The Mistral dataset is unreleased, and there are no comments on any part of it other than the 8T rumor. Without a plan or details, this just comes across as Mistral hype.
He likely means swapping Llama for Mistral, i.e. swapping the architecture and tokenizer while keeping the same project dataset.
@PhilippeFerreiraDeSousa yeah good point.
I'm not a huge fan of the Mistral architecture. My sense is the reduced attention is lossy. Other than that, it's not all that different from Llama, just a bit faster.
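For context on the "reduced attention" point: Mistral's sliding-window attention lets each token attend only to the most recent W positions instead of the full causal prefix, which is where the speedup (and the possible lossiness) comes from. A minimal sketch of the attention mask, with illustrative sequence length and window size:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to position j only if
    j <= i (causal) and i - j < window (sliding window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# With window=3, token 5 can only attend to positions 3, 4, 5,
# whereas full causal attention would allow 0 through 5.
mask = sliding_window_mask(seq_len=6, window=3)
```

Each row has at most `window` True entries, so attention cost per token stays bounded as the sequence grows; information from positions outside the window only reaches a token indirectly through stacked layers.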
@VatsaDev I just realised I like Mistral so much (I use it more than the rest of the 7Bs). I wasn't thinking clearly. I thought someone could figure out how to shrink Mistral to 1B or something.
Anyway, I've tested TinyLlama 1B and it's "great"; I hope to see the fixed version. The current one has a lot of repetitions.
Great work! Can you do a Mistral 1B TinyLlama? Mistral AI is good.