Closed mikowals closed 12 months ago
Hi @mikowals, yes, the scores are pointing to the wrong place in memory. But then I can't understand how the tokens are working properly :)
Also, the next line is redundant. Could you please remove it as well?
@mikowals thank you!
I removed the offset incrementing line as requested.
I don't think the scores are working as well as they could; I think most of them just go unused. Karpathy has some comments in his repo about getting good results using only 4096 tokens.
I found this problem because I was sorting the vocab and scores to do binary search and began getting `signal SIGSEGV: invalid address (fault address: 0x0)`. So I think the problems were already there; they were just too subtle to detect without looping over and reading all the scores.
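For context, the sorting I mentioned looks roughly like this. This is a hypothetical sketch in Python rather than Mojo (the function names are mine, not from the repo): sort the token strings with their scores kept in lockstep, then binary-search on lookup. If the scores array points at garbage memory, touching every score this way surfaces the fault immediately.

```python
import bisect

def build_index(vocab, scores):
    # Sort token strings together with their scores, so that
    # a binary-search hit returns the matching score.
    pairs = sorted(zip(vocab, scores))
    tokens = [t for t, _ in pairs]
    token_scores = [s for _, s in pairs]
    return tokens, token_scores

def lookup(tokens, token_scores, token):
    # Binary search for `token`; return its score, or None if absent.
    i = bisect.bisect_left(tokens, token)
    if i < len(tokens) and tokens[i] == token:
        return token_scores[i]
    return None
```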
That's a good catch!
`tok.vocab_scores = buf.data.offset(buf.offset).bitcast[DType.float32]()` appears to be overwriting the scores with junk data. `buf.offset` is expected to be at the end of the buffer at that point, because all the data has already been read, with the offset advanced by each read. The `tokenizer.bin` file format can be found here.
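To make the layout concrete, here is a minimal Python sketch of reading that format, assuming the llama2.c-style `tokenizer.bin`: a leading `int32` max token length, then `vocab_size` records of (`float32` score, `int32` length, `length` raw bytes). The point is that the scores are interleaved with the token bytes and must be collected during the read loop; by the time the whole file has been consumed, the read offset sits at the end of the buffer, so bitcasting from there yields junk.

```python
import io
import struct

def read_tokenizer(data: bytes, vocab_size: int):
    """Parse a llama2.c-style tokenizer.bin buffer.

    Layout: int32 max_token_length, then vocab_size records of
    (float32 score, int32 length, `length` bytes of token text).
    """
    buf = io.BytesIO(data)
    (max_token_length,) = struct.unpack("<i", buf.read(4))
    vocab, scores = [], []
    for _ in range(vocab_size):
        # Score and token must be picked up here, inside the loop;
        # after the loop the offset is past all of the score data.
        score, length = struct.unpack("<fi", buf.read(8))
        scores.append(score)
        vocab.append(buf.read(length).decode("utf-8", errors="replace"))
    return max_token_length, vocab, scores
```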