tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars 139 forks

Use String, Dict, and read_bytes to shorten and simplify #91

Closed mikowals closed 3 months ago

mikowals commented 4 months ago

This is based on the current nightly branch (Mojo 2024.4.161). It is a demo of some cleanups that can happen now that Mojo and its stdlib have added a lot of functionality that was missing when this was originally released. There could probably also be another round to remove TensorSlice and just use List[TensorF32] for each layer of weights.
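To illustrate the kind of simplification the PR title describes (using Dict to avoid the vocab sort), here is a rough sketch. It is written in Python rather than Mojo, since the Mojo nightly API changes frequently, and the function names and vocab contents are hypothetical, not taken from the actual diff:

```python
import bisect

def lookup_sorted(sorted_vocab, token):
    # old style: keep (token, id) pairs sorted by token and binary-search
    i = bisect.bisect_left(sorted_vocab, (token,))
    if i < len(sorted_vocab) and sorted_vocab[i][0] == token:
        return sorted_vocab[i][1]
    return -1

def build_index(vocab):
    # new style: one pass to build a hash map; no up-front sort needed,
    # which is roughly what Mojo's Dict now makes straightforward
    return {tok: i for i, tok in enumerate(vocab)}

vocab = ["<unk>", "he", "hello", "ll", "o"]  # toy vocab for illustration
index = build_index(vocab)
print(index["hello"])  # both approaches agree on the token id
print(lookup_sorted(sorted((t, i) for i, t in enumerate(vocab)), "hello"))
```

Both lookups return the same id; the Dict version just skips sorting the vocab at load time.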

The main changes are:

I am not sure this is ready to merge, mostly because of the handling of special bytes in wrap and the old print function. I tried to preserve the functionality but haven't tested extensively. Ideally we could get proper handling from String, and if not, fix it in the stdlib.

Also, I think the stdlib is going to shift to List[UInt8] for all bytes representations, including in String. So this change could also wait until after that has happened and been incorporated.

I didn't mess with Llamatune since this spans Mojo versions, but locally there was no change in tokens/sec. It is probably loading faster and more memory-efficiently, since this avoids the vocab sort and no longer reads the entire tokenizer.bin.
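A sketch of what "no longer reads the entire tokenizer.bin" could look like: parse length-prefixed records one at a time instead of slurping the whole file. This is a Python illustration, and the record layout assumed here (float32 score, int32 length, then the token bytes, matching the llama2.c-style format) is an assumption for demonstration, not a claim about the actual diff:

```python
import io
import struct

def load_vocab(f, vocab_size):
    # Read exactly one record at a time: 8-byte header, then `length` bytes.
    # Memory use is bounded by one token, not the whole file.
    vocab = {}
    for i in range(vocab_size):
        score, length = struct.unpack("<fi", f.read(8))
        token = f.read(length).decode("utf-8")
        vocab[token] = (i, score)
    return vocab

# Build a tiny in-memory "file" to exercise the parser.
buf = io.BytesIO()
for i, tok in enumerate([b"a", b"bc"]):
    buf.write(struct.pack("<fi", float(i), len(tok)))
    buf.write(tok)
buf.seek(0)

vocab = load_vocab(buf, 2)
print(vocab["bc"])
```

Combined with a Dict built during this single pass, there is no separate sort step and no full-file buffer.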