mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Using base model on GPU with no bfloat16 #163

Open yichen0104 opened 4 months ago

yichen0104 commented 4 months ago

Hi. I'm trying to run the mistral-7B-v0.1 model with mistral-inference on an Nvidia Tesla V100 32GB GPU. Since the V100 has no bfloat16 support, I'd like to know whether the runtime code can be configured to run in fp16 mode, or whether it will always raise an error identical to the one in Issue #160. I've tried both mistral-demo and the sample Python code in the README, and both produce the same error. Thanks in advance.
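(Aside: a quick sanity check for native bfloat16 support, assuming a reasonably recent PyTorch build, is the snippet below; exact behaviour can vary across PyTorch/CUDA versions.)

```python
import torch

# On a pre-Ampere GPU such as the Tesla V100 this is expected to indicate
# that native bfloat16 is not available.
print(torch.cuda.is_bf16_supported())
```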

the-crypt-keeper commented 4 months ago

@yichen0104 The underlying library actually supports it; the problem is just that the dtype is not exposed via the CLI. I was able to make it work on my 2x3060 + 2xP100 machine by applying the following patch:

```diff
diff --git a/src/mistral_inference/main.py b/src/mistral_inference/main.py
index a5ef3a0..d97c4c9 100644
--- a/src/mistral_inference/main.py
+++ b/src/mistral_inference/main.py
@@ -42,7 +42,7 @@ def load_tokenizer(model_path: Path) -> MistralTokenizer:
 
 def interactive(
     model_path: str,
-    max_tokens: int = 35,
+    max_tokens: int = 512,
     temperature: float = 0.7,
     num_pipeline_ranks: int = 1,
     instruct: bool = False,
@@ -62,7 +62,7 @@ def interactive(
     tokenizer: Tokenizer = mistral_tokenizer.instruct_tokenizer.tokenizer
 
     transformer = Transformer.from_folder(
-        Path(model_path), max_batch_size=3, num_pipeline_ranks=num_pipeline_ranks
+        Path(model_path), max_batch_size=3, num_pipeline_ranks=num_pipeline_ranks, dtype=torch.float16
     )
 
     # load LoRA
```
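For the Python route (rather than the CLI), no patch should be needed at all, since `Transformer.from_folder` already accepts a `dtype` argument, as the patch above shows. Below is a minimal sketch based on the README completion example; the model path is a placeholder, and import paths plus the exact `generate` signature may differ slightly between mistral-inference versions (e.g. `mistral_inference.model` vs. `mistral_inference.transformer`).

```python
from pathlib import Path

import torch

from mistral_inference.transformer import Transformer  # mistral_inference.model on older releases
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

model_path = "/path/to/mistral-7B-v0.1"  # placeholder

# Load the tokenizer and the model weights; dtype=torch.float16 avoids bfloat16
# on pre-Ampere GPUs such as the Tesla V100.
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model")
model = Transformer.from_folder(Path(model_path), max_batch_size=1, dtype=torch.float16)

# Plain text completion with the base (non-instruct) model.
prompt = "The capital of France is"
tokens = tokenizer.instruct_tokenizer.tokenizer.encode(prompt, bos=True, eos=False)

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=64,
    temperature=0.7,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```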