lightonai / pylate

Late Interaction Models Training & Retrieval
https://lightonai.github.io/pylate/
MIT License

Add dtype flexibility #49

Open bclavie opened 2 months ago

bclavie commented 2 months ago

Hey! Congrats on the release 😄

My first issue, as promised to @NohTow: scoring is pretty slow, and I think it could be greatly improved by adding some extra flexibility around dtypes. Notably:

On my machine, with very dirty changes, going from the hardcoded float32 to the fp16 version sketched below brought the eval time on Scifact from ~1.35s/query down to ~0.85s/query. I think this is well worth implementing since the complexity isn't gigantic!

There is also a more minor typing-flexibility change.
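As a rough illustration of the idea (a toy sketch, not PyLate code: the shapes, sizes, and any observed speedup are purely illustrative and machine-dependent), MaxSim scoring can be timed in float32 vs float16 like this:

import time

import torch


def maxsim_scores(queries: torch.Tensor, documents: torch.Tensor) -> torch.Tensor:
    # ColBERT-style late interaction: for each query token, take the max
    # similarity over document tokens, then sum over query tokens.
    similarities = torch.einsum("bqd,bkd->bqk", queries, documents)
    return similarities.max(dim=-1).values.sum(dim=-1)


device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 matmul is only reliably fast/supported on GPU, so only benchmark it there.
dtypes = (torch.float32, torch.float16) if device == "cuda" else (torch.float32,)

for dtype in dtypes:
    queries = torch.randn(256, 32, 128, device=device, dtype=dtype)
    documents = torch.randn(256, 300, 128, device=device, dtype=dtype)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    maxsim_scores(queries, documents)
    if device == "cuda":
        torch.cuda.synchronize()
    print(dtype, f"{time.perf_counter() - start:.4f}s")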

bclavie commented 2 months ago

(cc @raphaelsty)

raphaelsty commented 2 months ago

Awesome findings @bclavie! Feel free to create an MR, or share some code here and I'll co-author the commit with you :)

bclavie commented 2 months ago

Thanks!

My personal branch is a complete mess since it's mostly RAGatouille related, so if you don't mind I can share a few snippets!

In models/colbert.py's __init__:

        # Cast the whole model to fp16 when requested through model_kwargs.
        if self.model_kwargs and self.model_kwargs.get("torch_dtype") in (torch.float16, "float16"):
            self.half()

in rank.py:

def rerank(
    documents_ids: list[list[int | str]],
    queries_embeddings: list[list[float | int] | np.ndarray | torch.Tensor],
    documents_embeddings: list[list[float | int] | np.ndarray | torch.Tensor],
    device: str | None = None,
    fp16: bool = False,
) -> list[list[dict[str, float]]]:
    ...
    # Inside the per-query loop, cast both sides to fp16 before scoring:
    if fp16:
        query_embeddings = query_embeddings.astype(np.float16)
        query_documents_embeddings = [
            doc_embedding.astype(np.float16)
            for doc_embedding in query_documents_embeddings
        ]
    ...
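For context, a call using this proposed signature could look as follows (hypothetical: the fp16 flag is part of this suggestion rather than the released API, and the ids and embedding shapes are made up):

import numpy as np

from pylate import rank

documents_ids = [["doc_1", "doc_2"], ["doc_3", "doc_4"]]
queries_embeddings = [np.random.rand(32, 128).astype(np.float32) for _ in documents_ids]
documents_embeddings = [
    [np.random.rand(180, 128).astype(np.float32) for _ in ids] for ids in documents_ids
]

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
    device="cuda",
    fp16=True,  # proposed flag: cast embeddings to float16 before scoring
)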

in retrieve/colbert.py:

class ColBERT:
    ...
    def __init__(self, index: Voyager, fp16: bool = False) -> None:
        self.index = index
        self.fp16 = fp16

    def retrieve(
        self,
        ...
    ):
        ...
        # Cast the documents embeddings returned by the index before reranking.
        if self.fp16:
            documents_embeddings = [
                [doc_embedding.astype(np.float16) for doc_embedding in query_docs]
                for query_docs in documents_embeddings
            ]
        ...
        reranking_results.extend(
            rerank(
                documents_ids=documents_ids,
                queries_embeddings=queries_embeddings_batch,
                documents_embeddings=documents_embeddings,
                device=device,
                fp16=self.fp16,
            )
        )

Sorry this is messy and a bit hardcoded; you'd probably also want the option to use bfloat16 for the model loading.

bclavie commented 2 months ago

Out of scope for the main issue, but if you're looking for better dtype support in numpy for future improvements, ml-dtypes adds full support for bfloat16 and various fp8 formats (including the Voyager-friendly E4M3) to numpy arrays.
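A minimal sketch of what that enables, assuming the ml_dtypes package is installed alongside numpy (the dtype names follow its public API):

import ml_dtypes
import numpy as np

embeddings = np.random.rand(32, 128).astype(np.float32)

# bfloat16 and fp8 E4M3 become regular numpy dtypes once ml_dtypes is imported.
bf16_embeddings = embeddings.astype(ml_dtypes.bfloat16)
fp8_embeddings = embeddings.astype(ml_dtypes.float8_e4m3fn)

print(bf16_embeddings.dtype, fp8_embeddings.dtype)  # bfloat16 float8_e4m3fn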

NohTow commented 2 months ago

Passing model_kwargs to the initialization of the model should work (at least for the base transformer part; we might have to tweak things a bit to also apply it to the dense layer). I suspect the problem here is that I had to create a subfolder for the base transformer rather than using the root folder, to make the colbert-small repository compatible with both the original ColBERT and PyLate. Because of that, the loading does not go through this code block as usual and thus does not load properly. I also had to tweak the config a bit back then so that the dense layer is not added to the base model.

I can explore more once I am back from vacation.

bclavie commented 2 months ago

I can explore more once I am back from vacation.

May I once again suggest that you are actually off on your time off? 😂

I suspect this is because I had to create a subfolder for the base transformer rather than using the root folder to make the colbert-small repository compatible with both OG ColBERT and PyLate.

Oh I see. IMO figuring out a "perfect" solution is pretty important, especially as I hear some people maintain a late-interaction wrapper library and are really looking forward to making it completely seamless/invisible to the user to switch backends back and forth between pylate and stanfordnlp. It's mostly going smoothly so far, save for some small issues, the dtype problem, and full interoperability between models. I'll open a separate issue in the next few days to request utils to convert models on the fly 😄

NohTow commented 1 month ago

I cleaned up the loading logic in #52 to load the weights directly from a stanford-nlp repository. This means we no longer need a subfolder for the transformer module: whether it's a stanford-nlp repo or a PyLate one, it now lives at the root. You'll have to roll back the colbert-small repo to the previous commit so it only includes the stanford-nlp weights, but you can now load the model in fp16 using model_kwargs={"torch_dtype": torch.float16}. The dense layer is also cast to the same dtype, so the output will have the correct type.
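Concretely, loading a model in fp16 after #52 would look something like this (a minimal sketch; the checkpoint name is just an example):

import torch

from pylate import models

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",  # example checkpoint
    model_kwargs={"torch_dtype": torch.float16},
)

# The base transformer and the dense projection now both run in fp16.
queries_embeddings = model.encode(["what is late interaction?"], is_query=True)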

Tell me if I am wrong but for the rest, adding this to rank.py seems enough:

query_embeddings = query_embeddings.astype(np.float16)
query_documents_embeddings = [
    doc_embedding.astype(np.float16) for doc_embedding in query_documents_embeddings
]

with an attribute that can be set in rerank and is forwarded by the retrieve function. This will cast the queries and the documents to fp16 (whether they come from Voyager or were inferred in fp32), and the created tensors will have the same type, so the score computation also runs in fp16. Am I missing something? I think there are more cases to handle.
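A toy sketch of that dtype propagation (not PyLate code; shapes are made up):

import numpy as np
import torch

# Casting the numpy embeddings to float16 carries through torch.from_numpy,
# so the downstream score computation runs in fp16 as well.
query_embeddings = np.random.rand(32, 128).astype(np.float16)
documents_embeddings = np.random.rand(200, 128).astype(np.float16)

query_tensor = torch.from_numpy(query_embeddings)          # torch.float16
documents_tensor = torch.from_numpy(documents_embeddings)  # torch.float16

# Note: fp16 matmul on CPU needs a recent PyTorch; the speedup mainly shows on GPU.
scores = torch.einsum("qd,kd->qk", query_tensor, documents_tensor).max(dim=-1).values.sum()
print(scores.dtype)  # torch.float16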

Edit: this naive casting actually seems to hurt performance compared to fp32 on my high-end setup.