Open bclavie opened 3 months ago
(cc @raphaelsty)
Awesome findings @bclavie! If you want, you can create a MR, or feel free to share some code here and I'll co-author the commit with you :)
Thanks!
My personal branch is a complete mess since it's mostly RAGatouille related, so if you don't mind I can share a few snippets!
In `models/colbert.py`'s `__init__`:
```python
if self.model_kwargs and self.model_kwargs.get("torch_dtype") in (torch.float16, "float16"):
    self.half()
```
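For reference, a quick sketch of what would trigger that check when loading a model; the checkpoint name is just a placeholder and the exact `models.ColBERT` kwargs are my assumption:

```python
# Hypothetical usage: either a torch dtype or the string "float16" in model_kwargs
# would trip the check above and downcast the weights via self.half().
import torch
from pylate import models

model = models.ColBERT(
    model_name_or_path="answerdotai/answerai-colbert-small-v1",  # placeholder checkpoint
    model_kwargs={"torch_dtype": torch.float16},                 # or {"torch_dtype": "float16"}
)
```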
in `rank.py`:
```python
def rerank(
    documents_ids: list[list[int | str]],
    queries_embeddings: list[list[float | int] | np.ndarray | torch.Tensor],
    documents_embeddings: list[list[float | int] | np.ndarray | torch.Tensor],
    device: str = None,
    fp16: bool = False,
) -> list[list[dict[str, float]]]:
    ...
    if fp16:
        query_embeddings = query_embeddings.astype(np.float16)
        query_documents_embeddings = [
            doc_embedding.astype(np.float16) for doc_embedding in query_documents_embeddings
        ]
    ...
```
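For context, this is roughly how I'd expect the flag to be used once it exists; shapes and ids below are made up, and `fp16` is of course the proposed argument, not something that exists today:

```python
# Sketch of calling rank.rerank with the proposed fp16 flag; embeddings are random
# placeholders of shape (num_tokens, embedding_dim).
import numpy as np
from pylate import rank

queries_embeddings = [np.random.rand(32, 128).astype(np.float32)]
documents_embeddings = [
    [np.random.rand(180, 128).astype(np.float32), np.random.rand(150, 128).astype(np.float32)]
]

results = rank.rerank(
    documents_ids=[["doc_1", "doc_2"]],
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
    fp16=True,  # proposed: cast everything to float16 before scoring
)
```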
in `retrieve/colbert.py`:
```python
class ColBERT:
    ...

    def __init__(self, index: Voyager, fp16: bool = False) -> None:
        self.index = index
        self.fp16 = fp16

    def retrieve(
        self,
        ...
    ):
        ...
        if self.fp16:
            documents_embeddings = [
                [doc_embedding.astype(np.float16) for doc_embedding in query_docs]
                for query_docs in documents_embeddings
            ]
        ...
        reranking_results.extend(
            rerank(
                documents_ids=documents_ids,
                queries_embeddings=queries_embeddings_batch,
                documents_embeddings=documents_embeddings,
                device=device,
                fp16=self.fp16,
            )
        )
```
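And the retriever side would then look something like this (again a sketch: the index is assumed to already exist, the checkpoint name is a placeholder, and `fp16=True` is the proposed option):

```python
# Sketch of the proposed opt-in on the retriever: documents fetched from Voyager
# get cast to float16 before being forwarded to rerank().
from pylate import indexes, models, retrieve

index = indexes.Voyager(index_folder="pylate-index", index_name="index")  # assumed existing index
model = models.ColBERT(model_name_or_path="answerdotai/answerai-colbert-small-v1")
retriever = retrieve.ColBERT(index=index, fp16=True)  # fp16 is the proposed flag

queries_embeddings = model.encode(["what is late interaction?"], is_query=True)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=10)
```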
Sorry this is messy and a bit hardcoded; you'd probably also want the option to use bfloat16 for the model loading.
Out of scope for the main issue, but if you're looking for better numpy dtype support for future improvements, ml-dtypes adds full support for bfloat16 and various fp8 formats (including the Voyager-friendly E4M3) to numpy arrays.
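Roughly, assuming the `ml_dtypes` package is installed (dtype names taken from its docs; double-check which fp8 variant Voyager actually expects):

```python
# ml-dtypes registers extra scalar types that plain numpy arrays can use directly.
import ml_dtypes
import numpy as np

embeddings = np.random.rand(32, 128).astype(np.float32)
as_bf16 = embeddings.astype(ml_dtypes.bfloat16)       # bfloat16-backed numpy array
as_fp8 = embeddings.astype(ml_dtypes.float8_e4m3fn)   # an E4M3 fp8 variant
print(as_bf16.dtype, as_fp8.dtype)
```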
Passing `model_kwargs` to the initialization of the model should work (at least for the base transformer part; we might have to tweak things a bit to also apply it to the dense layer).
I suspect this is because I had to create a subfolder for the base transformer rather than using the root folder, to make the colbert-small repository compatible with both OG ColBERT and PyLate. As a result, the loading does not go through this code block as usual and so it does not load properly. Back then, I also had to tweak the config a bit so the dense layer would not be added to the base model.
I can explore more once I am back from vacation.
> I can explore more once I am back from vacation.
May I once again suggest that you are actually off on your time off? 😂
> I suspect this is because I had to create a subfolder for the base transformer rather than using the root folder, to make the colbert-small repository compatible with both OG ColBERT and PyLate.
Oh, I see. IMO figuring out a "perfect" solution is pretty important, especially as I hear some people maintain a late-interaction wrapper library and are really looking forward to making it completely seamless/invisible to the user to switch backends back and forth between pylate and stanfordnlp. It's mostly going smoothly so far, save for some small issues: the dtype problem and having full interoperability between models. I'll open a separate issue in the next few days to request utils to convert models on the fly 😄
I cleaned up the loading logic in #52 to load the weights directly from a stanford-nlp repository. This means we no longer need a subfolder for the transformer module: the repo is either a stanford-nlp one or a PyLate one, and the module sits at the root. You'll have to roll back the colbert-small repo to the previous commit so it only includes the stanford-nlp weights, but you can now load the model in fp16 using `model_kwargs={"torch_dtype": torch.float16}`. The dense layer is also cast to the same dtype, so the output will have the correct type.
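If it helps, a quick (hypothetical) sanity check that both the transformer and the dense projection end up in half precision after loading this way; the checkpoint name is a placeholder:

```python
# After loading with model_kwargs={"torch_dtype": torch.float16}, every parameter,
# including the Dense module's weights, should report torch.float16.
import torch
from pylate import models

model = models.ColBERT(
    model_name_or_path="answerdotai/answerai-colbert-small-v1",
    model_kwargs={"torch_dtype": torch.float16},
)
assert all(p.dtype == torch.float16 for p in model.parameters())
```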
Tell me if I am wrong, but for the rest, adding this to `rank.py` seems enough:
```python
query_embeddings = query_embeddings.astype(np.float16)
query_documents_embeddings = [
    doc_embedding.astype(np.float16) for doc_embedding in query_documents_embeddings
]
```
with an attribute that can be set in `rerank` and is forwarded by the `retrieve` function. This will cast the queries and the documents to fp16 (whether they come from Voyager or were inferred in fp32), and the tensors created from them will have the same dtype, which makes the score computation happen in fp16 as well. Am I missing something? I think there are more cases to handle.
Edit: this naive casting seems to hurt performance compared to fp32 on my high-end setup.
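To make the dtype propagation concrete, a pure numpy/torch sketch (shapes made up): once the numpy arrays are float16, the values computed from them and the tensors built from them stay in float16.

```python
# A numpy matmul on float16 inputs stays float16, and torch.from_numpy() preserves
# the dtype, so downstream scoring inherits the half precision of the embeddings.
import numpy as np
import torch

query_embeddings = np.random.rand(32, 128).astype(np.float16)
doc_embeddings = np.random.rand(180, 128).astype(np.float16)

similarities = query_embeddings @ doc_embeddings.T   # dtype: float16
maxsim_score = similarities.max(axis=1).sum()        # dtype: float16
print(maxsim_score.dtype, torch.from_numpy(query_embeddings).dtype)  # float16 torch.float16
```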
Hey! Congrats on the release 😄
My first issue, as promised to @NohTow: scoring is pretty slow, and I think it could be greatly improved by adding extra flexibility in terms of dtype. Notably:

- `model_kwargs={"torch_dtype": torch.float16}` results in the model being fp32, which slows things down a lot for almost no performance gain, especially on weaker hardware. A separate `.half()` call afterwards is needed.
- `retrieve` doesn't have an option to convert the documents fetched from Voyager from float32 to float16, and neither does `colbert_scores` or `rerank`. This means we end up needing to do expensive float32 scoring without being able to opt out.

On my machine, with very dirty changes, going from the hardcoded float32 to this version took the eval time on SciFact from ~1.35s/query down to ~0.85s/query. I think this is well worth implementing since the complexity isn't gigantic!