Closed: mlahir1 closed this issue 5 months ago
Hi @mlahir1, thanks for filing an issue. Is the behavior you'd like here for a specific TensorFlow model to accept a GPU array of strings, rather than an array of strings on the CPU?
Yes, that's right @beckernick. It's not for a specific model: any model that takes string input should be able to keep it in GPU memory rather than going back and forth to host memory. @VibhuJawa can elaborate on this.
@mlahir1 . Thanks for raising the issue.
I went down the rabbit hole to figure out how we can enable this. The natural place to enable it is in the tokenizer (i.e. converting text to numeric tensors), whose output can be fed directly into the TensorFlow model.
Sadly, there does not seem to be a straightforward way for a user to separate tokenization from the model in TensorFlow. There are some work-arounds people use, but I don't think they work for the Universal Sentence Encoder model (they only work for the Multilingual Universal Sentence Encoder model).
I have raised a question about it here.
Related issues:
Another work-around may be to use an equivalent PyTorch/HuggingFace model, such as https://huggingface.co/johngiorgi/declutr-base .
The tokenizer used in this model might not be too hard for us to implement using cudf if we really need something like this.
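As a rough illustration, here is a minimal sketch of the kind of computation a cudf-side tokenizer would have to reproduce: map text to the numeric ids the model consumes. This is pure Python with a made-up vocabulary, not the real declutr-base tokenizer (which is a BPE tokenizer), so it only shows the shape of the step, not the algorithm.

```python
# Hypothetical vocabulary; the real model ships its own.
vocab = {"<unk>": 0, "hello": 1, "world": 2}

def tokenize(batch):
    # Map each whitespace token to a vocabulary id, falling back to
    # the unknown-token id. A cudf implementation would do this on
    # device over a strings column instead of Python lists.
    return [[vocab.get(tok, vocab["<unk>"]) for tok in line.split()]
            for line in batch]

ids = tokenize(["hello world", "hello gpu"])
# ids == [[1, 2], [1, 0]]
```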
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
With the work going into https://github.com/rapidsai/cudf/issues/14926 we will soon expose a path to give users direct views of our data as host or device arrow arrays. Since arrow is a standardized interchange format, that will be the right approach for this going forward. cupy arrays aren't the right choice here because cupy doesn't support strings. TF already supports loading arrow data, so if this request arises again the right thing to do is to make sure the dataset loader can handle arrow device data.
To send data into sentence_encoder, the cudf series currently needs to be converted to a series on host, which is inefficient. There needs to be a method to convert these string arrays to a cupy array, or some other format, that can be loaded directly into the TF sentence_encoder.
Example:
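The original example snippet is not preserved above; the following stand-in sketch illustrates the round-trip being described. Both `to_host` and `sentence_encoder` are placeholders (for something like `cudf.Series.to_pandas()` and the TF Hub Universal Sentence Encoder, respectively), so the sketch runs without a GPU or TensorFlow installed.

```python
def to_host(gpu_strings):
    # Placeholder for cudf.Series.to_pandas()/.to_arrow(): copies
    # every string from device memory to host memory.
    return list(gpu_strings)

def sentence_encoder(host_strings):
    # Placeholder for the TF Universal Sentence Encoder, which today
    # only accepts host (CPU) string tensors. Returns a fake
    # "embedding" (token count) per sentence.
    return [len(s.split()) for s in host_strings]

gpu_series = ["hello world", "rapids cudf"]  # imagine: device memory
# The device-to-host copy below is the inefficiency this issue asks
# to eliminate.
embeddings = sentence_encoder(to_host(gpu_series))
```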