rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.38k stars 894 forks

[FEA] In-memory cupy arrays for TF sentence encoder #8750

Closed mlahir1 closed 5 months ago

mlahir1 commented 3 years ago

To send data into the sentence encoder, the cudf series needs to be converted to a series on host, which is inefficient. We need a method to convert these string arrays to a cupy array, or some other format that can be loaded directly into the TF sentence encoder.

example:

import cudf
import tensorflow_hub as hub

def get_universal_sentence_encoder_model():
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
    model = hub.load(module_url)
    return model

sentence_model = get_universal_sentence_encoder_model()

# df.a is a cudf string Series; values_host copies it to CPU memory,
# which is the inefficient step this issue is about.
sentence_list = df.a.values_host
sentence_model(sentence_list)
beckernick commented 3 years ago

Hi @mlahir1, thanks for filing an issue. Is the behavior you'd like here for a specific Tensorflow model to accept a GPU array of strings, rather than an array of strings on the CPU?

mlahir1 commented 3 years ago

Yes, that's right @beckernick. It's not for a specific model: any model should be able to take input directly from GPU memory, rather than going back and forth to host memory. @VibhuJawa can elaborate on this.

VibhuJawa commented 3 years ago

@mlahir1, thanks for raising the issue.

I went down the rabbit hole to figure out how we can enable this. The natural place to enable it would be in tokenizers (i.e. converting text to numeric tensors), whose output can be fed directly into a TensorFlow model.
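To make the idea concrete, here is a toy sketch (vocabulary and function names are invented for illustration, not a cudf or TensorFlow API) of what tokenization means in this context: mapping strings to fixed-size numeric tensors, which are exactly the kind of data that could stay on the GPU:

```python
# Toy illustration of "tokenization": strings in, numeric tensors out.
import numpy as np

VOCAB = {"<pad>": 0, "<unk>": 1, "gpu": 2, "dataframes": 3, "are": 4, "fast": 5}

def tokenize(sentences, max_len=4):
    """Map each sentence to a fixed-length row of integer token ids."""
    out = np.zeros((len(sentences), max_len), dtype=np.int32)
    for i, s in enumerate(sentences):
        ids = [VOCAB.get(w, VOCAB["<unk>"]) for w in s.lower().split()][:max_len]
        out[i, :len(ids)] = ids
    return out

tokens = tokenize(["GPU dataframes are fast", "hello world"])
# tokens is a plain numeric array; an equivalent array built on the GPU
# (e.g. a cupy array) could feed a model without a host round-trip.
```

If tokenization could be split out like this, the string-to-tensor step is the only part that touches strings, and the model itself only ever sees numeric arrays.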

Sadly, there does not seem to be a straightforward way for a user to separate out tokenization from the model with TensorFlow. There are some workarounds people use, but I don't think these work for the Universal Sentence Encoder model (they only work for the Multilingual Universal Sentence Encoder model).

I have raised a question about it here.

Related Issues :

  1. https://github.com/tensorflow/hub/issues/662
  2. https://github.com/tensorflow/hub/issues/686

Possible Workaround:

Another workaround may be to use an equivalent model from PyTorch/HuggingFace, such as https://huggingface.co/johngiorgi/declutr-base.

The tokenizer used in this model might not be too hard for us to implement using cudf if we really need something like this.
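A minimal numpy sketch of the split this workaround enables: tokenization happens separately (the part cudf could potentially do on the GPU), and the model consumes only numeric ids. The embedding "model" below is a random-lookup stand-in for illustration, not declutr-base or any real encoder:

```python
# Stand-in "sentence encoder": embedding lookup + mean pooling over tokens.
import numpy as np

def embed(token_ids, vocab_size=100, dim=8, seed=0):
    """Map padded token ids (batch, seq) to sentence vectors (batch, dim)."""
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((vocab_size, dim))  # embedding table
    return table[token_ids].mean(axis=1)            # mean-pool per sentence

ids = np.array([[2, 4, 5, 0], [1, 1, 0, 0]])  # padded token ids from a tokenizer
vectors = embed(ids)  # one fixed-size vector per sentence
```

Because the model side only touches integer ids, any tokenizer producing those ids (HuggingFace on CPU, or a hypothetical cudf implementation on GPU) can be swapped in without changing the model.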

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 5 months ago

With the work going into https://github.com/rapidsai/cudf/issues/14926, we will soon expose a path to give users direct views of our data as host or device Arrow arrays. Since Arrow is a standardized interchange format, that will be the right approach going forward. cupy arrays aren't the right choice here because cupy doesn't support strings. TF already supports loading Arrow data, so if this request arises again, the right thing to do is to make sure the dataset loader can handle Arrow device data.