jrzaurin / pytorch-widedeep

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch
Apache License 2.0
1.27k stars, 188 forks

a question #217

Closed xylovezxy closed 1 month ago

xylovezxy commented 1 month ago

Hi! If I want to use CLIP to extract image and text features, can I use this library to train CLIP together?

jrzaurin commented 1 month ago

Hey @xylovezxy

My first comment is that you can certainly extract text and image features using CLIP and then feed them into the library as continuous features of a tabular dataset.
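
A minimal sketch of this first option, assuming the CLIP embeddings have already been computed offline. The helper names `embedding_to_columns` and `add_embedding_columns`, the `clip_img` prefix and the toy 3-dimensional vectors are all hypothetical and not part of pytorch-widedeep; the idea is simply to flatten each embedding into one continuous column per dimension:

```python
from typing import Dict, List, Sequence


def embedding_to_columns(embedding: Sequence[float], prefix: str) -> Dict[str, float]:
    """Flatten one embedding vector into {prefix_0: v0, prefix_1: v1, ...}."""
    return {f"{prefix}_{i}": float(v) for i, v in enumerate(embedding)}


def add_embedding_columns(
    rows: List[dict], embeddings: Sequence[Sequence[float]], prefix: str
) -> List[dict]:
    """Return copies of the tabular rows with one extra column per embedding dimension."""
    if len(rows) != len(embeddings):
        raise ValueError("rows and embeddings must have the same length")
    return [
        {**row, **embedding_to_columns(emb, prefix)}
        for row, emb in zip(rows, embeddings)
    ]


# Toy data: two items with made-up 3-dimensional "embeddings"
# (real CLIP vectors have many more dimensions).
items = [{"item_id": 1, "year": 1995}, {"item_id": 2, "year": 1999}]
img_emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

items_with_emb = add_embedding_columns(items, img_emb, prefix="clip_img")

# Names of the new columns, to be treated as continuous features downstream.
continuous_cols = [c for c in items_with_emb[0] if c.startswith("clip_img_")]
```

The resulting column names could then be declared as continuous columns when preprocessing the tabular data (for instance via the `continuous_cols` argument of the library's `TabPreprocessor`).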

My second comment is a question: training together against what? I mean, is your problem that you have text and images and you want to classify something?

Finally, my third comment is that this library allows you to use ANY model you want as long as it has a property called output_dim (look at the examples, or here: https://github.com/jrzaurin/pytorch-widedeep/blob/master/pytorch_widedeep/models/_base_wd_model_component.py).

so you can do:

# note that BaseWDModelComponent is simply a wrapper around nn.Module that forces the presence of 'output_dim'
from pytorch_widedeep.models import WideDeep
from pytorch_widedeep.models._base_wd_model_component import BaseWDModelComponent

my_text_processor = MyTextProcessor(...)
my_image_processor = MyImageProcessor()

X_text = my_text_processor(texts)
X_img = my_image_processor(images)

class MyClip(BaseWDModelComponent):
    ...  # define forward() and the 'output_dim' property here

# in this example the weights would be shared (this assumes both components wrap the same underlying CLIP model)
my_text_component = MyClip(...)
my_image_component = MyClip(...)

model = WideDeep(deeptext=my_text_component, deepimage=my_image_component, pred_dim=...)

# Proceed as usual with the training
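
To make the `output_dim` requirement concrete, here is a minimal sketch of a custom component. It is an illustration under stated assumptions, not the library's own code: the class name `PrecomputedEmbeddingHead` is hypothetical, the input size of 512 is just an example embedding size, and the class subclasses plain `nn.Module` so that the snippet depends only on PyTorch (inside pytorch-widedeep it would subclass `BaseWDModelComponent` as in the snippet above).

```python
import torch
from torch import nn


class PrecomputedEmbeddingHead(nn.Module):
    """Hypothetical component: projects precomputed embeddings to a smaller size."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self._output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X has shape (batch_size, input_dim)
        return self.proj(X)

    @property
    def output_dim(self) -> int:
        # Size of the last dimension returned by forward()
        return self._output_dim


head = PrecomputedEmbeddingHead(input_dim=512, hidden_dim=64)
out = head(torch.randn(4, 512))  # batch of 4 made-up embeddings
```

The key point is that `output_dim` reports the size of whatever `forward` returns, so the model that combines the components knows how wide each one is.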

Let me know if this helps

xylovezxy commented 1 month ago

For the MovieLens dataset, I would like to set the target to 0 or 1 based on the ratings, so the task becomes a binary classification problem. I want to use CLIP to extract image and text features to enrich the item-side features.
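
A minimal sketch of the target construction described above. The cut-off is an assumption, since the comment does not say which ratings should count as positive; here a rating of 4 or more maps to 1, and the function name `binarize_ratings` is hypothetical:

```python
from typing import List, Sequence


def binarize_ratings(ratings: Sequence[float], threshold: float = 4.0) -> List[int]:
    """Map each rating to 1 if it is >= threshold, else 0 (threshold is an assumption)."""
    return [1 if r >= threshold else 0 for r in ratings]


ratings = [5, 3, 4, 1, 2]
targets = binarize_ratings(ratings)  # -> [1, 0, 1, 0, 0]
```

A different threshold changes the class balance, so it is worth checking the resulting share of positives before training.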