How to handle inputs in the pytorch-widedeep library if the dataset already contains vectorized encodings for text and image columns?

stoensin commented 3 months ago

Thanks to all the developers for their tremendous contributions, this open-source project is fantastic.

As mentioned in the title: when we have already used trained deep learning models to encode some text columns and image models to encode image columns, obtaining their respective embedding vectors, how do we handle such data in the development process of pytorch-widedeep? I am very much looking forward to someone being able to answer my question, thank you!!!

jrzaurin commented 3 months ago

Hey @stoensin

thanks for the comments 😊.

Treat them all as continuous cols

import pandas as pd

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

if __name__ == "__main__":

    # your dataframe with 100s of continuous/numerical columns
    df = pd.DataFrame(...)

    target_col = "your_target_col"

    # every column except the target is treated as continuous
    continuous_cols = [c for c in df.columns if c != target_col]
    target = df[target_col].values

    # Explore the other params if you want
    tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols)
    X_tab = tab_preprocessor.fit_transform(df)

    # we could use a "simple" MLP if all we have is a table
    tab_mlp = TabMlp(
        column_idx=tab_preprocessor.column_idx,
        continuous_cols=continuous_cols,
        # ..., plus any other params you might want
    )

    # proceed as in the examples (as usual)
    ...

stoensin commented 3 months ago

Thank you for your reply. When I use the method you suggested, torch raises an error: "TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object". The picture below is my input data:

The first three fields have been processed by a trained embedding model. The dimension of each vector is 512.

Is there a way in pytorch_widedeep to directly apply these vectorized features?

jrzaurin commented 3 months ago

Can you please paste here a small sample of the dataframe? or a dataframe that would be equivalent and you have no problem sharing?

jrzaurin commented 3 months ago

in the meantime here you have a fully functional example with a df that has all numerical columns apart from a binary target

import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

use_cuda = torch.cuda.is_available()

if __name__ == "__main__":

    np.random.seed(42)

    data = np.random.randn(100, 10)

    target = np.random.randint(0, 2, 100)

    # Combine the data into a DataFrame
    df = pd.DataFrame(data, columns=[f"col_{i+1}" for i in range(10)])

    # Add the target column
    df["target"] = target

    continuous_cols = [f"col_{i+1}" for i in range(10)]
    target = "target"
    target = df[target].values

    tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols)
    X_tab = tab_preprocessor.fit_transform(df)

    tab_mlp = TabMlp(
        column_idx=tab_preprocessor.column_idx,
        continuous_cols=continuous_cols,
        mlp_hidden_dims=[32, 16],
    )

    model = WideDeep(deeptabular=tab_mlp)

    trainer = Trainer(
        model,
        objective="binary",
        metrics=[Accuracy],
    )

    trainer.fit(
        X_tab=X_tab,
        target=target,
        n_epochs=2,
        batch_size=16,
        val_split=0.2,
    )

stoensin commented 3 months ago

test (1).csv this is my test data.

and the error is :

jrzaurin commented 3 months ago

what machine are you using?

I will look into this later, in the meantime, can you try setting num_workers=1 when you instantiate the Trainer?

trainer = Trainer(
    model,
    objective="binary",
    metrics=[Accuracy],
    num_workers=1,
)

stoensin commented 3 months ago

my machine is Linux h3c 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Ubuntu 22.04

After setting trainer = Trainer(model, objective="rmse", num_workers=1), I get the same error:

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 154, in collate
    clone.update({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 154, in <dictcomp>
    clone.update({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 141, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 220, in collate_numpy_array_fn
    raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 316, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 173, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 173, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 161, in collate
    return {key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem}
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 161, in <dictcomp>
    return {key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem}
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 141, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 220, in collate_numpy_array_fn
    raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

jrzaurin commented 3 months ago

@stoensin

In your dataframe, the first 3 columns are strings when you read them with pd.read_csv. Each value is meant to be a list of numbers, but once pandas reads it you will see that it is not even in the right format:

>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> type(df.categories.tolist()[0])
<class 'str'>
>>> df.categories.tolist()[0]
'[ 0.40317503  0.62021637  0.5523216  ... -0.00519936 -0.21220715\n -0.63832164]'

note the 3 dots and the newline before the last number.
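A quick way to confirm this kind of problem, sketched on a tiny made-up frame rather than the real test.csv: look for columns whose cells are all strings, and flag those where a literal `...` appears, since that means the middle of the vector was elided and cannot be recovered from the file. The helper `find_stringified_vectors` and the `title` and `price` columns are hypothetical; only `categories` comes from the output above.

```python
import pandas as pd


def find_stringified_vectors(df: pd.DataFrame) -> dict:
    """Map each all-string column to whether any of its cells contains '...'."""
    report = {}
    for col in df.columns:
        values = df[col]
        # Only columns where every cell is a Python string are of interest
        if values.map(lambda v: isinstance(v, str)).all():
            report[col] = bool(values.str.contains("...", regex=False).any())
    return report


# Toy stand-in for the CSV contents (assumed shape, not the real data)
df = pd.DataFrame(
    {
        "categories": ["[ 0.40317503  0.62021637 ... -0.21220715\n -0.63832164]"],
        "title": ["[0.1 0.2 0.3]"],
        "price": [9.99],
    }
)

report = find_stringified_vectors(df)
print(report)
```

Any column that shows up in the report is being handed to torch as Python objects rather than numbers, which is consistent with the "found object" message in the traceback.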

You need to turn each of these numbers into a column (one column per embedding dimension). Fix this and let me know if it works.
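One way to do that, assuming the embeddings are still available as full numeric vectors (for example regenerated from the embedding model, since the truncated strings in the CSV no longer hold every value): stack each vector column into a 2-D array and give every dimension its own column. The helper `expand_vector_column`, the `price` column, and the 4-dimensional toy vectors (standing in for the 512-dimensional ones) are illustrative assumptions and not part of pytorch-widedeep.

```python
import numpy as np
import pandas as pd


def expand_vector_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace a column of equal-length vectors with one float column per dimension."""
    matrix = np.stack(df[col].tolist()).astype("float32")
    names = [f"{col}_{i}" for i in range(matrix.shape[1])]
    expanded = pd.DataFrame(matrix, columns=names, index=df.index)
    return pd.concat([df.drop(columns=col), expanded], axis=1)


rng = np.random.default_rng(0)

# Toy frame: one embedding column (as lists of floats) plus one plain numeric column
df = pd.DataFrame(
    {
        "categories": [rng.standard_normal(4).tolist() for _ in range(3)],
        "price": [1.0, 2.0, 3.0],
    }
)

df = expand_vector_column(df, "categories")

# Every remaining column is numeric and can be listed as continuous
continuous_cols = list(df.columns)
print(df.dtypes)
```

The resulting columns can then be passed as continuous_cols to TabPreprocessor, as in the example earlier in the thread. Storing the expanded frame with one number per cell also avoids the truncated string representation when the file is read back.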

jrzaurin / pytorch-widedeep

How to handle inputs in the pytorch-widedeep library if the dataset already contains vectorized encodings for text and image columns? #223