Closed stoensin closed 3 months ago
Hey @stoensin
thanks for the comments 😊.
Treat them all as continuous columns:
import pandas as pd

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

if __name__ == "__main__":
    # your dataframe with 100s of continuous/numerical columns
    df = pd.DataFrame(...)

    target_col = "your_target_col"
    # every column except the target is treated as continuous
    continuous_cols = [c for c in df.columns if c != target_col]
    target = df[target_col].values

    # Explore the other params if you want
    tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols)
    X_tab = tab_preprocessor.fit_transform(df)

    # we could use a "simple" MLP if all we have is a table
    tab_mlp = TabMlp(
        column_idx=tab_preprocessor.column_idx,
        continuous_cols=continuous_cols,
        # ... the params you might want
    )

    # proceed as in the examples (as usual)
    ...
Thank you for your reply. When I use the method you suggested, torch raises the following error: "TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object". The picture below shows my input data:
The first three fields have been processed by a trained embedding model. The dimension of each vector is 512.
Is there a way in pytorch_widedeep to directly apply these vectorized features?
Can you please paste a small sample of the dataframe here, or an equivalent dataframe that you have no problem sharing?
In the meantime, here is a fully functional example with a df that has all numerical columns apart from a binary target:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

use_cuda = torch.cuda.is_available()

if __name__ == "__main__":
    np.random.seed(42)
    data = np.random.randn(100, 10)
    target = np.random.randint(0, 2, 100)

    # Combine the data into a DataFrame
    df = pd.DataFrame(data, columns=[f"col_{i+1}" for i in range(10)])

    # Add the target column
    df["target"] = target

    continuous_cols = [f"col_{i+1}" for i in range(10)]
    target = "target"
    target = df[target].values

    tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols)
    X_tab = tab_preprocessor.fit_transform(df)

    tab_mlp = TabMlp(
        column_idx=tab_preprocessor.column_idx,
        continuous_cols=continuous_cols,
        mlp_hidden_dims=[32, 16],
    )

    model = WideDeep(deeptabular=tab_mlp)

    trainer = Trainer(
        model,
        objective="binary",
        metrics=[Accuracy],
    )

    trainer.fit(
        X_tab=X_tab,
        target=target,
        n_epochs=2,
        batch_size=16,
        val_split=0.2,
    )
test (1).csv: this is my test data.
And the error is:
What machine are you using?
I will look into this later. In the meantime, can you try setting num_workers=1 when you instantiate the Trainer?
trainer = Trainer(
    model,
    objective="binary",
    metrics=[Accuracy],
    num_workers=1,
)
My machine is Linux h3c 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 22.04
After setting trainer = Trainer(model, objective="rmse", num_workers=1), I get the same error:
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 154, in collate
clone.update({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 154, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 316, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 173, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/_utils/collate.py", line 173, in
@stoensin
In your dataframe, the first three columns are strings when you read them with pd.read_csv. Each cell is meant to be a list of numbers, but when read by pandas you will see that it is not even in the right format:
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> type(df.categories.tolist()[0])
<class 'str'>
>>> df.categories.tolist()[0]
'[ 0.40317503 0.62021637 0.5523216 ... -0.00519936 -0.21220715\n -0.63832164]'
Note the three dots (...) and the newline (\n) before the last number: the string does not even contain all of the values.
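To see why such columns arrive as strings, here is a minimal sketch that writes a small dataframe with array-valued cells to CSV and reads it back. The column names (`emb`, `x`) and the toy values are made up for illustration; the only assumption is that pandas and numpy are installed.

```python
import io

import numpy as np
import pandas as pd

# Toy dataframe: one column holds a numpy vector per row, the other a plain float
df = pd.DataFrame(
    {
        "emb": [np.array([0.1, 0.2]), np.array([0.3, 0.4])],
        "x": [1.0, 2.0],
    }
)

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip, each vector is just the text of its printed form
first_cell = loaded["emb"].iloc[0]
print(type(first_cell), repr(first_cell))

# Any column listed here is not numeric and cannot be used as a continuous column as-is
non_numeric_cols = loaded.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Running a check like the last two lines on your own dataframe before preprocessing is a quick way to spot the columns that need to be fixed.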
You need to turn each of these numbers into its own column. Fix this and let me know if it works.
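One possible way to do this is sketched below. It assumes the embeddings are still available in memory as fixed-length numeric vectors (the strings in the CSV above are truncated, so they cannot be parsed back). `expand_embedding_column` is a hypothetical helper written for this example, not part of pytorch_widedeep, and the column names and the 3-dimensional toy vectors stand in for the real 512-dimensional ones.

```python
import numpy as np
import pandas as pd


def expand_embedding_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace `col` (one fixed-length vector per row) with one float column per dimension."""
    # Stack the per-row vectors into a 2D array of shape (n_rows, emb_dim)
    matrix = np.vstack(df[col].tolist()).astype("float32")
    new_cols = [f"{col}_{i}" for i in range(matrix.shape[1])]
    expanded = pd.DataFrame(matrix, columns=new_cols, index=df.index)
    # Drop the original vector column and append the expanded numeric columns
    return pd.concat([df.drop(columns=[col]), expanded], axis=1)


# Toy data: 4 rows, one 3-dimensional embedding column plus one ordinary numeric column
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "text_emb": [rng.normal(size=3) for _ in range(4)],
        "price": [1.0, 2.0, 3.0, 4.0],
    }
)

flat_df = expand_embedding_column(df, "text_emb")

# Every column is now numeric, so all of them can be listed as continuous columns
continuous_cols = list(flat_df.columns)
print(flat_df.dtypes)
```

The resulting `continuous_cols` list can then be passed to `TabPreprocessor(continuous_cols=continuous_cols)` as in the examples earlier in this thread, after adding the target column back to the dataframe if needed.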
Thanks to all the developers for their tremendous contributions; this open-source project is fantastic.
As mentioned in the title: when we have already used trained deep learning models to encode some text columns and image models to encode image columns, obtaining their respective embedding vectors, how do we handle such data in the development process of pytorch-widedeep? I am very much looking forward to someone being able to answer my question, thank you!!!