gsbDBI / torch-choice

Choice modeling with PyTorch: logit model and nested logit model
MIT License

getting empty dataset.x_dict #44

Closed akshitabhargava closed 6 months ago

akshitabhargava commented 6 months ago

Hi, I am using the snippet below to create a dataset. However, when I print dataset.x_dict, it comes out as an empty dictionary. This is unlike the notebook that I followed, where dataset.x_dict seems to be created automatically.

item_index = df[df['choice'] == 1].sort_values(by='case')['alt'].reset_index(drop=True)
print(item_index)
item_names = ['BRAND_1', 'BRAND_2', 'BRAND_3', 'BRAND_4']
num_items = 4
encoder = dict(zip(item_names, range(num_items)))
print(f"{encoder=:}")
item_index = item_index.map(lambda x: encoder[x])
item_index = torch.LongTensor(item_index)
print(f"{item_index=:}")

cost = utils.pivot3d(df, dim0='case', dim1='alt', values='cost')
msrp = utils.pivot3d(df, dim0='case', dim1='alt', values='msrp')

dataset = ChoiceDataset(item_index=item_index,
                        cost_data=cost,
                        msrp_data=msrp,
                        ).to(device)

This leads to the error below when I run the code:

KeyError                                  Traceback (most recent call last)
in
      1 start_time = time()
----> 2 run(model, dataset, num_epochs=500, dataset_test=None, batch_size=-1, learning_rate=0.01, model_optimizer="Adam")
      3 print('Time taken:', time() - start_time)

30 frames
/usr/local/lib/python3.8/dist-packages/torch_choice/model/conditional_logit_model.py in forward(self, batch, manual_coef_value_dict)
    267             corresponding_observable = var_type.split("[")[0]
    268             total_utility += coef(
--> 269                 x_dict[corresponding_observable],
    270                 batch.user_index,
    271                 manual_coef_value=None if manual_coef_value_dict is None else manual_coef_value_dict[var_type])

KeyError: 'cost_data'

PS: Along with the notebook mentioned above, I was also following your official documentation step by step, and I am not able to find what I am missing.

Can you please help me with this issue?

Thanks!

shashnkvats commented 6 months ago

Facing the same issue. Any update?

TianyuDu commented 6 months ago

Hey, thank you for spotting this.

The issue was caused by the names of the keyword arguments you passed when constructing the dataset. Since I don't have the MSRP variable in my example dataset, I will use the cost variable to illustrate. Here, cost is an (item, session)-specific variable. When you build the ChoiceDataset, you need to give the cost variable a name starting with itemsession_ or sessionitem_. The prefix registers the variable as a covariate/feature in the dataset and tells the dataset the variable's level of variation (e.g., item-specific, user-specific, item-session-specific, etc.).

# This is what you did:
dataset = ChoiceDataset(item_index=item_index,
                        cost_data=cost,  # <--- change to `itemsession_cost_data=cost`
                        msrp_data=msrp,  # <--- likewise, change to `itemsession_msrp_data=msrp`
                        ).to(device)

For a complete running example:

import pandas as pd
import torch
from torch_choice.data import ChoiceDataset, utils

if __name__ == "__main__":
    device = "cpu"
    df = pd.read_csv('./tutorials/public_datasets/ModeCanada.csv')  # <-- You may need to change the path depending on where you execute your code.
    df = df.query('noalt == 4').reset_index(drop=True)
    df.sort_values(by='case', inplace=True)
    df.head()

    item_index = df[df['choice'] == 1].sort_values(by='case')['alt'].reset_index(drop=True)
    print(item_index)
    item_names = ['train', 'car', 'bus', 'air']  # <-- I changed this to fit the data I am using.
    num_items = 4
    encoder = dict(zip(item_names, range(num_items)))
    print(f"{encoder=:}")
    item_index = item_index.map(lambda x: encoder[x])
    item_index = torch.LongTensor(item_index)
    print(f"{item_index=:}")

    cost = utils.pivot3d(df, dim0='case', dim1='alt', values='cost')
    # msrp = utils.pivot3d(df, dim0='case', dim1='alt', values='msrp')  # <-- I removed this since there is no MSRP in this example dataset.

    dataset = ChoiceDataset(
        item_index=item_index,
        itemsession_cost_data=cost,
        # msrp_data=msrp,
    ).to(device)
    print(dataset.x_dict.keys())
    print(dataset.x_dict['itemsession_cost_data'].shape)  # should print torch.Size([4324, 4, 1]), where df['case'].nunique() = 4324.

TianyuDu commented 6 months ago

For technical details, if you look into the ChoiceDataset.x_dict property, the out dictionary it returns only includes elements from self.__dict__.items() whose name (i.e., the key) passes the test self._is_attribute(key). The _is_attribute method checks whether the key corresponds to a variable that will enter the regression model (e.g., the cost and MSRP in your example). It does so simply by checking whether the key name starts with a certain pattern. For example, if key = item_SOMETHING, the dataset recognizes it as an item-specific variable; if key = itemsession_SOMETHING or sessionitem_SOMETHING, the dataset recognizes it as an item-session-specific variable. Only keys recognized this way are included in the out dictionary returned.
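The filtering described above can be pictured with a small stand-in class. This is a simplified sketch, not torch-choice's actual source: `ToyDataset`, its `PREFIXES` and `RESERVED` constants, and the use of plain lists instead of tensors are assumptions made for illustration. Only the names `x_dict`, `_is_attribute`, and the `item_` / `itemsession_` / `sessionitem_` prefixes come from the explanation above.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics only the prefix-based filtering."""

    # Prefixes named in this thread; the real library may recognize more.
    PREFIXES = ("item_", "itemsession_", "sessionitem_")
    # Assumption: bookkeeping fields such as item_index are not covariates,
    # even though the name happens to start with "item_".
    RESERVED = {"item_index"}

    def __init__(self, item_index, **kwargs):
        self.item_index = item_index
        # Every extra keyword argument is stored as a plain attribute.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def _is_attribute(self, key):
        # A key counts as a model variable only if it starts with a known
        # prefix and is not a reserved bookkeeping field.
        return key.startswith(self.PREFIXES) and key not in self.RESERVED

    @property
    def x_dict(self):
        # Keep only the attributes whose names pass the prefix test.
        return {k: v for k, v in self.__dict__.items() if self._is_attribute(k)}


cost = [[1.0, 2.0], [3.0, 4.0]]

bad = ToyDataset(item_index=[0, 1], cost_data=cost)
good = ToyDataset(item_index=[0, 1], itemsession_cost_data=cost)

print(bad.x_dict)         # {}
print(list(good.x_dict))  # ['itemsession_cost_data']
```

In this toy version, `cost_data` is stored on the object but silently dropped from `x_dict`, which mirrors the empty dictionary reported in this issue.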

In your code, the keys were cost_data and msrp_data, and the dataset unfortunately failed to recognize them as attributes/variables for the model. You would need to change them to names like itemsession_cost_data and itemsession_msrp_data to tell the model that these are variables you use in the regression model.
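One way to catch this kind of naming mistake before building the dataset is to check the keyword names up front. The helper below is a hypothetical sketch, not part of torch-choice: `unrecognized_names` and `KNOWN_PREFIXES` are made-up names, and the prefix tuple lists only the prefixes mentioned in this thread, so it would need extending for other variable types.

```python
# Prefixes mentioned in this thread (assumption: not an exhaustive list).
KNOWN_PREFIXES = ("item_", "itemsession_", "sessionitem_")


def unrecognized_names(names, prefixes=KNOWN_PREFIXES):
    """Return the names that do not start with any of the given prefixes."""
    return [name for name in names if not name.startswith(prefixes)]


# Placeholder values stand in for the tensors returned by utils.pivot3d.
observables = {"cost_data": None, "msrp_data": None}
print(unrecognized_names(observables))  # ['cost_data', 'msrp_data']

# Prepend the item-session prefix so the dataset would recognize the names.
renamed = {f"itemsession_{name}": value for name, value in observables.items()}
print(unrecognized_names(renamed))      # []
```

With real tensors in place of the placeholders, the renamed dictionary could then be unpacked into the constructor, e.g. `ChoiceDataset(item_index=item_index, **renamed)`.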