irecsys / DeepCARSKit

A Deep Learning Based Context-Aware Recommendation Library
https://carskit.github.io/
MIT License

Error when running run.py script #3

Closed AnuragAnalog closed 1 year ago

AnuragAnalog commented 1 year ago

I cloned the repository to test the code and followed the instructions in the README file, but I am getting an error. (I cloned the repo one day before posting this issue, so you can identify the exact version needed to reproduce the error.)

Steps to reproduce Error

Error is at the end of the bash area

Some additional information about the hardware and software

Software

Hardware

Error

GPU availability:  True
Num of GPU:  1
NVIDIA A2
Current GPU index:  0

18 Feb 12:52    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2022
state = INFO
reproducibility = True
data_path = dataset/tripadvisor
checkpoint_dir = saved
show_progress = False
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 50
train_batch_size = 500
learner = adam
learning_rate = 0.01
train_neg_sample_args = {'distribution': 'none', 'sample_num': 'none', 'alpha': 'none', 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}
repeatable = False
metrics = ['MAE', 'RMSE', 'AUC']
topk = [10, 20, 30]
valid_metric = MAE
valid_metric_bigger = False
eval_batch_size = 409600
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = ,
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = {'rating': 0}
NEG_PREFIX = neg_
load_col = None
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = None

Other Hyper Parameters: 
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = None
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0,inf)
relation_kg_num_interval = [0,inf)
MODEL_TYPE = ModelType.CONTEXT
CONTEXT_SITUATION_FIELD = contexts
USER_CONTEXT_FIELD = uc_id
neg_sampling = None
mf_embedding_size = 64
mlp_embedding_size = 64
mlp_hidden_size = [128, 64, 32]
dropout_prob = 0.1
mf_train = True
mlp_train = True
embedding_size = 64
ranking = False
sigmoid = False
ranking_valid_metric = Recall@10
ranking_metrics = ['Precision', 'Recall', 'NDCG', 'MRR', 'MAP']
err_valid_metric = MAE
err_metrics = ['MAE', 'RMSE', 'AUC']
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.VALUE
single_spec = True
local_rank = 0
device = cuda
eval_neg_sample_args = {'distribution': 'none', 'sample_num': 'none'}

18 Feb 12:52    INFO  tripadvisor
The number of users: 2372
Average actions of users: 5.978490088570224
The number of items: 2270
Average actions of items: 6.24724548259145
The number of inters: 14175
The sparsity of the dataset: 99.73674142529214%
Remain Fields: ['user_id', 'item_id', 'rating', 'trip', 'contexts', 'uc_id']
Context dimension - trip: 6 values: : ['BUSINESS' 'COUPLES' 'FAMILY' 'FRIENDS' 'SOLO' '[PAD]']
Traceback (most recent call last):
  File "/scratch/apeddi/DeepCARSKit/run.py", line 32, in <module>
    run(config_file_list=config_list)
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/quick_start/quick_start.py", line 96, in run
    train_data, valid_data = data_preparation(config, dataset)
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/data/utils.py", line 132, in data_preparation
    train_sampler, valid_sampler = create_samplers(config, dataset, built_datasets[fold])
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/data/utils.py", line 301, in create_samplers
    if train_neg_sample_args['strategy'] != 'none':
KeyError: 'strategy'

@irecsys Could you please help me in resolving this error?

irecsys commented 1 year ago

This error is caused by your RecBole version. The latest RecBole is no longer compatible with DeepCARSKit. You should uninstall your RecBole library and install RecBole v1.0.0 instead.
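The logs in this thread show the mismatch directly: the failing run prints a `train_neg_sample_args` dictionary with no `'strategy'` key, while the later successful run prints `{'strategy': 'none'}`. The sketch below repeats the lookup from the traceback against both dictionaries. The helper name `uses_negative_sampling` is hypothetical and only mirrors the condition shown at `deepcarskit/data/utils.py`, line 301.

```python
# Values copied from the two "Hyper Parameters" logs in this thread.
failing_run_args = {'distribution': 'none', 'sample_num': 'none',
                    'alpha': 'none', 'dynamic': False, 'candidate_num': 0}
working_run_args = {'strategy': 'none'}


def uses_negative_sampling(train_neg_sample_args):
    # Hypothetical helper: same lookup as the line quoted in the traceback.
    return train_neg_sample_args['strategy'] != 'none'


print(uses_negative_sampling(working_run_args))  # False

try:
    uses_negative_sampling(failing_run_args)
except KeyError as err:
    # The key is simply absent from the dictionary of the failing run.
    print("KeyError:", err)  # KeyError: 'strategy'
```

Because the key is missing rather than set to a different value, the failure happens before any sampler is created, which matches the traceback ending in `create_samplers`.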

AnuragAnalog commented 1 year ago

Yes. Along with downgrading RecBole, I installed every module at the lowest version allowed by the requirements file.

And I can run it now.
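One way to confirm which RecBole version an environment actually has before running `run.py` is to query the installed package metadata. This is a minimal sketch, assuming the distribution is installed under the name `recbole` and that 1.0.0 (the version named above) is the target. The helper names are illustrative and not part of DeepCARSKit.

```python
from importlib import metadata

TARGET_VERSION = "1.0.0"  # version named in the maintainer's reply


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_target(version, target=TARGET_VERSION):
    """True only when a version is installed and equals the target exactly."""
    return version is not None and version == target


found = installed_version("recbole")  # assumed distribution name
if matches_target(found):
    print(f"recbole {found} matches the target")
else:
    print(f"recbole is {found!r}; expected {TARGET_VERSION}")
```

If the check fails, the downgrade is typically done with `pip uninstall recbole` followed by `pip install recbole==1.0.0`.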

AnuragAnalog commented 1 year ago

After installing all the third-party libraries, I get a different error when I rerun the script.

Error

GPU availability:  True
Num of GPU:  1
Tesla T4
Current GPU index:  0
20 Feb 12:00    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2022
state = INFO
reproducibility = True
data_path = dataset/tripadvisor
show_progress = False
save_dataset = False
save_dataloaders = False
benchmark_filename = None

Training Hyper Parameters:
checkpoint_dir = saved
epochs = 50
train_batch_size = 500
learner = adam
learning_rate = 0.01
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}
metrics = ['MAE', 'RMSE', 'AUC']
topk = [10, 20, 30]
valid_metric = MAE
valid_metric_bigger = False
eval_batch_size = 409600
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = ,
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = {'rating': 0}
NEG_PREFIX = neg_
load_col = None
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id

Other Hyper Parameters: 
neg_sampling = None
repeatable = False
MODEL_TYPE = ModelType.CONTEXT
CONTEXT_SITUATION_FIELD = contexts
USER_CONTEXT_FIELD = uc_id
mf_embedding_size = 64
mlp_embedding_size = 64
mlp_hidden_size = [128, 64, 32]
dropout_prob = 0.1
mf_train = True
mlp_train = True
embedding_size = 64
ranking = False
sigmoid = False
ranking_valid_metric = Recall@10
ranking_metrics = ['Precision', 'Recall', 'NDCG', 'MRR', 'MAP']
err_valid_metric = MAE
err_metrics = ['MAE', 'RMSE', 'AUC']
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.VALUE
device = cuda
train_neg_sample_args = {'strategy': 'none'}
eval_neg_sample_args = {'strategy': 'none', 'distribution': 'none'}

20 Feb 12:00    INFO  tripadvisor
The number of users: 2372
Average actions of users: 5.978490088570224
The number of items: 2270
Average actions of items: 6.24724548259145
The number of inters: 14175
The sparsity of the dataset: 99.73674142529214%
Remain Fields: ['user_id', 'item_id', 'rating', 'trip', 'contexts', 'uc_id']
Context dimension - trip: 6 values: : ['BUSINESS' 'COUPLES' 'FAMILY' 'FRIENDS' 'SOLO' '[PAD]']
20 Feb 12:00    INFO  [Training]: train_batch_size = [500] negative sampling: [None]
20 Feb 12:00    INFO  [Evaluation]: eval_batch_size = [409600] eval_args: [{'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}]
20 Feb 12:00    INFO  [Training]: train_batch_size = [500] negative sampling: [None]
20 Feb 12:00    INFO  [Evaluation]: eval_batch_size = [409600] eval_args: [{'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}]
20 Feb 12:00    INFO  [Training]: train_batch_size = [500] negative sampling: [None]
20 Feb 12:00    INFO  [Evaluation]: eval_batch_size = [409600] eval_args: [{'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}]
20 Feb 12:00    INFO  [Training]: train_batch_size = [500] negative sampling: [None]
20 Feb 12:00    INFO  [Evaluation]: eval_batch_size = [409600] eval_args: [{'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}]
20 Feb 12:00    INFO  [Training]: train_batch_size = [500] negative sampling: [None]
20 Feb 12:00    INFO  [Evaluation]: eval_batch_size = [409600] eval_args: [{'split': {'CV': 5}, 'group_by': 'user', 'mode': 'labeled', 'order': 'RO'}]
20 Feb 12:00    INFO  Loaded context variables: trip, with context situation ID: contexts
20 Feb 12:00    INFO  Loaded context variables: trip, with context situation ID: contexts
20 Feb 12:00    INFO  Loaded context variables: trip, with context situation ID: contexts
20 Feb 12:00    INFO  Loaded context variables: trip, with context situation ID: contexts
20 Feb 12:00    INFO  Loaded context variables: trip, with context situation ID: contexts
20 Feb 12:00    INFO  epoch 0 training [time: 0.39s, train loss: 110.3815]
20 Feb 12:00    INFO  epoch 0 evaluating [time: 0.00s, valid_score: 1.026400]
20 Feb 12:00    INFO  valid result: 
mae : 1.0264    rmse : 1.2131    auc : 0.4726    
20 Feb 12:00    INFO  Saving current best: saved/NeuCMFii-Feb-20-2023_12-00-04_f5.pth
20 Feb 12:00    INFO  epoch 1 training [time: 0.06s, train loss: 25.2800]
20 Feb 12:00    INFO  epoch 1 evaluating [time: 0.00s, valid_score: 0.792700]
20 Feb 12:00    INFO  valid result: 
mae : 0.7927    rmse : 1.0028    auc : 0.4791    
20 Feb 12:00    INFO  Saving current best: saved/NeuCMFii-Feb-20-2023_12-00-04_f5.pth
20 Feb 12:00    INFO  epoch 2 training [time: 0.06s, train loss: 17.4006]
20 Feb 12:00    INFO  epoch 2 evaluating [time: 0.00s, valid_score: 0.876600]
20 Feb 12:00    INFO  valid result: 
mae : 0.8766    rmse : 1.0998    auc : 0.4543    
20 Feb 12:00    INFO  epoch 3 training [time: 0.06s, train loss: 13.7346]
20 Feb 12:00    INFO  epoch 3 evaluating [time: 0.00s, valid_score: 0.889200]
20 Feb 12:00    INFO  valid result: 
mae : 0.8892    rmse : 1.1228    auc : 0.4588    
20 Feb 12:00    INFO  epoch 4 training [time: 0.06s, train loss: 11.1480]
20 Feb 12:00    INFO  epoch 4 evaluating [time: 0.00s, valid_score: 0.913700]
20 Feb 12:00    INFO  valid result: 
mae : 0.9137    rmse : 1.16    auc : 0.4764    
20 Feb 12:00    INFO  epoch 5 training [time: 0.06s, train loss: 8.5669]
20 Feb 12:00    INFO  epoch 5 evaluating [time: 0.00s, valid_score: 0.958200]
20 Feb 12:00    INFO  valid result: 
mae : 0.9582    rmse : 1.2061    auc : 0.4538    
20 Feb 12:00    INFO  epoch 6 training [time: 0.06s, train loss: 6.1607]
20 Feb 12:00    INFO  epoch 6 evaluating [time: 0.00s, valid_score: 0.951300]
20 Feb 12:00    INFO  valid result: 
mae : 0.9513    rmse : 1.2087    auc : 0.4664    
20 Feb 12:00    INFO  epoch 7 training [time: 0.06s, train loss: 5.1433]
20 Feb 12:00    INFO  epoch 7 evaluating [time: 0.00s, valid_score: 0.960300]
20 Feb 12:00    INFO  valid result: 
mae : 0.9603    rmse : 1.2121    auc : 0.4661    
20 Feb 12:00    INFO  epoch 8 training [time: 0.06s, train loss: 4.5258]
20 Feb 12:00    INFO  epoch 8 evaluating [time: 0.00s, valid_score: 0.976300]
20 Feb 12:00    INFO  valid result: 
mae : 0.9763    rmse : 1.2282    auc : 0.477    
20 Feb 12:00    INFO  epoch 9 training [time: 0.06s, train loss: 3.9742]
20 Feb 12:00    INFO  epoch 9 evaluating [time: 0.00s, valid_score: 0.992400]
20 Feb 12:00    INFO  valid result: 
mae : 0.9924    rmse : 1.2375    auc : 0.4727    
20 Feb 12:00    INFO  epoch 10 training [time: 0.06s, train loss: 3.5397]
20 Feb 12:00    INFO  epoch 10 evaluating [time: 0.00s, valid_score: 0.991200]
20 Feb 12:00    INFO  valid result: 
mae : 0.9912    rmse : 1.2424    auc : 0.4949    
20 Feb 12:00    INFO  epoch 11 training [time: 0.06s, train loss: 3.4084]
20 Feb 12:00    INFO  epoch 11 evaluating [time: 0.00s, valid_score: 0.988300]
20 Feb 12:00    INFO  valid result: 
mae : 0.9883    rmse : 1.2413    auc : 0.4773    
20 Feb 12:00    INFO  epoch 12 training [time: 0.06s, train loss: 3.0196]
20 Feb 12:00    INFO  epoch 12 evaluating [time: 0.00s, valid_score: 0.961800]
20 Feb 12:00    INFO  valid result: 
mae : 0.9618    rmse : 1.2139    auc : 0.4782    
20 Feb 12:00    INFO  Finished training, best eval result in epoch 1
20 Feb 12:00    INFO  Fold 5 completed: : {'mae': 0.7927, 'rmse': 1.0028, 'auc': 0.4791}
Traceback (most recent call last):
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 192, in makedirs
    os.makedirs(path)
  File "/opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/python-3.9.9-jh/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'log_tensorboard/tripadvisor-NeuCMFii-Feb-20-2023_12-00-02'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/apeddi/DeepCARSKit/run.py", line 32, in <module>
    run(config_file_list=config_list)
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/quick_start/quick_start.py", line 111, in run
    rsts = pool.map(eval_folds, list_train_test)
  File "/opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/python-3.9.9-jh/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/python-3.9.9-jh/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/python-3.9.9-jh/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/python-3.9.9-jh/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/quick_start/quick_start.py", line 50, in eval_folds
    trainer = get_trainer(config['MODEL_TYPE'], config['model'])(config, model)
  File "/scratch/apeddi/DeepCARSKit/deepcarskit/trainer/trainer.py", line 39, in __init__
    super(CARSTrainer, self).__init__(config, model)
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/recbole/trainer/trainer.py", line 80, in __init__
    self.tensorboard = get_tensorboard(self.logger)
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/recbole/utils/utils.py", line 219, in get_tensorboard
    writer = SummaryWriter(dir_path)
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 247, in __init__
    self._get_file_writer()
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 277, in _get_file_writer
    self.file_writer = FileWriter(
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 76, in __init__
    self.event_writer = EventFileWriter(
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 73, in __init__
    tf.io.gfile.makedirs(logdir)
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 665, in makedirs
    return get_filesystem(path).makedirs(path)
  File "/scratch/apeddi/cars/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 194, in makedirs
    raise errors.AlreadyExistsError(
tensorboard.compat.tensorflow_stub.errors.AlreadyExistsError: Directory already exists

The script occasionally runs without any error, but most of the time it throws this one.

I have tried deleting the directory log_tensorboard from the current directory and rerunning it, but the error persists.

I have also traced through the code, but I couldn't find where the results are logged to this directory (so that I could change the filename format to avoid the error).

@irecsys Could you please help me with the issue?
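The traceback points to where the directory is created: `get_tensorboard` in the installed RecBole package (`recbole/utils/utils.py`, line 219), reached from `eval_folds` inside `pool.map`. One plausible reading, not a confirmed diagnosis, is a race between pool workers. Each fold constructs its own trainer, and all of them appear to target the same `log_tensorboard/tripadvisor-NeuCMFii-Feb-20-2023_12-00-02` directory, so a worker that calls `os.makedirs` just after another has created it gets `FileExistsError`. That would also explain why the failure is intermittent and why deleting `log_tensorboard` beforehand does not help. The sketch below uses only the standard library to show the failing pattern and two common ways around it. `per_fold_log_dir` is a hypothetical naming scheme, modelled on the `_f5` suffix visible in the checkpoint filename.

```python
import os
import tempfile


def racy_makedirs(path):
    # Check-then-create: another process can create `path` between the
    # check and the call, in which case os.makedirs raises FileExistsError.
    if not os.path.exists(path):
        os.makedirs(path)


def idempotent_makedirs(path):
    # exist_ok=True makes creation safe when the directory already exists.
    os.makedirs(path, exist_ok=True)


def per_fold_log_dir(base_dir, fold):
    # Hypothetical scheme: one directory per fold, so workers never collide.
    return f"{base_dir}_f{fold}"


with tempfile.TemporaryDirectory() as root:
    shared = os.path.join(root, "tripadvisor-NeuCMFii-Feb-20-2023_12-00-02")
    os.makedirs(shared)              # first worker wins
    try:
        os.makedirs(shared)          # second worker, same name
    except FileExistsError:
        print("plain makedirs fails on an existing directory")
    idempotent_makedirs(shared)      # no error
    print(os.path.basename(per_fold_log_dir(shared, 5)))
```

Since the failing call sits inside the TensorBoard and RecBole packages rather than in DeepCARSKit, applying either idea would mean changing how the log directory name is chosen per fold, or running the folds sequentially instead of in a pool.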