intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

File NotFoundError for Wechat Baseline Model Using AutoSearch #229

Closed Elena-Qiu closed 3 years ago

Elena-Qiu commented 3 years ago

In the WeChat baseline model using AutoSearch, I added crossed columns to the search space, implemented as follows:

def get_search_space():
    import numpy as np  # used by the crossed_columns sampler below
    from zoo.orca.automl import hp
    cross_column_candidates = []
    cross_column_candidates.append(["userid", "bgm_singer_id", "bgm_song_id"])
    cross_column_candidates.append(["userid", "authorid"])
    cross_column_candidates.append(["userid", "feedid"])
    cross_column_candidates.append(["authorid", "bgm_singer_id", "bgm_song_id"])
    cross_column_candidates.append(["feedid", "bgm_singer_id", "bgm_song_id"])
    cross_column_candidates.append(["feedid", "authorid"])
    return {
        'embed_dim': hp.choice([20, 40, 80]),
        'embed_l2': None,
        'learning_rate': hp.choice([0.001, 0.01, 0.05, 0.1]),
        "batch_size": hp.choice([64, 128, 256]),
        "crossed_columns": hp.sample_from(lambda :
                                            list(np.random.choice(
                                                cross_column_candidates,
                                                size=np.random.randint(
                                                    low=3,
                                                    high=len(cross_column_candidates)),
                                                replace=False)))
    }

The trial directory names under auto_wnd then become too long and are truncated, for example:

/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56

I then get the following error:

INFO:tensorflow:Done calling model_fn.
2021-06-10 19:22:13,395 - INFO - Done calling model_fn.
INFO:tensorflow:Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from="/home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56/./data/model_ckpt/offline_train/read_comment/model.ckpt-2422", vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
2021-06-10 19:22:13,395 - INFO - Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from="/home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56/./data/model_ckpt/offline_train/read_comment/model.ckpt-2422", vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
INFO:tensorflow:Warm-starting from: ("/home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56/./data/model_ckpt/offline_train/read_comment/model.ckpt-2422",)
2021-06-10 19:22:13,395 - INFO - Warm-starting from: ("/home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56/./data/model_ckpt/offline_train/read_comment/model.ckpt-2422",)
INFO:tensorflow:Warm-starting variables only in TRAINABLE_VARIABLES.
2021-06-10 19:22:13,395 - INFO - Warm-starting variables only in TRAINABLE_VARIABLES.
Traceback (most recent call last):
  File "baseline_tune.py", line 372, in <module>
    ids, logits, action_uauc = best_model.evaluate(df=eval_df)
  File "baseline_tune.py", line 137, in evaluate
    predicts_df = pd.DataFrame.from_dict(predicts)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/pandas/core/frame.py", line 1309, in from_dict
    return cls(data, index=index, columns=columns, dtype=dtype)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/pandas/core/frame.py", line 502, in __init__
    data = list(data)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 622, in predict
    self._maybe_warm_start(checkpoint_path)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1630, in _maybe_warm_start
    warm_starting_util.warm_start(*self._warm_start_settings)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/training/warm_starting_util.py", line 476, in warm_start
    checkpoint_utils.init_from_checkpoint(ckpt_to_initialize_from, vocabless_vars)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1684, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1691, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 297, in _init_from_checkpoint
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 636, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/home/yuqing/anaconda3/envs/tune/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 648, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_a9181_00000_0_batch_size=256,crossed_columns=[['userid', 'bgm_singer_id', 'bgm_song_id'], ['userid', 'feedid'], ['feedi_2021-06-10_18-28-56/./data/model_ckpt/offline_train/read_comment/model.ckpt-2422
Stopping orca context

Is there any way to not show the value of crossed_columns in the filename? Thanks!
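One possible direction, as a sketch only: plain Ray Tune's tune.run accepts a trial_dirname_creator callable that receives the trial and returns the directory name. Whether zoo.orca.automl forwards such an argument is an assumption that this thread does not confirm. The helper short_trial_dirname below is hypothetical and only relies on the trial exposing trial_id and config; a stand-in object is used here so the snippet runs without Ray.

```python
import hashlib
import json
from types import SimpleNamespace


def short_trial_dirname(trial):
    """Build a short directory name that contains no config values or brackets.

    Hypothetical helper: assumes `trial` has `trial_id` and `config` attributes.
    """
    # Hash the config so distinct configs still get distinct names.
    config_text = json.dumps(trial.config, sort_keys=True, default=str)
    digest = hashlib.md5(config_text.encode("utf-8")).hexdigest()[:8]
    return "trial_{}_{}".format(trial.trial_id, digest)


# Stand-in for a Ray Tune trial, for illustration only.
fake_trial = SimpleNamespace(
    trial_id="a9181_00000",
    config={
        "batch_size": 256,
        "crossed_columns": [["userid", "feedid"], ["feedid", "authorid"]],
    },
)
dirname = short_trial_dirname(fake_trial)
print(dirname)
```

If the argument is available, it would be passed as tune.run(..., trial_dirname_creator=short_trial_dirname); the full config is still recorded by Tune in each trial's own files, so nothing is lost by leaving it out of the path.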

shanyu-sys commented 3 years ago

The reason might be that the restore step cannot resolve a file name containing an unbalanced number of "[" and "]".

Elena-Qiu commented 3 years ago

Oh. Maybe I can represent each candidate group of crossed columns by a number, so that the value looks like crossed_columns=[3, 2, 1], and then look up the real column names from these numbers in the 'get_feature_columns' function.
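A minimal sketch of that idea, with one variation that is an assumption rather than part of the original plan: the sampled indices are joined into a bracket-free string such as "0-2-5" instead of a list, so the value that ends up in the trial name contains no "[" or "]". The function names sample_cross_column_key and decode_cross_column_key are hypothetical.

```python
import random

# Candidate groups copied from get_search_space() in the original post.
CROSS_COLUMN_CANDIDATES = [
    ["userid", "bgm_singer_id", "bgm_song_id"],
    ["userid", "authorid"],
    ["userid", "feedid"],
    ["authorid", "bgm_singer_id", "bgm_song_id"],
    ["feedid", "bgm_singer_id", "bgm_song_id"],
    ["feedid", "authorid"],
]


def sample_cross_column_key(rng, min_size=3):
    """Sample distinct candidate indices and join them into a bracket-free key."""
    # random.randint includes both ends, so this yields min_size..len-1 groups,
    # the same range as np.random.randint(low=3, high=len(...)) in the post.
    size = rng.randint(min_size, len(CROSS_COLUMN_CANDIDATES) - 1)
    ids = sorted(rng.sample(range(len(CROSS_COLUMN_CANDIDATES)), size))
    return "-".join(str(i) for i in ids)


def decode_cross_column_key(key):
    """Map a key such as "0-2-5" back to the lists of column names."""
    return [CROSS_COLUMN_CANDIDATES[int(i)] for i in key.split("-")]


key = sample_cross_column_key(random.Random(0))
print(key, decode_cross_column_key(key))
```

In the search space, the sampler would take the place of the np.random.choice call, and get_feature_columns would call decode_cross_column_key on the sampled value.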

Elena-Qiu commented 3 years ago

It still does not seem to work:

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/yuqing/repo/friesian/wechat2021/baseline/wideANDdeep/data/auto_wnd/wnd/train_func_97b76_00003_3_batch_size=64,crossed_columns=[3, 2, 1, 5, 4],embed_dim=40,learning_rate=0.1_2021-06-11_11-03-27/./data/model_ckpt/offline_train/read_comment/model.ckpt-1
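A possible explanation for why balanced brackets fail as well, offered as a hypothesis: the message "Failed to find any matching files" suggests the path is treated as a glob-style pattern, and in glob syntax "[3, 2, 1, 5, 4]" is a character class that matches a single character. Whether TensorFlow's lookup follows exactly the same rules as Python's fnmatch is an assumption; the snippet below only demonstrates the effect with the standard library.

```python
import fnmatch
import glob

# Part of the trial directory name from the error message above.
name = "batch_size=64,crossed_columns=[3, 2, 1, 5, 4],embed_dim=40"

# Used as a pattern, "[3, 2, 1, 5, 4]" matches ONE character out of the set
# {'3', ',', ' ', '2', '1', '5', '4'}, so the literal name does not match itself.
matches_itself = fnmatch.fnmatchcase(name, name)

# glob.escape rewrites "[" as "[[]", which makes the pattern match the literal name.
escaped = glob.escape(name)
matches_escaped = fnmatch.fnmatchcase(name, escaped)

print(matches_itself, matches_escaped, escaped)
```

If this is indeed the cause, any encoding that keeps "[", "]", "*" and "?" out of the trial directory name should avoid the error.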