uclnlp / jack

Jack the Reader
MIT License

Issue training DAM wrt. word embedding matrix dimensions #314

Closed pminervini closed 6 years ago

pminervini commented 6 years ago

Training DAM on SNLI fails right at the start because the embedding lookup requests positions that do not exist in the word embedding matrix. To reproduce the error, run: python3 bin/jack-train.py with config='./conf/dam.yaml'.

I'll take this chance to look over the DAM code in jack and refresh it.

PS: DAM requires an initial NULL token and an UNK token. Are those supported by the current Vocab?
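To make the NULL/UNK question concrete, here is a minimal sketch of a vocabulary that reserves ids for both symbols. It is not jack's Vocab class: the token strings `<NULL>` and `<UNK>`, and the helpers `build_vocab` and `encode`, are hypothetical names used only for illustration. The point is that the embedding matrix needs one row per symbol, including the reserved ones.

```python
# Hypothetical reserved symbols; jack's Vocab may use different names or ids.
NULL, UNK = "<NULL>", "<UNK>"

def build_vocab(pretrained_words):
    """Map symbols to ids, reserving id 0 for NULL and id 1 for UNK."""
    sym2id = {NULL: 0, UNK: 1}
    for word in pretrained_words:
        if word not in sym2id:
            sym2id[word] = len(sym2id)
    return sym2id

def encode(tokens, sym2id):
    """Prepend the NULL token and map unseen tokens to UNK."""
    return [sym2id[NULL]] + [sym2id.get(t, sym2id[UNK]) for t in tokens]

sym2id = build_vocab(["a", "dog", "runs"])
ids = encode(["a", "cat", "runs"], sym2id)

# The embedding matrix must have exactly len(sym2id) rows, so that every
# id produced by encode() is a valid row index.
num_rows = len(sym2id)
assert all(0 <= i < num_rows for i in ids)
```

If the matrix is sized from the pre-trained vectors alone, the reserved symbols have no rows of their own, and any id at or beyond the number of pre-trained vectors falls outside the matrix.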


$ python3 bin/jack-train.py with config='./conf/dam.yaml'
WARNING - jack - No observers have been added to this run
INFO - jack - Running command 'main'
INFO - jack - Started
INFO - jack-train.py - TRAINING
INFO - jack-train.py - loaded train/dev/test data
INFO - jack.io.embeddings.glove - Loading GloVe vectors ..
INFO - jack.io.embeddings.glove - Loading GloVe vectors completed.
INFO - jack-train.py - loaded pre-trained embeddings (data/GloVe/glove.840B.300d.txt)
INFO - jack-train.py - Time since last checkpoint : 1.7min
INFO - jack - Running command 'print_config'
INFO - jack - Started
Configuration (modified, added, typechanged, doc):
  batch_size = 32
  clip_value = 0.0
  config = './conf/dam.yaml'
  debug = False
  debug_examples = 10
  description = 'A configuration inheriting from the default jack.yaml\n'
  dev = 'data/SNLI/snli_1.0/snli_1.0_dev.jsonl'
  dev_batch_size = 128
  dropout = 0.5
  embedding_file = 'data/GloVe/glove.840B.300d.txt'
  embedding_format = 'glove'
  epochs = 400
  experiments_db = './out/experiments.db'
  l2 = 0.0
  learning_rate = 0.001
  learning_rate_decay = 0.5
  loader = 'snli'
  log_interval = 100
  lowercase = True
  model = 'dam_snli_reader'
  model_dir = './dam_snli_reader'
  name = None
  normalize_pretrain = False
  optimizer = 'adam'
  output_dir = './out/'
  parent_config = './conf/jack.yaml'
  prune = False
  repr_dim = 300
  repr_dim_input = 300
  seed = 1337
  tensorboard_folder = None
  test = None
  train = 'data/SNLI/snli_1.0/snli_1.0_train.jsonl'
  train_pretrain = False
  validation_interval = None
  vocab_from_embeddings = True
  vocab_maxsize = 1000000000000
  vocab_minfreq = 2
  vocab_sep = True
  with_char_embeddings = True
  write_metrics_to = None
INFO - jack - Completed after 0:00:00
2017-10-24 22:49:30.559456: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 22:49:30.559480: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 22:49:30.559487: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 22:49:30.559493: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-24 22:49:30.559500: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO - jack-train.py - Time since last checkpoint : 0.0035min
INFO - jack.core.reader - Setting up data and model...
INFO - jack.readers.natural_language_inference.decomposable_attention - Building the Attend graph ..
INFO - jack.readers.natural_language_inference.decomposable_attention - Building the Compare graph ..
INFO - jack.readers.natural_language_inference.decomposable_attention - Building the Aggregate graph ..
INFO - jack.core.reader - Start training...
ERROR - jack - Failed after 0:06:09!
Traceback (most recent calls WITHOUT Sacred internals):
  File "bin/jack-train.py", line 154, in main
    jtrain(reader, train_data, test_data, dev_data, configuration, debug=debug)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/train_reader.py", line 94, in train
    l2=l2, clip=clip_value, clip_op=tf.clip_by_value)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/core/reader.py", line 264, in train
    current_loss, _ = self.session.run([loss, min_op], feed_dict=feed_dict)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[13,5] = 2196016 is not in [0, 2196015)
         [[Node: dam_snli_reader/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@dam_snli_reader/emb_Q"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](dam_snli_reader/emb_Q/read, _arg_dam_snli_reader/question_0_1)]]

Caused by op 'dam_snli_reader/embedding_lookup', defined at:
  File "bin/jack-train.py", line 60, in <module>
    @ex.automain
  File "/home/jack/workspace/jack/.eggs/sacred-0.7.1-py3.6.egg/sacred/experiment.py", line 131, in automain
    self.run_commandline()
  File "/home/jack/workspace/jack/.eggs/sacred-0.7.1-py3.6.egg/sacred/experiment.py", line 245, in run_commandline
    return self.run(cmd_name, config_updates, named_configs, {}, args)
  File "/home/jack/workspace/jack/.eggs/sacred-0.7.1-py3.6.egg/sacred/experiment.py", line 189, in run
    run()
  File "/home/jack/workspace/jack/.eggs/sacred-0.7.1-py3.6.egg/sacred/run.py", line 229, in __call__
    self.result = self.main_function(*args)
  File "/home/jack/workspace/jack/.eggs/sacred-0.7.1-py3.6.egg/sacred/config/captured_function.py", line 47, in captured_function
    result = wrapped(*args, **kwargs)
  File "bin/jack-train.py", line 154, in main
    jtrain(reader, train_data, test_data, dev_data, configuration, debug=debug)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/train_reader.py", line 94, in train
    l2=l2, clip=clip_value, clip_op=tf.clip_by_value)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/core/reader.py", line 237, in train
    self.setup_from_data(training_set, is_training=True)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/core/reader.py", line 141, in setup_from_data
    self.model_module.setup(is_training)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/core/model_module.py", line 162, in setup
    self.shared_resources, *[self._tensors[port] for port in self.input_ports])
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/readers/multiple_choice/shared.py", line 73, in create_output
    shared_resources.config['answer_size'])
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/jack-0.1.0-py3.6.egg/jack/readers/natural_language_inference/decomposable_attention.py", line 24, in forward_pass
    question_embedding = tf.nn.embedding_lookup(self.question_embedding_matrix, question)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 294, in embedding_lookup
    transform_fn=None)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 123, in _embedding_lookup_and_transform
    result = _gather_and_clip(params[0], ids, max_norm, name=name)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 57, in _gather_and_clip
    embs = array_ops.gather(params, ids, name=name)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2409, in gather
    validate_indices=validate_indices, name=name)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1219, in gather
    validate_indices=validate_indices, name=name)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/jack/.pyenv/versions/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[13,5] = 2196016 is not in [0, 2196015)
         [[Node: dam_snli_reader/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@dam_snli_reader/emb_Q"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](dam_snli_reader/emb_Q/read, _arg_dam_snli_reader/question_0_1)]]

pminervini commented 6 years ago

The problem seems to arise when I add the vocab_from_embeddings: True flag. @TimDettmers, thanks for suggesting that I ablate the options in the config file.
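For anyone debugging the same error: the message says the matrix has 2196015 rows (valid ids 0 to 2196014), while the batch contains id 2196016. A check like the sketch below, run on a batch before it is fed to the session, would report the offending positions. The helper `find_out_of_range` is a hypothetical name, and the toy batch only mirrors the numbers in the error message, not real data.

```python
def find_out_of_range(batch, num_rows):
    """Return (row, col, index) for every id that falls outside [0, num_rows)."""
    bad = []
    for r, row in enumerate(batch):
        for c, idx in enumerate(row):
            if not 0 <= idx < num_rows:
                bad.append((r, c, idx))
    return bad

# Number of rows in the embedding matrix, taken from the error message.
num_rows = 2196015

# Toy batch of token ids (illustrative values only).
batch = [[0, 5, 17], [3, 2196016, 2196014]]

print(find_out_of_range(batch, num_rows))  # prints [(1, 1, 2196016)]
```

An empty result means every id in the batch is a valid row index. A non-empty one points to a mismatch between the vocabulary size and the number of rows in the embedding matrix.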