tensorflow / ranking

Learning to Rank in TensorFlow
Apache License 2.0

How long does sparse-model run for? #107

Closed mulangonando closed 5 years ago

mulangonando commented 5 years ago

Hi guys,

I have been running the sparse-model for the last 5 days on a GPU server and I can't see anything in my models directory, meaning it did not even get to the first checkpoint. Has anyone had experience with this?

My features are many though (about 900K), but I still expected to be past the first checkpoint.

Any hints would help here.

Thanks.

ramakumar1729 commented 5 years ago

@mulangonando: What is the list_size you are using? This corresponds to the number of documents per query. I would recommend starting with a smaller list_size and a smaller number of features, and working up from there.

mulangonando commented 5 years ago

My list size is 1025. Wow, the number of features is at 1M.
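A back-of-the-envelope estimate shows why these two numbers matter together. The sketch below assumes every feature is materialized as a dense int64 value for every document, and borrows the batch size of 32 from the code posted later in this thread; the function name and the "reduced" figures are hypothetical and only for illustration.

```python
def dense_batch_bytes(batch_size, list_size, num_features, bytes_per_value=8):
    """Bytes needed to hold one batch if every feature is stored densely."""
    return batch_size * list_size * num_features * bytes_per_value


# list_size and feature count come from this thread; batch size 32 and
# int64 (8-byte) storage are assumptions.
original = dense_batch_bytes(batch_size=32, list_size=1025, num_features=1_000_000)

# Hypothetical smaller configuration, in the spirit of "start small and work up".
reduced = dense_batch_bytes(batch_size=32, list_size=250, num_features=100_000)

print(f"original: {original / 1e9:.1f} GB per batch")
print(f"reduced:  {reduced / 1e9:.1f} GB per batch")
```

Under these assumptions a single batch of the original configuration would need far more memory than a typical GPU offers, which is consistent with the advice to shrink list_size and the feature count first.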

mulangonando commented 5 years ago

Let me see. The challenge is that generating the data alone takes me a while. Thanks, let me try reducing these aspects. I was thinking perhaps I may need to add a saver?

ramakumar1729 commented 5 years ago

What is the saver you are referring to?

mulangonando commented 5 years ago

Like, do I need to add such a line somewhere in the code? Or does the code already take care of saving the model?

saver = tf.compat.v1.train.Saver([v1, v2])
saver = tf.compat.v1.train.Saver({v.op.name: v for v in [v1, v2]})

Because I see the TensorFlow tutorial says something like: "Note that you still have to call the save() method to save the model. Passing these arguments to the constructor will not save variables automatically for you."

I am not so experienced with TensorFlow; this task is an intro to it as well.

ramakumar1729 commented 5 years ago

Since we use the Estimator API to build models, it does save checkpoints when you train.
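A quick way to confirm whether any checkpoint has been written is to look for the checkpoint files in the model directory. The sketch below assumes the usual TF 1.x Estimator file naming (for example `model.ckpt-100.index`); the helper name `saved_checkpoint_steps` is hypothetical and is not part of TensorFlow or TF-Ranking.

```python
import os
import re
import tempfile

# Assumed naming convention: TF 1.x Estimators write files such as
# "model.ckpt-100.index" into model_dir.
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def saved_checkpoint_steps(model_dir):
    """Returns the sorted global steps for which a checkpoint index file exists."""
    if not os.path.isdir(model_dir):
        return []
    steps = []
    for name in os.listdir(model_dir):
        match = _CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


# Self-contained demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_steps = saved_checkpoint_steps(demo_dir)
    for step in (0, 100):
        open(os.path.join(demo_dir, f"model.ckpt-{step}.index"), "w").close()
    found_steps = saved_checkpoint_steps(demo_dir)

print(empty_steps, found_steps)
```

An empty result after a long run suggests that training never reached its first step, rather than that saving is misconfigured.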

mulangonando commented 5 years ago

OK, I'm restructuring the code to use fewer params and a shorter list. Just to be clear, the sparse version of the ranking truncates long lists, as opposed to first shuffling and then truncating. I want a behavior that keeps the first 10 items in the list in order, and then truncates the rest.
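One possible reading of this requirement can be sketched as a preprocessing step applied to each query's document list before the data is written out. This is not a TF-Ranking API: the function name, the `head` parameter, and the choice to randomly sample the remaining slots from the tail are all assumptions made for illustration.

```python
import random


def keep_head_sample_tail(docs, list_size, head=10, seed=None):
    """Keeps the first `head` docs in order, then fills up to list_size from the rest."""
    if len(docs) <= list_size:
        return list(docs)
    kept = list(docs[:head])
    tail = list(docs[head:])
    slots = max(list_size - len(kept), 0)
    # Sample tail positions, then sort so the survivors keep their relative order.
    chosen = sorted(random.Random(seed).sample(range(len(tail)), slots))
    return kept[:list_size] + [tail[i] for i in chosen]


docs = list(range(30))  # stand-in for 30 ranked documents
result = keep_head_sample_tail(docs, list_size=15, seed=0)
print(result)
```

If plain truncation of the tail is preferred over sampling, replacing `chosen` with `range(slots)` gives that behavior.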

mulangonando commented 5 years ago

Hi @ramakumar1729 ,

Actually, it seems my code never starts training in the first place. The GPUs are enumerated in the output, then nothing more. The process is also not listed on the GPUs. [Yes, I am using tensorflow-gpu 1.14 with CUDA 10, but the code never gets placed.]

I guess it never begins training. There's this funny warning:

WARNING:tensorflow:Estimator's model_fn (<function make_groupwise_ranking_fn.<locals>._model_fn at 0x7ff7c0cc2b70>) includes params argument, but params are not passed to Estimator.

Could there be something I missed? Why would the params not be passed if I used the code shared? And should I not expect more logging info after the GPU devices have been found and enumerated? Should the process not be listed under nvidia-smi?

Sorry, too many questions. Just can't seem to get the sparse features code running right.

Thanks
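The warning quoted above appears to be informational: it is logged when the model function accepts a `params` argument but the Estimator was constructed without `params=...`, and it does not by itself stop training. The following is a simplified stand-in written for illustration, not TensorFlow's actual code; `check_model_fn_params` and the dummy `_model_fn` are hypothetical names.

```python
import inspect
import warnings


def check_model_fn_params(model_fn, params=None):
    """Simplified stand-in for the kind of check that produces the warning above."""
    accepts_params = "params" in inspect.signature(model_fn).parameters
    if params is not None and not accepts_params:
        raise ValueError("model_fn does not accept params, but params were given.")
    if params is None and accepts_params:
        # Only a warning: construction continues.
        warnings.warn("model_fn includes params argument, but params are not passed.")
        return False
    return True


def _model_fn(features, labels, mode, params, config):
    """Dummy model_fn with a params argument, like the one named in the warning."""
    return None


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok_without = check_model_fn_params(_model_fn)          # warns, does not fail
    ok_with = check_model_fn_params(_model_fn, params={})  # silent

print(ok_without, ok_with, len(caught))
```

Under this reading, passing an explicit `params={}` when constructing the Estimator would silence the message, but the stalled training would need a separate explanation.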

ramakumar1729 commented 5 years ago

@mulangonando Can you clear the contents of your model directory and rerun, and see if you are still getting this error?

Looks like this is related to this issue: https://github.com/tensorflow/models/issues/5790

mulangonando commented 5 years ago

The model directory is always empty. Has never had anything written in it. I keep checking that one.

mulangonando commented 5 years ago

Two major issues here: the process does not get placed on the GPU, and no output goes to the model directory.

ramakumar1729 commented 5 years ago

Are you using tensorflow-gpu package instead of tensorflow, for GPU compatibility?

mulangonando commented 5 years ago

Yes, tensorflow-gpu 1.14. I have tried 1.13 too, but hit a similar problem.

On Wed, 4 Sep 2019, 20:54 Rama Kumar Pasumarthi, notifications@github.com wrote:

Are you using tensorflow-gpu package instead of tensorflow, for GPU compatibility?


ramakumar1729 commented 5 years ago

Can you share the code? This seems to be a generic TensorFlow issue, and not particular to TF-Ranking.

mulangonando commented 5 years ago

Below is pretty much the whole code [Mostly from the sample code given]

import six
import os
import numpy as np
import tensorflow_ranking as tfr
import tensorflow as tf

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

tf.compat.v1.enable_eager_execution()
tf.compat.v1.set_random_seed(1234)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# TRAINING PARAMETERS

_TRAIN_DATA_PATH = "data/train-1000-framenet-BERT-context-rel.tfrecords"
_TEST_DATA_PATH = "data/dev-framenet-BERT-context-rel.tfrecords"
_VOCAB_PATH = "data/vocab.txt"
_LIST_SIZE = 250
_LABEL_FEATURE = "relevance"
_PADDING_LABEL = -1
_LEARNING_RATE = 0.05
_BATCH_SIZE = 32
_HIDDEN_LAYER_DIMS = ["64", "32", "16"]
_DROPOUT_RATE = 0.8
_GROUP_SIZE = 5  # Pointwise scoring.
_MODEL_DIR = "models/model"
_NUM_TRAIN_STEPS = 15 * 1000
_CHECKPOINT_DIR = "chk_points"
_EMBEDDING_DIMENSION = 20

def context_feature_columns():
    query_column = tf.feature_column.categorical_column_with_vocabulary_file(
        key="query_tokens", vocabulary_file=_VOCAB_PATH)
    query_embedding_column = tf.feature_column.embedding_column(
        query_column, _EMBEDDING_DIMENSION)

    answ_column = tf.feature_column.categorical_column_with_vocabulary_file(
        key="answer_tokens", vocabulary_file=_VOCAB_PATH)
    answ_embedding_column = tf.feature_column.embedding_column(
        answ_column, _EMBEDDING_DIMENSION)

    qid = tf.feature_column.numeric_column(key="qid", dtype=tf.int64)

    overall_bert_encoding_column = tf.feature_column.numeric_column(
        key="overal_bert_context_encoding_out", shape=768)

    context_features = {
        "query_tokens": query_embedding_column,
        "answer_tokens": answ_embedding_column,
        "qid": qid,
        "overal_bert_context_encoding_out": overall_bert_encoding_column,
    }

    return context_features

def example_feature_columns():
    """Returns the example feature columns."""
    expl_column = tf.feature_column.categorical_column_with_vocabulary_file(
        key="expl_tokens", vocabulary_file=_VOCAB_PATH)
    expl_embedding_column = tf.feature_column.embedding_column(
        expl_column, _EMBEDDING_DIMENSION)

    relevance = tf.feature_column.numeric_column(
        key="relevance", dtype=tf.int64, default_value=_PADDING_LABEL)

    examples_features = {
        "expl_tokens": expl_embedding_column,
        "relevance": relevance,
    }

    # One numeric column per sparse feature id (about 400K columns).
    for fea in range(1, 402212):
        # id, value = fea.split(":")
        try:
            feat = tf.feature_column.numeric_column(
                key=str(fea), dtype=tf.int64, default_value=0)
            examples_features["" + str(fea)] = feat
        except:
            continue

        # example[id] = tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

    return examples_features

# Reading Input Data using input_fn

def input_fn(path, num_epochs=None):
    context_feature_spec = tf.feature_column.make_parse_example_spec(
        context_feature_columns().values())
    label_column = tf.feature_column.numeric_column(
        _LABEL_FEATURE, dtype=tf.int64, default_value=_PADDING_LABEL)

    example_feature_spec = tf.feature_column.make_parse_example_spec(
        list(example_feature_columns().values()) + [label_column])
    dataset = tfr.data.build_ranking_dataset(
        file_pattern=path,
        data_format=tfr.data.EIE,
        batch_size=_BATCH_SIZE,
        list_size=_LIST_SIZE,
        context_feature_spec=context_feature_spec,
        example_feature_spec=example_feature_spec,
        reader=tf.data.TFRecordDataset,
        shuffle=False,
        num_epochs=num_epochs)
    features = tf.data.make_one_shot_iterator(dataset).get_next()
    label = tf.squeeze(features.pop(_LABEL_FEATURE), axis=2)
    label = tf.cast(label, tf.float32)

    return features, label

# Transform Input

def make_transform_fn():
    def _transform_fn(features, mode):
        """Defines transform_fn."""
        example_name = next(six.iterkeys(example_feature_columns()))
        input_size = tf.shape(input=features[example_name])[1]
        context_features, example_features = tfr.feature.encode_listwise_features(
            features=features,
            input_size=input_size,
            context_feature_columns=context_feature_columns(),
            example_feature_columns=example_feature_columns(),
            mode=mode,
            scope="transform_layer")

        return context_features, example_features

    return _transform_fn

# Feature Interactions using scoring_fn

def make_score_fn():
    """Returns a scoring function to build EstimatorSpec."""

    def _score_fn(context_features, group_features, mode, params, config):
        """Defines the network to score a group of documents."""
        with tf.compat.v1.name_scope("input_layer"):
            context_input = [
                tf.compat.v1.layers.flatten(context_features[name])
                for name in sorted(context_feature_columns())
            ]
            group_input = [
                tf.compat.v1.layers.flatten(group_features[name])
                for name in sorted(example_feature_columns())
            ]
            input_layer = tf.concat(context_input + group_input, 1)

        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        cur_layer = input_layer
        cur_layer = tf.compat.v1.layers.batch_normalization(
            cur_layer,
            training=is_training,
            momentum=0.99)

        for i, layer_width in enumerate(int(d) for d in _HIDDEN_LAYER_DIMS):
            cur_layer = tf.compat.v1.layers.dense(cur_layer, units=layer_width)
            cur_layer = tf.compat.v1.layers.batch_normalization(
                cur_layer,
                training=is_training,
                momentum=0.99)
            cur_layer = tf.nn.relu(cur_layer)
            cur_layer = tf.compat.v1.layers.dropout(
                inputs=cur_layer, rate=_DROPOUT_RATE, training=is_training)
        logits = tf.compat.v1.layers.dense(cur_layer, units=_GROUP_SIZE)
        return logits

    return _score_fn

# Losses, Metrics and Ranking Head

# Evaluation Metrics

def eval_metric_fns():
    metric_fns = {}
    metric_fns.update({
        "metric/ndcg@%d" % topn: tfr.metrics.make_ranking_metric_fn(
            tfr.metrics.RankingMetricKey.NDCG, topn=topn)
        for topn in [1, 3, 5, 10]
    })

    return metric_fns

_LOSS = tfr.losses.RankingLossKey.APPROX_NDCG_LOSS
loss_fn = tfr.losses.make_loss_fn(_LOSS)

# Ranking Head

optimizer = tf.compat.v1.train.AdagradOptimizer(
    learning_rate=_LEARNING_RATE)

def _train_op_fn(loss):
    """Defines train op used in ranking head."""
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    minimize_op = optimizer.minimize(
        loss=loss, global_step=tf.compat.v1.train.get_global_step())
    train_op = tf.group([update_ops, minimize_op])
    return train_op

ranking_head = tfr.head.create_ranking_head(
    loss_fn=loss_fn,
    eval_metric_fns=eval_metric_fns(),
    train_op_fn=_train_op_fn)

# Putting It All Together in a Model Builder

model_fnc = tfr.model.make_groupwise_ranking_fn(
    group_score_fn=make_score_fn(),
    transform_fn=make_transform_fn(),
    group_size=_GROUP_SIZE,
    ranking_head=ranking_head)

# Train and evaluate the ranker

def train_and_eval_fn():
    config_proto = tf.ConfigProto(
        device_count={'GPU': 3},
        log_device_placement=False,
        allow_soft_placement=False)

    config_proto.gpu_options.per_process_gpu_memory_fraction = 0.8
    config_proto.gpu_options.allow_growth = True

    run_config = tf.estimator.RunConfig(
        save_checkpoints_steps=100,
        model_dir=_MODEL_DIR,
        keep_checkpoint_max=5,
        keep_checkpoint_every_n_hours=5,
        session_config=config_proto,
        save_summary_steps=100,
        log_step_count_steps=100)

    ranker = tf.estimator.Estimator(
        model_fn=model_fnc,
        model_dir=_MODEL_DIR,
        config=run_config)

    train_input_fn = lambda: input_fn(_TRAIN_DATA_PATH)
    eval_input_fn = lambda: input_fn(_TEST_DATA_PATH, num_epochs=1)

    train_spec = tf.estimator.TrainSpec(
        input_fn=train_input_fn, max_steps=_NUM_TRAIN_STEPS)
    eval_spec = tf.estimator.EvalSpec(
        name="eval", input_fn=eval_input_fn, throttle_secs=15)
    return (ranker, train_spec, eval_spec)

if __name__ == "__main__":
    ranker, train_spec, eval_spec = train_and_eval_fn()
    tf.estimator.train_and_evaluate(ranker, train_spec, eval_spec)
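One detail of the posted code that may be worth checking: `example_feature_columns()` is called from `input_fn`, `_transform_fn`, and `_score_fn`, and each call rebuilds roughly 400K column objects. Whether this is what stalls the run is not established in this thread. The sketch below shows one way to build the dictionary only once, using a hypothetical `make_column` stand-in instead of `tf.feature_column.numeric_column` so that it runs without TensorFlow.

```python
import functools

_NUM_SPARSE_FEATURES = 402211  # range(1, 402212) in the posted code

build_count = 0  # counts how many times the expensive build actually runs


def make_column(key):
    """Hypothetical stand-in for tf.feature_column.numeric_column(key=...)."""
    return ("numeric_column", key)


@functools.lru_cache(maxsize=None)
def cached_example_feature_columns(num_features=_NUM_SPARSE_FEATURES):
    """Builds the column dict once per argument; later calls reuse the same dict."""
    global build_count
    build_count += 1
    return {str(i): make_column(str(i)) for i in range(1, num_features + 1)}


# A small feature count keeps the demonstration fast.
first = cached_example_feature_columns(1000)
second = cached_example_feature_columns(1000)
print(first is second, build_count)
```

Because the cached dictionary is shared between callers, it should be treated as read-only; a caller that needs to modify it should copy it first.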

ramakumar1729 commented 5 years ago

Hi @mulangonando: Sorry for the delay in my reply. This code looks quite similar to the scripts in ranking/examples. It would be helpful if you could share a more specific issue, preferably one that is particular to TF-Ranking. If you are facing TensorFlow-related issues, I would recommend the TensorFlow issue tracker or Stack Overflow, which address a lot of commonly seen problems.

mulangonando commented 5 years ago

Hi @ramakumar1729, I guess yes, it's TF related. Let me close this for now and update when I finally work out a way around it. Thanks otherwise.