@mulangonando : What is the `list_size` you are using? It corresponds to the number of documents per query. I would recommend starting with a smaller `list_size` and a smaller number of features, and working up from there.
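For instance, a minimal sketch of where `list_size` enters the input pipeline (toy parse specs and a hypothetical file path; substitute your own features):

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# Toy parse specs; replace with the real context/example features.
context_feature_spec = {"query_tokens": tf.io.VarLenFeature(tf.string)}
example_feature_spec = {
    "relevance": tf.io.FixedLenFeature([1], tf.int64, default_value=[-1])}

# list_size caps the number of documents kept per query, so lowering it
# directly shrinks memory and compute per batch.
dataset = tfr.data.build_ranking_dataset(
    file_pattern="data/train.tfrecords",  # hypothetical path
    data_format=tfr.data.EIE,
    batch_size=32,
    list_size=50,  # start small and work up
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec,
    reader=tf.data.TFRecordDataset)
```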
My list size is 1025. Wow, the number of features is at 1M.
Let me see. The challenge is that generating the data alone takes me a while. Thanks, let me try reducing these aspects. I was thinking perhaps I may need to add a saver?
What is the saver you are referring to?
Like, do I need to add such a line somewhere in the code? Or does the code already take care of saving the model?
```python
saver = tf.compat.v1.train.Saver([v1, v2])
saver = tf.compat.v1.train.Saver({v.op.name: v for v in [v1, v2]})
```
Because I see the TensorFlow tutorial says something like: "Note that you still have to call the save() method to save the model. Passing these arguments to the constructor will not save variables automatically for you."
I am not so experienced with TensorFlow; this task is an intro to it as well.
Since we use the Estimator API to build models, it does save checkpoints when you train.
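A minimal sketch of that behavior (here `my_model_fn` and `my_input_fn` are placeholders for your own functions):

```python
# Checkpointing with the Estimator API is driven by RunConfig; no manual
# tf.compat.v1.train.Saver is needed.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=100)  # write a checkpoint every 100 steps
estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,        # placeholder
    model_dir="models/model",    # checkpoints appear here during train()
    config=run_config)
estimator.train(input_fn=my_input_fn, max_steps=1000)
```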
OK, I'm restructuring the code to use fewer params and a shorter list. Just to be clear: does the sparse version of the ranking truncate long lists directly, as opposed to first shuffling and then truncating? I want behavior that keeps the first 10 items of the list in their order and then truncates the rest.
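In plain Python terms, the behavior I'm after is just a head-cut (toy illustration only):

```python
# Truncate without shuffling first, so the leading items keep their order.
docs = ["doc_%d" % i for i in range(25)]  # toy list of 25 documents
list_size = 10
truncated = docs[:list_size]  # first 10 items, original order preserved
```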
Hi @ramakumar1729 ,
Actually, it seems my code never starts training in the first place. The GPUs are enumerated in the output, then nothing more. Also, the process is not listed on the GPUs. [Yes, I am using tensorflow-gpu 1.14 and CUDA 10, but the code never gets placed.]
I guess it never begins training; there's this odd warning:
```
WARNING:tensorflow:Estimator's model_fn (<function make_groupwise_ranking_fn.
```
Could there be something I missed? Why would the params not be passed if I used the code as shared? And should I not expect more logging info after the GPU devices have been found and enumerated? Should the process not be listed under nvidia-smi?
Sorry, too many questions. Just can't seem to get the sparse features code running right.
Thanks
@mulangonando Can you clear the contents of your model directory and rerun, and see if you are still getting this error?
Looks like this is related to this issue: https://github.com/tensorflow/models/issues/5790
The model directory is always empty. Nothing has ever been written to it; I keep checking it.
Two major issues here: the process does not get placed on a GPU, and no output goes to the model directory.
Are you using tensorflow-gpu package instead of tensorflow, for GPU compatibility?
Yes, tensorflow-gpu 1.14; I have tried 1.13 too, but with a similar problem.
Can you share the code? This seems a generic Tensorflow issue, and not particular to TF-Ranking.
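As a first check (plain TF 1.x, nothing TF-Ranking specific), can you confirm TensorFlow actually sees the GPUs?

```python
# List the devices TensorFlow can see; if no '/device:GPU:*' entries show
# up, the tensorflow-gpu / CUDA install is the problem, not the model code.
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])
```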
Below is pretty much the whole code (mostly from the sample code given):
```python
import os

import six
import numpy as np
import tensorflow as tf
import tensorflow_ranking as tfr

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

tf.compat.v1.enable_eager_execution()
tf.compat.v1.set_random_seed(1234)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

_TRAIN_DATA_PATH = "data/train-1000-framenet-BERT-context-rel.tfrecords"
_TEST_DATA_PATH = "data/dev-framenet-BERT-context-rel.tfrecords"
_VOCAB_PATH = "data/vocab.txt"
_LIST_SIZE = 250
_LABEL_FEATURE = "relevance"
_PADDING_LABEL = -1
_LEARNING_RATE = 0.05
_BATCH_SIZE = 32
_HIDDEN_LAYER_DIMS = ["64", "32", "16"]
_DROPOUT_RATE = 0.8
_GROUP_SIZE = 5  # Groupwise scoring.
_MODEL_DIR = "models/model"
_NUM_TRAIN_STEPS = 15 * 1000
_CHECKPOINT_DIR = "chk_points"
_EMBEDDING_DIMENSION = 20


def context_feature_columns():
  """Returns the context feature columns."""
  query_column = tf.feature_column.categorical_column_with_vocabulary_file(
      key="query_tokens", vocabulary_file=_VOCAB_PATH)
  query_embedding_column = tf.feature_column.embedding_column(
      query_column, _EMBEDDING_DIMENSION)
  answ_column = tf.feature_column.categorical_column_with_vocabulary_file(
      key="answer_tokens", vocabulary_file=_VOCAB_PATH)
  answ_embedding_column = tf.feature_column.embedding_column(
      answ_column, _EMBEDDING_DIMENSION)
  qid = tf.feature_column.numeric_column(key="qid", dtype=tf.int64)
  overall_bert_encoding_column = tf.feature_column.numeric_column(
      key="overal_bert_context_encoding_out", shape=768)
  return {
      "query_tokens": query_embedding_column,
      "answer_tokens": answ_embedding_column,
      "qid": qid,
      "overal_bert_context_encoding_out": overall_bert_encoding_column,
  }


def example_feature_columns():
  """Returns the example feature columns."""
  expl_column = tf.feature_column.categorical_column_with_vocabulary_file(
      key="expl_tokens", vocabulary_file=_VOCAB_PATH)
  expl_embedding_column = tf.feature_column.embedding_column(
      expl_column, _EMBEDDING_DIMENSION)
  relevance = tf.feature_column.numeric_column(
      key="relevance", dtype=tf.int64, default_value=_PADDING_LABEL)
  examples_features = {
      "expl_tokens": expl_embedding_column,
      "relevance": relevance,
  }
  # One numeric column per sparse feature id: this loop alone creates
  # ~400K feature columns.
  for fea in range(1, 402212):
    examples_features[str(fea)] = tf.feature_column.numeric_column(
        key=str(fea), dtype=tf.int64, default_value=0)
  return examples_features


def input_fn(path, num_epochs=None):
  context_feature_spec = tf.feature_column.make_parse_example_spec(
      context_feature_columns().values())
  label_column = tf.feature_column.numeric_column(
      _LABEL_FEATURE, dtype=tf.int64, default_value=_PADDING_LABEL)
  example_feature_spec = tf.feature_column.make_parse_example_spec(
      list(example_feature_columns().values()) + [label_column])
  dataset = tfr.data.build_ranking_dataset(
      file_pattern=path,
      data_format=tfr.data.EIE,
      batch_size=_BATCH_SIZE,
      list_size=_LIST_SIZE,
      context_feature_spec=context_feature_spec,
      example_feature_spec=example_feature_spec,
      reader=tf.data.TFRecordDataset,
      shuffle=False,
      num_epochs=num_epochs)
  features = tf.data.make_one_shot_iterator(dataset).get_next()
  label = tf.squeeze(features.pop(_LABEL_FEATURE), axis=2)
  label = tf.cast(label, tf.float32)
  return features, label


def make_transform_fn():
  def _transform_fn(features, mode):
    """Defines transform_fn."""
    example_name = next(six.iterkeys(example_feature_columns()))
    input_size = tf.shape(input=features[example_name])[1]
    context_features, example_features = tfr.feature.encode_listwise_features(
        features=features,
        input_size=input_size,
        context_feature_columns=context_feature_columns(),
        example_feature_columns=example_feature_columns(),
        mode=mode,
        scope="transform_layer")
    return context_features, example_features
  return _transform_fn


def make_score_fn():
  """Returns a scoring function to build EstimatorSpec."""
  def _score_fn(context_features, group_features, mode, params, config):
    """Defines the network to score a group of documents."""
    with tf.compat.v1.name_scope("input_layer"):
      context_input = [
          tf.compat.v1.layers.flatten(context_features[name])
          for name in sorted(context_feature_columns())
      ]
      group_input = [
          tf.compat.v1.layers.flatten(group_features[name])
          for name in sorted(example_feature_columns())
      ]
      input_layer = tf.concat(context_input + group_input, 1)
    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    cur_layer = tf.compat.v1.layers.batch_normalization(
        input_layer, training=is_training, momentum=0.99)
    for i, layer_width in enumerate(int(d) for d in _HIDDEN_LAYER_DIMS):
      cur_layer = tf.compat.v1.layers.dense(cur_layer, units=layer_width)
      cur_layer = tf.compat.v1.layers.batch_normalization(
          cur_layer, training=is_training, momentum=0.99)
      cur_layer = tf.nn.relu(cur_layer)
      cur_layer = tf.compat.v1.layers.dropout(
          inputs=cur_layer, rate=_DROPOUT_RATE, training=is_training)
    logits = tf.compat.v1.layers.dense(cur_layer, units=_GROUP_SIZE)
    return logits
  return _score_fn


def eval_metric_fns():
  metric_fns = {}
  metric_fns.update({
      "metric/ndcg@%d" % topn: tfr.metrics.make_ranking_metric_fn(
          tfr.metrics.RankingMetricKey.NDCG, topn=topn)
      for topn in [1, 3, 5, 10]
  })
  return metric_fns


_LOSS = tfr.losses.RankingLossKey.APPROX_NDCG_LOSS
loss_fn = tfr.losses.make_loss_fn(_LOSS)

optimizer = tf.compat.v1.train.AdagradOptimizer(learning_rate=_LEARNING_RATE)


def _train_op_fn(loss):
  """Defines train op used in ranking head."""
  update_ops = tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.UPDATE_OPS)
  minimize_op = optimizer.minimize(
      loss=loss, global_step=tf.compat.v1.train.get_global_step())
  return tf.group([update_ops, minimize_op])


ranking_head = tfr.head.create_ranking_head(
    loss_fn=loss_fn,
    eval_metric_fns=eval_metric_fns(),
    train_op_fn=_train_op_fn)

model_fnc = tfr.model.make_groupwise_ranking_fn(
    group_score_fn=make_score_fn(),
    transform_fn=make_transform_fn(),
    group_size=_GROUP_SIZE,
    ranking_head=ranking_head)


def train_and_eval_fn():
  # Note: allow_soft_placement=False makes TF error out instead of falling
  # back to CPU when an op cannot be placed on GPU.
  config_proto = tf.compat.v1.ConfigProto(
      device_count={'GPU': 3},
      log_device_placement=False,
      allow_soft_placement=False)
  config_proto.gpu_options.per_process_gpu_memory_fraction = 0.8
  config_proto.gpu_options.allow_growth = True
  run_config = tf.estimator.RunConfig(
      save_checkpoints_steps=100,
      keep_checkpoint_max=5,
      keep_checkpoint_every_n_hours=5,
      session_config=config_proto,
      save_summary_steps=100,
      log_step_count_steps=100)
  ranker = tf.estimator.Estimator(
      model_fn=model_fnc, model_dir=_MODEL_DIR, config=run_config)
  train_input_fn = lambda: input_fn(_TRAIN_DATA_PATH)
  eval_input_fn = lambda: input_fn(_TEST_DATA_PATH, num_epochs=1)
  train_spec = tf.estimator.TrainSpec(
      input_fn=train_input_fn, max_steps=_NUM_TRAIN_STEPS)
  eval_spec = tf.estimator.EvalSpec(
      name="eval", input_fn=eval_input_fn, throttle_secs=15)
  return ranker, train_spec, eval_spec


if __name__ == "__main__":
  ranker, train_spec, eval_spec = train_and_eval_fn()
  tf.estimator.train_and_evaluate(ranker, train_spec, eval_spec)
```
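(For anyone debugging the same hang: a minimal smoke test, assuming the definitions above, is to pull a single batch from `input_fn` directly; if this also stalls, the input pipeline rather than training is the bottleneck.)

```python
# Build the input pipeline in its own graph and fetch one batch.
# Works even with eager enabled above, since the explicit Graph context
# switches back to graph mode (this is also what Estimator does).
with tf.Graph().as_default():
    features, label = input_fn(_TRAIN_DATA_PATH)
    with tf.compat.v1.Session() as sess:
        _, label_batch = sess.run([features, label])
        print("label batch shape:", label_batch.shape)  # (_BATCH_SIZE, _LIST_SIZE)
```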
Hi @mulangonando : Sorry for the delay in my reply. This code looks quite similar to the scripts in ranking/examples. It would be helpful if you could share a more specific issue, preferably one that is particular to TF-Ranking. If you are facing TensorFlow-related issues, I would recommend the TensorFlow issue tracker, or StackOverflow, which covers a lot of commonly seen problems.
Hi @ramakumar1729 , I guess yes: it's TF-related. Let me close this for now and update when I finally work out a way around it. Thanks otherwise.
Hi guys,
I have been running the sparse model for the last 5 days on a GPU server and I can't see anything in my models directory, meaning it did not even get to the first checkpoint. Has anyone had experience with this?
My features are many, though (about 900K), but I still expected to be past the first checkpoint by now.
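One direction I'm considering (sketch only; the key `sparse_ids` is hypothetical and would mean rewriting my TFRecords) is packing the ids into a single categorical column instead of one `numeric_column` per id, so the feature-column graph stays small:

```python
# One categorical column over ~900K ids instead of ~900K separate
# numeric_columns; ids would be stored as an int64 list per example.
sparse_ids = tf.feature_column.categorical_column_with_identity(
    key="sparse_ids", num_buckets=900000)
sparse_embedding = tf.feature_column.embedding_column(
    sparse_ids, dimension=_EMBEDDING_DIMENSION)
```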
Any hints would help here.
Thanks.