linkedin / detext

DeText: A Deep Neural Text Understanding Framework for Ranking and Classification Tasks
BSD 2-Clause "Simplified" License

Integrate with arg_suite and minor refactoring of arg parsing #25

Closed jakiejj closed 4 years ago

jakiejj commented 4 years ago

This is a backward-incompatible change: arguments can now be parsed directly into Lists. See the example for the required changes.

Description

The source code of arg_suite is temporarily included so its benefits are available immediately. It will be removed once arg_suite is published to PyPI and can be pulled in as a dependency.

So there is no need to review the folder src/arg_suite/. The main change to review is src/detext/run_detext.py.
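As a rough illustration of the parsing change (a minimal standalone sketch using plain argparse, not the actual run_detext.py code; the option name is just an example), the old style read a comma-separated string and split it downstream, while the new style lets argparse produce a List directly:

import argparse

# Old style (assumed): the flag is a single string, split on "," later (e.g. in a utility module).
old_parser = argparse.ArgumentParser()
old_parser.add_argument('--filter_window_sizes', type=str, default='3')
old_args = old_parser.parse_args(['--filter_window_sizes', '1,2,3'])
window_sizes = [int(x) for x in old_args.filter_window_sizes.split(',')]
print(window_sizes)  # [1, 2, 3]

# New style: argparse parses the values directly into a List[int],
# which is why training scripts that pass "1,2,3" need updating.
new_parser = argparse.ArgumentParser()
new_parser.add_argument('--filter_window_sizes', type=int, nargs='*', default=[3])
new_args = new_parser.parse_args(['--filter_window_sizes', '1', '2', '3'])
print(new_args.filter_window_sizes)  # [1, 2, 3]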

arg_suite introduction

Command-Line Arguments Parsing Suite: this module is a command-line argument library.
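As a purely illustrative sketch of what such a library does (the class, helper, and field names below are hypothetical and do not reflect arg_suite's actual API), the idea is to declare arguments once as typed fields and derive the command-line interface from them:

from typing import List, NamedTuple
import argparse

class ToyArgs(NamedTuple):
    # Hypothetical fields mirroring a few DeText options, for illustration only.
    ftr_ext: str
    num_units: int = 128
    filter_window_sizes: List[int] = [3]

def build_parser(cls) -> argparse.ArgumentParser:
    # Derive one --option per typed field; List[...] fields become nargs='*' options.
    parser = argparse.ArgumentParser()
    defaults = cls._field_defaults
    for name, tp in cls.__annotations__.items():
        required = name not in defaults
        if getattr(tp, '__origin__', None) is list:
            parser.add_argument(f'--{name}', type=tp.__args__[0], nargs='*',
                                default=defaults.get(name), required=required)
        else:
            parser.add_argument(f'--{name}', type=tp,
                                default=defaults.get(name), required=required)
    return parser

args = build_parser(ToyArgs).parse_args(
    ['--ftr_ext', 'cnn', '--filter_window_sizes', '1', '2', '3'])
print(args.ftr_ext, args.num_units, args.filter_window_sizes)  # cnn 128 [1, 2, 3]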

Type of change

Breaking changes

Testing

Demo output:

python run_detext.py -h


usage: run_detext.py [-h] --ftr_ext {cnn,bert,lstm_lm,lstm} [--num_units int]
[--num_units_for_id_ftr int] [--sp_emb_size int]
[--num_hidden [int [int ...]]] [--num_wide int]
[--num_wide_sp int] [--use_deep {True,False}]
[--elem_rescale {True,False}]
[--use_doc_projection {True,False}]
[--use_usr_projection {True,False}] [--ltr_loss_fn str]
[--emb_sim_func [{inner,hadamard,concat} [{inner,hadamard,concat} ...]]]
[--num_classes int]
[--filter_window_sizes [int [int ...]]]
[--num_filters int] [--explicit_empty {True,False}]
[--lr_bert float] [--bert_config_file str]
[--bert_checkpoint str] [--unit_type {lstm}]
[--num_layers int] [--num_residual_layers int]
[--forget_bias float] [--rnn_dropout float]
[--bidirectional {True,False}]
[--normalized_lm {True,False}]
[--optimizer {sgd,adam,bert_adam,bert_lamb}]
[--max_gradient_norm float] [--learning_rate float]
[--num_train_steps int] [--num_epochs int]
[--num_warmup_steps int] [--train_batch_size int]
[--test_batch_size int] [--l1 float] [--l2 float]
[--train_file str] [--dev_file str] [--test_file str]
[--out_dir str] [--std_file str] [--max_len int]
[--min_len int] [--vocab_file str] [--we_file str]
[--we_trainable {True,False}] [--PAD str] [--SEP str]
[--CLS str] [--UNK str] [--MASK str]
[--vocab_file_for_id_ftr str] [--we_file_for_id_ftr str]
[--we_trainable_for_id_ftr {True,False}]
[--PAD_FOR_ID_FTR str] [--UNK_FOR_ID_FTR str]
[--random_seed int] [--steps_per_stats int]
[--num_eval_rounds int] [--steps_per_eval int]
[--keep_checkpoint_max int]
[--feature_names [str [str ...]]] [--lambda_metric str]
[--init_weight float] [--pmetric str]
[--all_metrics [str [str ...]]]
[--score_rescale [float [float ...]]]
[--tokenization {plain,punct}]
[--resume_training {True,False}] [--metadata_path str]
[--use_tfr_loss {True,False}]
[--tfr_loss_fn {softmax_loss,pairwise_logistic_loss}]
[--tfr_lambda_weights str] [--use_horovod {True,False}]
[--task_ids [int [int ...]]]
[--task_weights [float [float ...]]]

Args(ftr_ext, num_units, num_units_for_id_ftr, sp_emb_size, num_hidden, num_wide, num_wide_sp, use_deep, elem_rescale, use_doc_projection, use_usr_projection, ltr_loss_fn, emb_sim_func, num_classes, filter_window_sizes, num_filters, explicit_empty, lr_bert, bert_config_file, bert_checkpoint, unit_type, num_layers, num_residual_layers, forget_bias, rnn_dropout, bidirectional, normalized_lm, optimizer, max_gradient_norm, learning_rate, num_train_steps, num_epochs, num_warmup_steps, train_batch_size, test_batch_size, l1, l2, train_file, dev_file, test_file, out_dir, std_file, max_len, min_len, vocab_file, we_file, we_trainable, PAD, SEP, CLS, UNK, MASK, vocab_file_for_id_ftr, we_file_for_id_ftr, we_trainable_for_id_ftr, PAD_FOR_ID_FTR, UNK_FOR_ID_FTR, random_seed, steps_per_stats, num_eval_rounds, steps_per_eval, keep_checkpoint_max, feature_names, lambda_metric, init_weight, pmetric, all_metrics, score_rescale, tokenization, resume_training, metadata_path, use_tfr_loss, tfr_loss_fn, tfr_lambda_weights, use_horovod, task_ids, task_weights)

optional arguments:
  -h, --help  show this help message and exit
  --ftr_ext {cnn,bert,lstm_lm,lstm}  (str, required) NLP feature extraction module.
  --num_units int  (int, default: 128) word embedding size.
  --num_units_for_id_ftr int  (int, default: 128) id feature embedding size.
  --sp_emb_size int  (int, default: 1) Embedding size of sparse features
  --num_hidden [int [int ...]]  (List[int], default: 0) hidden size.
  --num_wide int  (int, default: 0) number of wide features per doc.
  --num_wide_sp int  (int, default: None) number of sparse wide features per doc
  --use_deep {True,False}  (bool, default: True) Whether to use deep features.
  --elem_rescale {True,False}  (bool, default: True) Whether to perform elementwise rescaling.
  --use_doc_projection {True,False}  (bool, default: False) whether to project multiple doc features to 1 vector.
  --use_usr_projection {True,False}  (bool, default: False) whether to project multiple usr features to 1 vector.
  --ltr_loss_fn str  (str, default: pairwise) learning-to-rank method.
  --emb_sim_func [{inner,hadamard,concat} [{inner,hadamard,concat} ...]]  (List[str], default: ['inner']) Approach to computing query/doc similarity scores
  --num_classes int  (int, default: 1) Number of classes for multi-class classification tasks.
  --filter_window_sizes [int [int ...]]  (List[int], default: 3) CNN filter window sizes.
  --num_filters int  (int, default: 100) number of CNN filters.
  --explicit_empty {True,False}  (bool, default: False) Explicitly modeling empty string in cnn
  --lr_bert float  (float, default: None) Learning rate factor for bert components
  --bert_config_file str  (str, default: None) bert config.
  --bert_checkpoint str  (str, default: None) pretrained bert model checkpoint.
  --unit_type {lstm}  (str, default: lstm) RNN cell unit type. Currently only supports lstm. Will support other cell types in the future
  --num_layers int  (int, default: 1) RNN layers
  --num_residual_layers int  (int, default: 0) Number of residual layers from top to bottom. For example, if num_layers=4 and num_residual_layers=2, the last 2 RNN cells in the returned list will be wrapped with ResidualWrapper.
  --forget_bias float  (float, default: 1.0) Forget bias of RNN cell
  --rnn_dropout float  (float, default: 0.0) Dropout of RNN cell
  --bidirectional {True,False}  (bool, default: False) Whether to use bidirectional RNN
  --normalized_lm {True,False}  (bool, default: False) Whether to use normalized lm. This option only works for unit_type=lstm_lm
  --optimizer {sgd,adam,bert_adam,bert_lamb}  (str, default: sgd) Type of optimizer to use. bert_adam is similar to the optimizer implementation in bert.
  --max_gradient_norm float  (float, default: 1.0) Clip gradients to this norm.
  --learning_rate float  (float, default: 1.0) Learning rate. Adam: 0.001 | 0.0001
  --num_train_steps int  (int, default: 1) Num steps to train.
  --num_epochs int  (int, default: None) Num of epochs to train, will overwrite train_steps if set
  --num_warmup_steps int  (int, default: 0) Num steps for warmup.
  --train_batch_size int  (int, default: 32) Training data batch size.
  --test_batch_size int  (int, default: 32) Test data batch size.
  --l1 float  (float, default: None) Scale of L1 regularization
  --l2 float  (float, default: None) Scale of L2 regularization
  --train_file str  (str, default: None) Train file.
  --dev_file str  (str, default: None) Dev file.
  --test_file str  (str, default: None) Test file.
  --out_dir str  (str, default: None) Store log/model files.
  --std_file str  (str, default: None) feature standardization file
  --max_len int  (int, default: 32) max sent length.
  --min_len int  (int, default: 3) min sent length.
  --vocab_file str  (str, default: None) Vocab file
  --we_file str  (str, default: None) Pretrained word embedding file
  --we_trainable {True,False}  (bool, default: True) Whether to train word embedding
  --PAD str  (str, default: [PAD]) Token for padding
  --SEP str  (str, default: [SEP]) Token for sentence separation
  --CLS str  (str, default: [CLS]) Token for start of sentence
  --UNK str  (str, default: [UNK]) Token for unknown word
  --MASK str  (str, default: [MASK]) Token for masked word
  --vocab_file_for_id_ftr str  (str, default: None) Vocab file for id features
  --we_file_for_id_ftr str  (str, default: None) Pretrained word embedding file for id features
  --we_trainable_for_id_ftr {True,False}  (bool, default: True) Whether to train word embedding for id features
  --PAD_FOR_ID_FTR str  (str, default: [PAD]) Padding token for id features
  --UNK_FOR_ID_FTR str  (str, default: [UNK]) Unknown word token for id features
  --random_seed int  (int, default: 1234) Random seed (>0, set a specific seed).
  --steps_per_stats int  (int, default: 100) training steps to print statistics.
  --num_eval_rounds int  (int, default: None) number of evaluation rounds; this param will override steps_per_eval as max(1, num_train_steps / num_eval_rounds)
  --steps_per_eval int  (int, default: 1000) training steps to evaluate datasets.
  --keep_checkpoint_max int  (int, default: 5) The maximum number of recent checkpoint files to keep. If 0, all checkpoint files are kept. Defaults to 5
  --feature_names [str [str ...]]  (List[str], default: None) the feature names.
  --lambda_metric str  (str, default: None) only supports ndcg.
  --init_weight float  (float, default: 0.1) weight initialization value.
  --pmetric str  (str, default: None) Primary metric.
  --all_metrics [str [str ...]]  (List[str], default: None) All metrics.
  --score_rescale [float [float ...]]  (List[float], default: None) The mean and std of the previous model. For score rescaling, score_rescale has the xgboost mean and std.
  --tokenization {plain,punct}  (str, default: punct) The tokenization performed for data preprocessing. Currently supports: punct/plain (no split). Note that this should be set correctly to ensure consistency for the savedmodel.
  --resume_training {True,False}  (bool, default: False) Whether to resume training from checkpoint in out_dir.
  --metadata_path str  (str, default: None) The metadata_path for converted avro2tf avro data.
  --use_tfr_loss {True,False}  (bool, default: False) whether to use tf-ranking loss.
  --tfr_loss_fn {softmax_loss,pairwise_logistic_loss}  (str, default: softmax_loss) tf-ranking loss
  --tfr_lambda_weights str  (str, default: None)
  --use_horovod {True,False}  (bool, default: False) whether to use horovod for sync distributed training
  --task_ids [int [int ...]]  (List[int], default: None) All types of task IDs for multitask training. E.g., 1,2,3
  --task_weights [float [float ...]]  (List[float], default: None) Weights for each task specified in task_ids. E.g., 0.5,0.3,0.2
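One detail in the options above worth calling out: when num_eval_rounds is set, it overrides steps_per_eval as max(1, num_train_steps / num_eval_rounds). A small sketch of that relationship (the function name is illustrative, not taken from the DeText source):

def effective_steps_per_eval(num_train_steps, steps_per_eval, num_eval_rounds=None):
    # Mirrors the help text: num_eval_rounds, when set, overrides steps_per_eval.
    if num_eval_rounds is not None:
        return max(1, num_train_steps // num_eval_rounds)
    return steps_per_eval

print(effective_steps_per_eval(10000, 1000, num_eval_rounds=8))  # 1250: evaluate every 1250 steps
print(effective_steps_per_eval(10000, 1000))                     # 1000: steps_per_eval used as-is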

jakiejj commented 4 years ago

Thanks a lot for refactoring the code base! A general comment about the arg parsing: is it difficult or possible to make it compatible with the previous format? If not, all the training scripts will need to be updated.

It certainly could be made compatible, but in principle all parsing should be done in the same place for cohesion, rather than, e.g., splitting on "," in a utility function outside the parser. Changes are less likely to break things when the parsing logic is localized.

I'd even suggest further reducing the logic in src/detext/utils/misc_utils.py and moving the argument processing logic into the parser.

Adding one line would revert to the old behavior: _feature_names = {'type': lambda s: s.split(',') if ',' in s else s, 'nargs': None}. This would forgo the native List parsing from argparse, but it would still bring the logic closer to where it belongs.
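For concreteness, a minimal standalone sketch of how that one-liner behaves when wired into argparse (the option and values are examples only): with nargs=None the flag takes a single token, and the type callable splits it on "," so the old comma-separated style still yields a list.

import argparse

parser = argparse.ArgumentParser()
# The suggested one-line revert, applied directly to add_argument:
parser.add_argument('--feature_names',
                    type=lambda s: s.split(',') if ',' in s else s,
                    nargs=None)

print(parser.parse_args(['--feature_names', 'query,doc_title']).feature_names)  # ['query', 'doc_title']
print(parser.parse_args(['--feature_names', 'query']).feature_names)            # 'query'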

How much downstream work would there be if this breaks LI internal usage? (External usage might be OK, since the project is still pretty early.)

jakiejj commented 4 years ago

Reverted a few commonly used List and bool parsing behaviors, and reverted one sh example to the "=" style to cover more cases.

guoweiwei commented 4 years ago

It seems to me that only the maximum window size is used. How about we change filter_window_sizes (string) to filter_window_size (int) and get rid of all the list/string/int conversion? Correct me if I missed a corner case that blocks this suggestion :)