stevezheng23 / xlnet_extension_tf

XLNet Extension in TensorFlow
Apache License 2.0

How do I run NER on other languages like Chinese? #31

Open SuMarsss opened 5 years ago

SuMarsss commented 5 years ago

I have pretrained XLNet on a large Chinese corpus, but how do I run ner.py, and what is label.vocab? Here are the parameters I used to train the SentencePiece model:

spm_train \
    --input=data/wiki_all.txt \
    --model_prefix=sp10m.cased.v3 \
    --vocab_size=32000 \
    --character_coverage=0.9995 \
    --model_type=char \
    --control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
    --user_defined_symbols='<eop>,。' \
    --shuffle_input_sentence \
    --input_sentence_size=10000000

This is my pretraining result:

I0708 01:51:08.929747 140337454118720 train_gpu.py:300] [99500] | gnorm 5.37 lr 0.000000 | loss 2.08 | pplx    8.01, bpc  3.0017
I0708 01:52:52.577970 140337454118720 train_gpu.py:300] [99600] | gnorm 4.98 lr 0.000000 | loss 2.03 | pplx    7.60, bpc  2.9265
I0708 01:54:36.169189 140337454118720 train_gpu.py:300] [99700] | gnorm 5.21 lr 0.000000 | loss 2.04 | pplx    7.73, bpc  2.9500
I0708 01:56:19.727979 140337454118720 train_gpu.py:300] [99800] | gnorm 5.06 lr 0.000000 | loss 2.05 | pplx    7.79, bpc  2.9625
I0708 01:58:03.187680 140337454118720 train_gpu.py:300] [99900] | gnorm 5.06 lr 0.000000 | loss 2.01 | pplx    7.47, bpc  2.9009
I0708 01:59:46.560450 140337454118720 train_gpu.py:300] [100000] | gnorm 5.51 lr 0.000000 | loss 2.00 | pplx    7.38, bpc  2.8840
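
The pplx and bpc columns in this log appear to follow directly from the loss column: perplexity as exp(loss), and bits as loss / ln 2. The sketch below checks that relationship against two of the logged rows. The function names are made up for illustration, and small mismatches are expected because the log prints the loss rounded to two decimals.

```python
import math

def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)

def bits_per_token(loss):
    """Convert a loss in nats to bits (the log labels this column bpc)."""
    return loss / math.log(2)

# (step, loss, pplx, bpc) copied from the log above.
logged = [
    (99500, 2.08, 8.01, 3.0017),
    (100000, 2.00, 7.38, 2.8840),
]

for step, loss, pplx, bpc in logged:
    # Print the recomputed values next to the logged ones for comparison.
    print(step, round(perplexity(loss), 2), pplx, round(bits_per_token(loss), 4), bpc)
```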

So should label.vocab look like this?

<cls>
<sep>
<pad>
<mask>
<eod>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
O
stevezheng23 commented 5 years ago

@SuMarsss Great to see that you have trained a Chinese XLNet model and built your own SentencePiece model.

To prepare your label.vocab (which is different from your SentencePiece control_symbols), you can use the following:

<pad>
O
X
<cls>
<sep>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
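
If the entity types change, a label.vocab with this layout can be generated instead of typed by hand. This is a minimal sketch, assuming the file holds one label per line in the order shown above (the special labels first, then a B-/I- pair per entity type). The helper name `build_label_vocab` is hypothetical and not part of the repo.

```python
# Special labels in the order used in the list above.
SPECIAL_LABELS = ["<pad>", "O", "X", "<cls>", "<sep>"]

def build_label_vocab(entity_types):
    """Return the special labels followed by a B-/I- pair per entity type."""
    labels = list(SPECIAL_LABELS)
    for entity in entity_types:
        labels.append("B-" + entity)
        labels.append("I-" + entity)
    return labels

entity_types = ["AnatomyPart", "Diagnosis", "Drug", "Lab", "Procedure", "Radiology"]
label_vocab = build_label_vocab(entity_types)

# One label per line, ready to be written to a file such as label.vocab.
print("\n".join(label_vocab))
```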
stevezheng23 commented 5 years ago

You should also make sure that special_vocab_list in run_ner.py aligns with your SentencePiece control_symbols: self.special_vocab_list = ["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"]
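
One way to check this alignment is to compare the list against the leading entries of the .vocab file that spm_train writes next to the model. This is a rough sketch, assuming that file has one piece per line followed by a tab and a score. `leading_pieces` and `missing_specials` are illustrative helpers, not functions from run_ner.py, and the sample lines are a shortened stand-in for a real vocab file.

```python
def leading_pieces(vocab_lines, count):
    """Return the first `count` pieces from SentencePiece .vocab lines."""
    pieces = []
    for line in vocab_lines:
        if not line.strip():
            continue  # skip blank lines
        pieces.append(line.split("\t")[0])
        if len(pieces) == count:
            break
    return pieces

def missing_specials(special_vocab_list, vocab_lines):
    """Special tokens expected by the script but absent from the vocab."""
    known = {line.split("\t")[0] for line in vocab_lines if line.strip()}
    return [tok for tok in special_vocab_list if tok not in known]

special_vocab_list = ["<unk>", "<s>", "</s>", "<cls>", "<sep>",
                      "<pad>", "<mask>", "<eod>", "<eop>"]

# Stand-in for the lines of a .vocab file (piece, tab, score).
sample_vocab = ["<unk>\t0", "<s>\t0", "</s>\t0", "<cls>\t0", "<sep>\t0",
                "<pad>\t0", "<mask>\t0", "<eod>\t0", "<eop>\t0", "的\t-3.76215"]

print(leading_pieces(sample_vocab, len(special_vocab_list)) == special_vocab_list)
print(missing_specials(special_vocab_list, sample_vocab))
```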

SuMarsss commented 5 years ago

special_vocab_list

When I tried the label.vocab you suggested, another error occurred.

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values [[node VerifyFinite/CheckNumerics (defined at xlnet/model_utils.py:147) ]] [[node replica_1/loss/truediv (defined at run_ner.py:608) ]]

xlnet/model_utils.py:147: clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)

run_ner.py:608: loss = tf.reduce_sum(cross_entropy * label_mask) / tf.reduce_sum(tf.reduce_max(label_mask, axis=-1))
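
One way an expression of this shape can produce NaN, offered as a guess and not as a diagnosis of this report: if every entry of label_mask in a batch is zero, both the numerator and the denominator are zero, and TensorFlow evaluates 0/0 as NaN. The plain-Python sketch below mirrors the formula on nested lists and adds a guard on the denominator. `masked_loss` and `eps` are illustrative names, not code from run_ner.py.

```python
def masked_loss(cross_entropy, label_mask, eps=1e-12):
    """Sum of masked token losses divided by the number of sequences
    that contain at least one unmasked token (guarded against zero)."""
    total = sum(
        ce * m
        for ce_row, mask_row in zip(cross_entropy, label_mask)
        for ce, m in zip(ce_row, mask_row)
    )
    # Mirrors reduce_sum(reduce_max(label_mask, axis=-1)).
    active_sequences = sum(max(mask_row) for mask_row in label_mask)
    return total / max(active_sequences, eps)

cross_entropy = [[0.5, 1.5, 2.0], [1.0, 1.0, 3.0]]
label_mask = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(masked_loss(cross_entropy, label_mask))

# A batch with no labelled tokens: without the guard this would be 0/0.
all_masked = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(masked_loss(cross_entropy, all_masked))
```

Printing the number of non-zero label_mask entries per batch is a cheap way to rule this case in or out before looking at learning rate or clipping settings.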

stevezheng23 commented 5 years ago

This looks like an exploding-gradient issue. Could you provide more details (e.g. the full vocab list, hyperparameters, SentencePiece model, etc.) for debugging?

SuMarsss commented 5 years ago

I have fixed the bug, but now I want to output the F1 score and precision.

stevezheng23 commented 5 years ago

@SuMarsss, you can run the following commands to get the precision/recall/F1 scores:

python tool/convert_token.py \
--input_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.json \
--output_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt

python tool/eval_token.py \
< ${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt \
> ${OUTPUTDIR}/data/predict.${PREDICTTAG}.token
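
For reference, entity-level precision, recall, and F1 over BIO tags can be computed along the following lines. This is a simplified sketch and not the logic of tool/eval_token.py, whose implementation is not shown in this thread. `extract_spans` and `prf1` are hypothetical helpers, and an I- tag that does not continue a matching entity is treated as the start of a new one.

```python
def extract_spans(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if kind is not None and not continues:
            spans.add((start, i, kind))  # end index is exclusive
            start, kind = None, None
        if prefix in ("B", "I") and not continues:
            start, kind = i, label
    return spans

def prf1(gold_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entity spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["B-Drug", "I-Drug", "O", "B-Lab", "O"]
pred = ["B-Drug", "I-Drug", "O", "O", "B-Lab"]
print(prf1(gold, pred))
```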
SuMarsss commented 5 years ago

Sorry, I thought I had fixed the exploding-gradient issue, but it occurred again:

2019-07-11 10:06:26.659641: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f65eb46c500 = {1, 0} Found Inf or NaN global norm.

I think there is a problem with my SentencePiece model or my Chinese tokenizer. Here is my tokenized result: [image]

I think the result "缘" "于" is wrong: it splits "_" and "缘" into separate pieces, and the correct result may be "_缘" "_于", because the English tokenized result is "_EU" "_reject". [image]

Lastly, I don't know how to share the full vocab list, which is too large a text file, or the SentencePiece model, which is a binary file. I can only provide a sample like this.

sample of all vocab list:


<unk>   0
<s>     0
</s>    0
<cls>   0
<sep>   0
<pad>   0
<mask>  0
<eod>   0
<eop>   0
。      0
,       -3.29251
▁       -3.45567
的      -3.76215
1       -4.30766
0       -4.54219
年      -4.64991
2       -4.74569
、      -4.8037
一      -4.90536
在      -4.91364
为      -4.94451
是      -5.03084
中      -5.04317
9       -5.05516
国      -5.06382
)       -5.0947
(       -5.09492
人      -5.09874
于      -5.26198
stevezheng23 commented 5 years ago

@SuMarsss, yes, I think it should be _于 instead of _ and 于.

I have never trained a Chinese SentencePiece model before; maybe you can refer to this post for more insight.
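
A quick way to spot this pattern in tokenizer output is to look for pieces that consist of the whitespace marker alone. The sketch below works on a plain list of piece strings and does not call SentencePiece itself. `standalone_marker_positions` is an illustrative helper, and the two piece lists are made-up examples modelled on the ones discussed above.

```python
WHITESPACE_MARKER = "\u2581"  # the "▁" character SentencePiece uses for spaces

def standalone_marker_positions(pieces):
    """Indices of pieces that are only the whitespace marker."""
    return [i for i, piece in enumerate(pieces) if piece == WHITESPACE_MARKER]

# Marker attached to the following text, as in the English example.
english_like = ["\u2581EU", "\u2581reject"]
# Marker emitted as its own piece, as described for the Chinese output.
char_level = ["\u2581", "缘", "\u2581", "于"]

print(standalone_marker_positions(english_like))
print(standalone_marker_positions(char_level))
```

If the second pattern shows up, the number of pieces no longer matches the number of labelled characters, so it is worth checking how labels are assigned to those extra pieces.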

charlesXu86 commented 5 years ago

How did you fix this problem?

stevezheng23 commented 5 years ago

@charlesXu86 Actually, I couldn't reproduce this issue, so I have no clue how to resolve it.

youbingchenyoubing commented 5 years ago

Has this issue been fixed yet? I ran into the same problem.

stevezheng23 commented 5 years ago

@youbingchenyoubing No fix has been applied yet, since I couldn't reproduce this issue. Could you provide more details about your problem?

youbingchenyoubing commented 5 years ago

  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/home/chenyoubing/nlp/resume_entity/entity_model/build_model/xlnet_model.py", line 135, in model_fn
    train_op, _, _ = model_utils.get_train_op(self.args, loss)
  File "/home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py", line 147, in get_train_op
    clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values [[node VerifyFinite/CheckNumerics (defined at /home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py:147) ]]

stevezheng23 commented 5 years ago

@youbingchenyoubing Sorry, based on the error message, I can't figure out how run_ner.py is used by your pipeline. BTW, which dataset does this experiment run with? English or Chinese?

youbingchenyoubing commented 5 years ago

I used a Chinese resume NER dataset in my experiment.

youbingchenyoubing commented 5 years ago

Can XLNet support a non-fixed (variable-length) context?

stevezheng23 commented 5 years ago

@SuMarsss / @charlesXu86 / @youbingchenyoubing, sorry, I still can't reproduce this issue on the CoNLL2003 dataset, and I don't plan to support Chinese NER in the near future.

youbingchenyoubing commented 5 years ago

Awesome, thanks.