kamalkraj / ALBERT-TF2.0

ALBERT model Pretraining and Fine Tuning using TF2.0
Apache License 2.0

do_predict? #3

Closed: acmilannesta closed this issue 4 years ago

acmilannesta commented 4 years ago

Hi,

Can the script do prediction? I may have missed it, but I didn't see a "do_pred" flag.

kamalkraj commented 4 years ago

Hi @acmilannesta, do_predict added.

acmilannesta commented 4 years ago

Great, thx so much!

But do you also need to add something like prediction data output in "create_finetuning_data.py"?

kamalkraj commented 4 years ago

@acmilannesta For running predict, add these additional flags:

    --do_predict --predict_data_path=${OUTPUT_DIR}/${TASK_NAME}_predict.tf_record --input_data_dir=${GLUE_DIR}/

Full command:

    python run_classifer.py \
        --train_data_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
        --eval_data_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
        --input_meta_data_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
        --albert_config_file=${ALBERT_DIR}/config.json \
        --task_name=${TASK_NAME} \
        --spm_model_file=${ALBERT_DIR}/vocab/30k-clean.model \
        --output_dir=${MODEL_DIR} \
        --init_checkpoint=${ALBERT_DIR}/tf2_model.h5 \
        --do_train \
        --do_eval \
        --train_batch_size=128 \
        --learning_rate=1e-5 \
        --custom_training_loop \
        --do_predict \
        --predict_data_path=${OUTPUT_DIR}/${TASK_NAME}_predict.tf_record \
        --input_data_dir=${GLUE_DIR}/

acmilannesta commented 4 years ago

    Epoch 1/3
    7698/7698 [==============================] - 9878s 1s/step - loss: 0.8664 - accuracy: 0.7629 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
    Epoch 2/3
    2445/1925 [======================================] - 2886s 1s/step - loss: 0.5667 - accuracy: 0.8351

Got it. But now I've run into another issue: it seems the second training epoch is using my validation set. My training set has 7698 batches and my validation set has 1925 batches.

kamalkraj commented 4 years ago

The second epoch won't use the validation dataset as the training set, even though it looks that way. The Model.fit API has known step-counting issues when a tf.data dataset is passed as input: only after the first epoch does the model know the full dataset size, so the progress-bar totals are misleading. You can ignore it.

You can also train using --custom_training_loop
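
For context, here is a minimal, generic sketch (not this repo's code) of the Keras-side workaround: when the batch counts of a tf.data dataset are known up front, passing steps_per_epoch and validation_steps to Model.fit keeps the progress-bar totals correct from the first epoch. The tiny random dataset and one-layer model are placeholders.

    import numpy as np
    import tensorflow as tf

    # Toy stand-ins for the real datasets: 128 training and 32 validation examples.
    train_ds = tf.data.Dataset.from_tensor_slices(
        (np.random.rand(128, 10).astype("float32"),
         np.random.randint(0, 2, 128))
    ).batch(16)
    eval_ds = tf.data.Dataset.from_tensor_slices(
        (np.random.rand(32, 10).astype("float32"),
         np.random.randint(0, 2, 32))
    ).batch(16)

    # Batch counts are known before training starts, so the progress bar shows the
    # right totals from epoch 1 instead of "learning" the size mid-run.
    train_steps = int(tf.data.experimental.cardinality(train_ds).numpy())
    eval_steps = int(tf.data.experimental.cardinality(eval_ds).numpy())

    model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation="softmax")])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(train_ds,
              validation_data=eval_ds,
              epochs=2,
              steps_per_epoch=train_steps,
              validation_steps=eval_steps)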

acmilannesta commented 4 years ago

Thank you! I've just run my model on an AWS g4dn.xlarge (Tesla T4 GPU) VM. However, it took almost 3 times longer than running the bert-base model. Here is my code:

    import os

    # Fine-tune ALBERT base on CoLA, then run eval and predict.
    os.system(
        '/home/ubuntu/py3env/bin/python3.6 ALBERT-TF2.0/run_classifer.py \
            --train_data_path=input/COLA_train.tf_record \
            --eval_data_path=input/COLA_eval.tf_record \
            --predict_data_path=input/COLA_predict.tf_record \
            --input_data_dir=input/CoLA \
            --input_meta_data_path=input/COLA_meta_data \
            --albert_config_file=base_2/config.json \
            --task_name=CoLA \
            --spm_model_file=base_2/vocab/30k-clean.model \
            --output_dir=output \
            --init_checkpoint=base_2/tf2_model.h5 \
            --do_train \
            --do_eval \
            --do_predict \
            --train_batch_size=16 \
            --eval_batch_size=16 \
            --learning_rate=5e-5 \
            --max_seq_length=142 \
            --num_train_epochs=3 \
            --custom_training_loop'
    )

kamalkraj commented 4 years ago

@acmilannesta How much time for BERT and for ALBERT?

acmilannesta commented 4 years ago

About 170-175 ms/step on BERT, 480-490 ms/step on ALBERT.

kamalkraj commented 4 years ago

Which implementation of BERT are you using? Is the batch size the same in the BERT and ALBERT training?

acmilannesta commented 4 years ago

I used a package called keras_bert (https://github.com/CyberZHG/keras-bert), which basically loads the checkpoint into a Keras model.

Yes, the batch size is the same: 16.
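
For reference, the keras_bert path mentioned above looks roughly like this; a sketch only, with hypothetical checkpoint paths for a downloaded BERT-Base model, not code from either repo:

    from keras_bert import load_trained_model_from_checkpoint

    # Hypothetical paths to a downloaded BERT-Base checkpoint.
    config_path = 'uncased_L-12_H-768_A-12/bert_config.json'
    checkpoint_path = 'uncased_L-12_H-768_A-12/bert_model.ckpt'

    # training=False drops the MLM/NSP pretraining heads so a task head can be
    # added on top; seq_len trims the position embeddings to the chosen length.
    bert = load_trained_model_from_checkpoint(
        config_path,
        checkpoint_path,
        training=False,
        seq_len=142,
    )
    bert.summary()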

kamalkraj commented 4 years ago

Which TensorFlow version are you using for keras_bert?

acmilannesta commented 4 years ago

Yes.

kamalkraj commented 4 years ago

TensorFlow 1.x or 2.0?

acmilannesta commented 4 years ago

I think both work, but I'm running on 1.x.

kamalkraj commented 4 years ago

TensorFlow 2 is slower than 1.x: https://github.com/tensorflow/tensorflow/issues/33487

kamalkraj commented 4 years ago

TF 2.0 implementation of BERT: https://github.com/tensorflow/models/tree/master/official/nlp/bert

acmilannesta commented 4 years ago

I see. Let me try this: https://github.com/google-research/google-research/tree/master/albert. It seems to work on TensorFlow 1.x.

kamalkraj commented 4 years ago

The above one runs on TensorFlow 1.15

acmilannesta commented 4 years ago

I just tried. About 2.4 steps/second (416 ms/step).

acmilannesta commented 4 years ago

https://github.com/tensorflow/tensorflow/issues/33487#issuecomment-543652394

Based on this comment, would you suggest turning off eager execution in TF 2.0 to run your code?
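
For reference, "turning off eager" in TF 2.0 generally means something like the sketch below (a generic illustration, not an option exposed by this repo's scripts):

    import tensorflow as tf

    # Switching TF 2.0 back to graph (TF1-style) execution has to happen before
    # any ops or models are created.
    tf.compat.v1.disable_eager_execution()

    print(tf.executing_eagerly())  # False once eager execution is disabled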

kamalkraj commented 4 years ago

@acmilannesta Try and let me know the results

acmilannesta commented 4 years ago

Unfortunately, it doesn't allow disabling eager execution in TF 2.0.
