microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Pretraining on new dataset #512

Open sanakhamekhem opened 2 years ago

sanakhamekhem commented 2 years ago

Dear Sir, I would like to pretrain the model on my own dataset, such as the IAM database. Should I use the following command line?

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_beit_pretraining.py \
  --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
  --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
  --batch_size 128 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
  --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1
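
For context, the three placeholders in that command have to be set before launching. A minimal sketch, assuming a plain folder of IAM training images for --data_path and the downloaded discrete VAE (image tokenizer) weights for --discrete_vae_weight_path; all paths below are illustrative, not prescribed by the repo:

# Illustrative values only; adjust to your own setup.
DATA_PATH=/data/IAM/train_images     # folder of training images used for masked image modeling
OUTPUT_DIR=./checkpoints/beit_iam    # where pretraining checkpoints and logs are written
TOKENIZER_PATH=./dalle_tokenizer     # directory holding the discrete VAE (image tokenizer) weights
export DATA_PATH OUTPUT_DIR TOKENIZER_PATH
# then run the run_beit_pretraining.py command above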

Then, I will fine-tune the model using the command line proposed in this repo (with --arch beit_large_decoder_large, or beit_base_decoder_large for the base model):

CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.launch --nproc_per_node=1 \
  $(which fairseq-train) \
  --data-type STR --user-dir ./ --task text_recognition \
  --arch beit_large_decoder_large \
  --seed 1111 --optimizer adam --lr 2e-05 --bpe gpt2 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-8 --warmup-updates 500 --weight-decay 0.0001 --log-format tqdm \
  --log-interval 10 --batch-size ${BSZ} --batch-size-valid ${valid_BSZ} --save-dir ${SAVE_PATH} \
  --tensorboard-logdir ${LOG_DIR} --max-epoch 300 --patience 20 --ddp-backend legacy_ddp \
  --num-workers 8 --preprocess DA2 --decoder-pretrained roberta --update-freq 1 \
  --finetune-from-model $pretrainedPathlarge --fp16 \
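
Likewise, a hedged sketch of the shell variables this fine-tuning command references; the values are placeholders chosen for illustration, not recommendations from the repo:

# Placeholder values; tune batch sizes and paths for your own hardware and data.
BSZ=8                                  # training batch size
valid_BSZ=16                           # validation batch size
SAVE_PATH=./checkpoints/trocr_iam_ft   # where fine-tuned checkpoints are saved
LOG_DIR=./logs/trocr_iam_ft            # tensorboard log directory
pretrainedPathlarge=/path/to/pretrained_large_checkpoint.pt   # checkpoint passed to --finetune-from-model
export BSZ valid_BSZ SAVE_PATH LOG_DIR pretrainedPathlarge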

Thank you in advance.

619268640 commented 2 years ago

@sanakhamekhem, how could I pretrain the model on custom data (in another language)? Looking forward to your reply, thank you.

nissansz commented 2 years ago

Hi, where can I download pretrained models for Japanese, Korean, etc.? steve8000818@gmail.com