berkuva opened this issue 2 years ago [Open]
[NOT SOLVED YET] @jerryji1993 and @Zhihan1996
Here's how to create vocab.txt for k-mers with k=10. For a different k, change the repeat value in the code below.
# source: https://stackoverflow.com/a/38202625
import itertools

bases = 'ACTG'
# Enumerate all 4^10 k-mers of length 10 over the DNA alphabet.
vocabs = ["".join(seq) for seq in itertools.product(bases, repeat=10)]
with open('PATH_TO_vocab.txt', 'w') as f:
    for vocab in vocabs:
        f.write(vocab)
        f.write('\n')
Then add the following to it.
[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
For k=10, vocab size is 4^10+5 = 1048581.
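As a quick sanity check, a small sketch (assuming the same placeholder path as above) to confirm the finished vocab.txt has the expected 4^k + 5 lines and contains all five special tokens:

# Sanity check on the finished vocab file; PATH_TO_vocab.txt is the placeholder used above.
k = 10
with open('PATH_TO_vocab.txt') as f:
    tokens = [line.strip() for line in f if line.strip()]

assert len(tokens) == 4 ** k + 5, f"expected {4 ** k + 5} tokens, got {len(tokens)}"
assert len(tokens) == len(set(tokens)), "duplicate tokens found"
for special in ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"):
    assert special in tokens, f"missing special token {special}"
print(len(tokens), "tokens - vocab.txt looks consistent")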
Besides vocab.txt, I created a bert-config-10 folder in DNABERT/src/transformers/dnabert-config/ and put config.json, special_tokens_map.json, and tokenizer_config.json inside it, along with vocab.txt.
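Before launching pretraining, a quick sketch to check that the new folder actually contains all four files (the repo path is an assumption, adjust it to your checkout):

from pathlib import Path

# Assumed location of the new folder; adjust to your own checkout.
config_dir = Path("DNABERT/src/transformers/dnabert-config/bert-config-10")
expected = ["config.json", "special_tokens_map.json", "tokenizer_config.json", "vocab.txt"]

missing = [name for name in expected if not (config_dir / name).is_file()]
print("missing files:", missing or "none")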
My sequence length is 20000, so I:

- edited vocab_size=1048581 and max_position_embeddings=20000 in config.json,
- edited max_len=20000 in tokenizer_config.json (a rough sketch of both JSON edits follows the dict below), and
- edited VOCAB_KMER at line 54 of tokenization_dna.py:
VOCAB_KMER = {
    "69": "3",
    "261": "4",
    "1029": "5",
    "4101": "6",
    "1048581": "10",
}
From #39:

Please use the tag --model_type=dnalong and set the --block_size as a multiple of 512. The DNABERT and DNABERT-XL use the same checkpoint (parameters).
Neither dnalongcat nor dnalong works for --model_type, so I use --model_type=dna below. Note --block_size 20480, which I got by multiplying 512 by 40 (the first multiple of 512 greater than my sequence length of 20000).
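The block-size arithmetic in two lines, for clarity:

import math

seq_len, step = 20000, 512
block_size = math.ceil(seq_len / step) * step  # smallest multiple of 512 >= 20000
print(block_size)  # 20480 (= 512 * 40)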
cd examples

export KMER=10
export TRAIN_FILE=PATH_TO_MY_10MER_FILE.txt
export TEST_FILE=PATH_TO_MY_10MER_FILE.txt
export SOURCE=PATH_TO_DNABERT
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=PATH_TO_vocab.txt \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 20480 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24
Error: KeyError: '10'
============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' not found in model shortcut name list (dna3, dna4, dna5, dna6, dna10). Assuming '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' is a path, a model identifier, or url to a directory containing tokenizer files.
08/31/2022 15:33:27 - WARNING - transformers.tokenization_utils - Calling DNATokenizer.from_pretrained() with the path to a single file or url is deprecated
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - loading file /Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt
08/31/2022 15:33:28 - INFO - main - Training new model from scratch
08/31/2022 15:34:12 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-06, beta1=0.9, beta2=0.98, block_size=20480, cache_dir=None, config_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json', device=device(type='cpu'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=25, learning_rate=0.0004, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=200000, mlm=True, mlm_probability=0.025, model_name_or_path=None, model_type='dna', n_gpu=0, n_process=24, no_cuda=False, num_train_epochs=1.0, output_dir='output10', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=6, per_gpu_train_batch_size=10, save_steps=500, save_total_limit=20, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt', train_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', warmup_steps=10000, weight_decay=0.01)
08/31/2022 15:34:12 - INFO - main - Loading features from cached file /Users/hyunjaecho/Desktop/code/unencoded/dna_cached_lm_20480_chrY.txt
08/31/2022 15:34:12 - INFO - main - Running training
08/31/2022 15:34:12 - INFO - main - Num examples = 555
08/31/2022 15:34:12 - INFO - main - Num Epochs = 100001
08/31/2022 15:34:12 - INFO - main - Instantaneous batch size per GPU = 10
08/31/2022 15:34:12 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 250
08/31/2022 15:34:12 - INFO - main - Gradient Accumulation steps = 25
08/31/2022 15:34:12 - INFO - main - Total optimization steps = 200000
Iteration: 0%| | 0/56 [00:00<?, ?it/s]
Epoch: 0%| | 0/100001 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_pretrain.py", line 888, in <module>
    main()
  File "run_pretrain.py", line 838, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_pretrain.py", line 421, in train
    inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
  File "run_pretrain.py", line 254, in mask_tokens
    mask_list = MASK_LIST[tokenizer.kmer]
KeyError: '10'
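For reference, the KeyError comes from the MASK_LIST lookup shown in the traceback: mask_tokens() in run_pretrain.py indexes MASK_LIST with tokenizer.kmer, and that dictionary apparently only defines the keys "3" through "6" (mirroring the original VOCAB_KMER). A rough illustration of the failing lookup, with placeholder offset values (the real offsets live in the DNABERT source, not here):

# Illustration only: placeholder structure of MASK_LIST in run_pretrain.py.
# Keys are k-mer sizes as strings; values are neighbouring-position offsets
# used when masking overlapping k-mers (the actual values are in the repo).
MASK_LIST = {
    "3": [-1, 1],
    "4": [-1, 1, 2],
    "5": [-2, -1, 1, 2],
    "6": [-2, -1, 1, 2, 3],
}

kmer = "10"                  # tokenizer.kmer after the VOCAB_KMER edit above
mask_list = MASK_LIST[kmer]  # raises KeyError: '10' since there is no "10" entry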
I want to use kmer=26; how can I prepare the vocab.txt file for that? I tried the above code for it, but the process gets killed.
I also want to pretrain a model for kmer=9 but am unable to do so. Please help... @jerryji1993
I am trying to train the model on my own data, which consists of 10-mers, running part 2.2 with the command shown above.
I am getting the error shown above (KeyError: '10').
I believe this could be because k=10 is making vocab.txt unable to be found (#55). How do I train the model from scratch using a k other than 3, 4, 5, or 6? Do we create our own vocab.txt file? Thanks.