jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
https://doi.org/10.1093/bioinformatics/btab083
Apache License 2.0

training with different k-mer #81

Open berkuva opened 2 years ago

berkuva commented 2 years ago

I am trying to train the model on my own data, which consists of 10-mers. Running part 2.2 with the following command:

cd examples

export KMER=10
export TRAIN_FILE=PATH_TO_MY_10MER_FILE.txt
export TEST_FILE=PATH_TO_MY_10MER_FILE.txt
export SOURCE=PATH_TO_DNABERT_REPO
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=dna$KMER \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24

I am getting the following error:

OSError: Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json' was not found in model name list. We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert//Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json/config.json' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

I believe this could be because, with k=10, the expected vocab.txt cannot be found (#55). How do I train the model from scratch with a k other than 3, 4, 5, or 6? Do I need to create my own vocab.txt file? Thanks.

berkuva commented 2 years ago

[NOT SOLVED YET] @jerryji1993 and @Zhihan1996

1. Created vocab.txt for the chosen k-mer size (for me, k=10).

Here's how to create vocab.txt for k=10; change repeat=10 below for a different k.

# source: https://stackoverflow.com/a/38202625
import itertools

bases = 'ACTG'

# enumerate all 4**10 possible 10-mers, one per line
vocabs = ["".join(seq) for seq in itertools.product(bases, repeat=10)]

with open('PATH_TO_vocab.txt', 'w') as f:
    for vocab in vocabs:
        f.write(vocab)
        f.write('\n')

Then add the following special tokens to it:

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]

For k=10, vocab size is 4^10+5 = 1048581.
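A quick sanity check (same placeholder path as above) that the finished file has the expected number of lines:

# Should print 1048581 and True once the 5 special tokens have been added.
with open('PATH_TO_vocab.txt') as f:
    n_lines = sum(1 for _ in f)
print(n_lines, n_lines == 4**10 + 5)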

2. bert-config-10 folder

Besides vocab.txt, I created a bert-config-10 folder in DNABERT/src/transformers/dnabert-config/ and placed config.json, special_tokens_map.json, and tokenizer_config.json inside it along with vocab.txt. My sequence length is 20000, so I edited vocab_size=1048581 and max_position_embeddings=20000 in config.json, and max_len=20000 in tokenizer_config.json (a sketch of this step follows below).
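A minimal sketch of this step, not an official procedure: it assumes the bundled bert-config-6 folder ships the same three JSON files, copies them, and patches only the fields mentioned above (paths are placeholders relative to the repo root).

# Copy the bundled bert-config-6 files into a new bert-config-10 folder and
# patch the fields that differ for k=10. Paths below are placeholders.
import json, os, shutil

src_dir = "src/transformers/dnabert-config/bert-config-6"
dst_dir = "src/transformers/dnabert-config/bert-config-10"
os.makedirs(dst_dir, exist_ok=True)
for name in ("config.json", "special_tokens_map.json", "tokenizer_config.json"):
    shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))

# config.json: vocabulary size and maximum sequence length
cfg_path = os.path.join(dst_dir, "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["vocab_size"] = 4**10 + 5            # 1048581: all 10-mers plus 5 special tokens
cfg["max_position_embeddings"] = 20000   # my sequence length
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

# tokenizer_config.json: maximum length
tok_path = os.path.join(dst_dir, "tokenizer_config.json")
with open(tok_path) as f:
    tok = json.load(f)
tok["max_len"] = 20000
with open(tok_path, "w") as f:
    json.dump(tok, f, indent=2)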

3. Edited tokenization_dna.py

Edit VOCAB_KMER on line 54 of tokenization_dna.py:

VOCAB_KMER = {
    "69": "3",
    "261": "4",
    "1029": "5",
    "4101": "6",
    "1048581":"10"}

4. DNABERT-XL?:

From #39 :

Please use the tag --model_type=dnalong and set the --block_size as a multiple of 512. The DNABERT and DNABERT-XL use the same checkpoint (parameters).

dnalongcat and dnalong do not work for --model_type. Note --block_size 20480, which I got by multiplying 512 by 40 (the first multiple of 512 greater than my sequence length of 20000).

cd examples

export KMER=10
export TRAIN_FILE=PATH_TO_MY_10MER_FILE.txt
export TEST_FILE=PATH_TO_MY_10MER_FILE.txt
export SOURCE=PATH_TO_DNABERT_REPO
export OUTPUT_PATH=output$KMER

python run_pretrain.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=PATH_TO_vocab.txt \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 20480 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 24

Error: KeyError: '10'

============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - Model name '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' not found in model shortcut name list (dna3, dna4, dna5, dna6, dna10). Assuming '/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt' is a path, a model identifier, or url to a directory containing tokenizer files.
08/31/2022 15:33:27 - WARNING - transformers.tokenization_utils - Calling DNATokenizer.from_pretrained() with the path to a single file or url is deprecated
08/31/2022 15:33:27 - INFO - transformers.tokenization_utils - loading file /Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt
08/31/2022 15:33:28 - INFO - main - Training new model from scratch
08/31/2022 15:34:12 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-06, beta1=0.9, beta2=0.98, block_size=20480, cache_dir=None, config_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/config.json', device=device(type='cpu'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=25, learning_rate=0.0004, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=200000, mlm=True, mlm_probability=0.025, model_name_or_path=None, model_type='dna', n_gpu=0, n_process=24, no_cuda=False, num_train_epochs=1.0, output_dir='output10', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=6, per_gpu_train_batch_size=10, save_steps=500, save_total_limit=20, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='/Users/hyunjaecho/Desktop/code/DNABERT/src/transformers/dnabert-config/bert-config-10/vocab.txt', train_data_file='/Users/hyunjaecho/Desktop/code/unencoded/chrY.txt', warmup_steps=10000, weight_decay=0.01)
08/31/2022 15:34:12 - INFO - main - Loading features from cached file /Users/hyunjaecho/Desktop/code/unencoded/dna_cached_lm_20480_chrY.txt
08/31/2022 15:34:12 - INFO - main - Running training
08/31/2022 15:34:12 - INFO - main - Num examples = 555
08/31/2022 15:34:12 - INFO - main - Num Epochs = 100001
08/31/2022 15:34:12 - INFO - main - Instantaneous batch size per GPU = 10
08/31/2022 15:34:12 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 250
08/31/2022 15:34:12 - INFO - main - Gradient Accumulation steps = 25
08/31/2022 15:34:12 - INFO - main - Total optimization steps = 200000
Iteration:   0%| | 0/56 [00:00<?, ?it/s]
Epoch:   0%| | 0/100001 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_pretrain.py", line 888, in <module>
    main()
  File "run_pretrain.py", line 838, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_pretrain.py", line 421, in train
    inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
  File "run_pretrain.py", line 254, in mask_tokens
    mask_list = MASK_LIST[tokenizer.kmer]
KeyError: '10'
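The KeyError comes from MASK_LIST in run_pretrain.py (line 254 in the traceback), which, like VOCAB_KMER, is a hard-coded table that only has entries for k=3 to 6, so it presumably needs a "10" entry as well. A sketch of what that might look like; the existing entries are copied from my checkout, and the offsets for "10" are my own extrapolation of the pattern (each list holds the k-1 neighbouring k-mer offsets to mask around a sampled position), not something I have verified trains correctly:

# Sketch only: extend MASK_LIST near the top of run_pretrain.py.
# The "10" entry below is a hypothetical extrapolation of the existing pattern.
MASK_LIST = {
    "3": [-1, 1],
    "4": [-1, 1, 2],
    "5": [-2, -1, 1, 2],
    "6": [-2, -1, 1, 2, 3],
    "10": [-4, -3, -2, -1, 1, 2, 3, 4, 5],  # hypothetical entry for k=10
}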

smruti241 commented 1 year ago

I want to use kmer=26. How can I prepare the vocab.txt file for that? I tried the above code, but the process gets killed.
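For scale, 4^26 is about 4.5 quadrillion k-mers, so enumerating the full 26-mer vocabulary in memory (or on disk) is presumably what gets the process killed:

# Rough numbers only: the full 26-mer vocabulary is far too large to enumerate.
k = 26
n_kmers = 4 ** k
print(n_kmers)                       # 4503599627370496 (~4.5e15)
print(n_kmers * 27 / 1e15, "PB")     # ~122 PB just for the text file at 27 bytes per line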

Anshullllllll commented 1 month ago

I also want to pretrain a model with kmer=9 but am unable to do so. Please help... @jerryji1993