kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0
137 stars 14 forks

Questions about experimental code #3

Closed: yongrenr closed this issue 3 months ago

yongrenr commented 3 months ago

Hello, I'm very interested in your model. Following your guidance for the genomic benchmarks, I ran into two problems: the dataloader length is 0, and the loss is infinite. I don't know if this is normal; can you help me confirm the cause?

Q1:

RUN:

```
python -m train \
  experiment=hg38/genomic_benchmark \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \
  dataset.dataset_name="dummy_mouse_enhancers_ensembl" \
  dataset.train_val_split_seed=1 \
  dataset.batch_size=128 \
  dataset.rc_aug=false \
  +dataset.conjoin_train=false \
  +dataset.conjoin_test=false \
  loader.num_workers=2 \
  model=caduceus \
  model.name=dna_embedding_caduceus \
  +model.config_path="" \
  +model.conjoin_test=false \
  +decoder.conjoin_train=true \
  +decoder.conjoin_test=false \
  optimizer.lr="1e-3" \
  trainer.max_epochs=10 \
  train.pretrained_model_path="<path to .ckpt file>" \
  wandb=null
```

ERROR: (screenshot attached)

Q2:

RUN:

```
python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=false \
  model=caduceus \
  model.config.d_model=128 \
  model.config.n_layer=4 \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=add \
  model.config.bidirectional_weight_tie=true \
  model.config.rcps=true \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  +trainer.val_check_interval=10000 \
  wandb=null
```

ERROR: (screenshot attached)
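For context on why a zero-length dataloader and an infinite loss often show up together: if a misconfigured dataset path or split leaves zero samples, the loader yields zero batches, and averaging the epoch loss over zero batches is not finite. A minimal plain-Python sketch (the helper names here are illustrative, not from the training code):

```python
import math

def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a dataloader would yield for a dataset of this size."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

def epoch_mean_loss(batch_losses: list) -> float:
    """Averaging over zero batches has no meaningful value; return inf as a sentinel."""
    if not batch_losses:
        return float("inf")
    return sum(batch_losses) / len(batch_losses)

# A dataset with 0 samples produces 0 batches...
assert num_batches(0, 128) == 0
# ...and the "average" loss over those 0 batches is not finite.
print(epoch_mean_loss([]))  # inf
```

So an infinite loss alongside `len(dataloader) == 0` usually points at missing or empty data rather than at the model itself.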

yair-schiff commented 3 months ago

Regarding Q1, this is an error I haven't hit before. Can you provide a bit more of the console output? Also, it looks like these two fields are empty in the command you used to launch the run. They need to be filled in with values that point to a pre-trained model:

```
+model.config_path=""
train.pretrained_model_path="<path to .ckpt file>"
```

Regarding Q2, can you post the LR and training loss graphs from wandb? Did the model ever hit a NaN loss during training?
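If the wandb graphs aren't available, a quick in-loop check can flag the first non-finite loss as it happens. A hypothetical wrapper around the training step (not code from this repo; with tensors you would use `torch.isfinite(loss).item()` instead of `math.isfinite`):

```python
import math

def check_loss(step: int, loss_value: float) -> float:
    """Raise immediately when the loss goes NaN/inf, so the offending step is visible."""
    if not math.isfinite(loss_value):
        raise RuntimeError(
            f"Non-finite loss {loss_value} at step {step}; "
            f"consider lowering optimizer.lr."
        )
    return loss_value

check_loss(0, 0.693)            # finite loss passes through unchanged
# check_loss(1, float("nan"))   # would raise RuntimeError
```

Failing fast like this makes it easy to tell whether the run diverged mid-training or was broken from the first step.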

yongrenr commented 3 months ago

Q1: Sorry, it's my fault. The command I posted had issues; here are more error screenshots.

RUN:

```
python -m train \
  experiment=hg38/genomic_benchmark \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \
  dataset.dataset_name="human_nontata_promoters" \
  dataset.train_val_split_seed=2 \
  dataset.batch_size=128 \
  dataset.rc_aug=false \
  +dataset.conjoin_train=false \
  +dataset.conjoin_test=false \
  loader.num_workers=2 \
  model=caduceus \
  model.name=dna_embedding_caduceus \
  +model.config_path="/home/gyc/caduceus-main/outputs/2024-03-11/20-21-19-995417/model_config.json" \
  +model.conjoin_test=false \
  +decoder.conjoin_train=true \
  +decoder.conjoin_test=false \
  optimizer.lr="1e-3" \
  trainer.max_epochs=10 \
  train.pretrained_model_path="/home/gyc/caduceus-main/outputs/2024-03-11/20-21-19-995417/checkpoints/last.ckpt" \
  wandb=null
```

ERROR: (screenshots attached)

yair-schiff commented 3 months ago

I just tried running this and did not hit the division by zero error. Can you confirm that the data was properly downloaded to ./data/genomic_benchmark/human_nontata_promoters/ by the genomic_benchmarks library?

This directory should look like this:

```
data/genomic_benchmark/human_nontata_promoters/
├── test
│   ├── negative
│   └── positive
└── train
    ├── negative
    └── positive
```

These directories should contain `.txt` files with the sequences.
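A quick way to verify that layout and count the sequence files is a small plain-Python check (a sketch, not part of the repo; adjust the root path to your setup):

```python
from pathlib import Path

def check_genomic_benchmark_dir(root) -> dict:
    """Verify the train/test x negative/positive layout and count .txt files in each."""
    root = Path(root)
    counts = {}
    for split in ("train", "test"):
        for label in ("negative", "positive"):
            d = root / split / label
            if not d.is_dir():
                raise FileNotFoundError(f"Missing directory: {d}")
            counts[f"{split}/{label}"] = len(list(d.glob("*.txt")))
    return counts

# Example:
# counts = check_genomic_benchmark_dir("data/genomic_benchmark/human_nontata_promoters")
# print(counts)
```

A zero count in any of the four subdirectories would explain a zero-length dataloader (and the resulting division-by-zero error).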

yongrenr commented 3 months ago

Thanks for the reminder! I've now run your code successfully, and it works great!

yair-schiff commented 3 months ago

Glad to hear it!