kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

Fine Tuning on custom datasets #25

Closed leannmlindsey closed 2 months ago

leannmlindsey commented 2 months ago

I would really like to use your model for fine-tuning on a dataset for a binary classification task (one on which I have fine-tuned several other genomic language models).

I have tried two ways to do this, but neither has been successful yet.

Method 1 - Huggingface Trainer:

I tried instantiating the model using `AutoModelForSequenceClassification`:

def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=True,
        trust_remote_code=True,
    )
    model = transformers.AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        num_labels=train_dataset.num_labels,
        trust_remote_code=True,
    )
Using this method I got this error:
    TIME: Start: = 2024-04-24 09:36:24
tokenizer_config.json: 100%|██████████| 1.48k/1.48k [00:00<00:00, 526kB/s]
tokenization_caduceus.py: 100%|██████████| 4.97k/4.97k [00:00<00:00, 2.02MB/s]
A new version of the following files was downloaded from https://huggingface.co/kuleshov-group/caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3:
- tokenization_caduceus.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
special_tokens_map.json: 100%|██████████| 173/173 [00:00<00:00, 199kB/s]
WARNING:root:Perform single sequence classification...
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    train()
  File "train.py", line 244, in train
    train_dataset = SupervisedDataset(tokenizer=tokenizer, 
  File "train.py", line 157, in __init__
    self.attention_mask = output["attention_mask"]
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/CADUCEUS_3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'attention_mask'
TIME: End: = 2024-04-24 09:54:14
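
As an aside, the download warning in the log suggests pinning the remote code files to a fixed revision; a sketch of what that might look like (the revision string is a placeholder, not a real commit hash):

```python
import transformers

# Sketch: pin the trust_remote_code files (e.g. tokenization_caduceus.py)
# to a specific commit so newer versions are not silently downloaded.
# "<commit-hash>" is a placeholder to be replaced with a real revision.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "kuleshov-group/caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3",
    revision="<commit-hash>",
    trust_remote_code=True,
)
```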

Perhaps the tokenizer is not returning all of the information the Hugging Face model is expecting?

This method has worked for me on many other Hugging Face models, so I am not sure what is different in this case.
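
As a possible workaround, I could build the attention mask myself from the padded input ids; a minimal sketch, assuming the pad token id is known (e.g. from `tokenizer.pad_token_id`), with the result convertible to a tensor before training:

```python
def build_attention_mask(input_ids, pad_token_id):
    """Derive an attention mask from already-padded token ids.

    input_ids: batch as a list of equal-length lists of token ids.
    Returns a parallel list of lists with 1 for real tokens and
    0 for padding positions.
    """
    return [[0 if tok == pad_token_id else 1 for tok in seq]
            for seq in input_ids]
```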

Method 2 - Using the dataloader from the Genomic Benchmarks code

In this case, I modified your code to register my task as a new Genomic Benchmarks task, formatted my input data to match the Genomic Benchmarks layout, and added it as a new directory.

Here is the error that I get (I have also attached the logfile):

RuntimeError: Trying to resize storage that is not resizable
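
From what I have read, this RuntimeError often comes from the default collate function trying to stack variable-length sequences into one tensor; if that is the cause here, a padding `collate_fn` might help. A minimal list-based sketch (the default `pad_token_id=0` is an assumption and should match the tokenizer):

```python
def pad_collate(batch, pad_token_id=0):
    """Pad a batch of (token_id_list, label) pairs to a common length.

    Padding every sequence to the batch maximum lets the batch be
    stacked into a single rectangular tensor afterwards.
    """
    seqs, labels = zip(*batch)
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_token_id] * (max_len - len(s)) for s in seqs]
    return padded, list(labels)
```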

Perhaps there is an easier way? I would appreciate any guidance you can provide. Thank you.

LeAnn

logfile.txt

leannmlindsey commented 2 months ago

Update: The above error in Method 2 was a problem with the data, which I have now fixed.