I would really like to fine-tune your model on a dataset for a binary classification task (one on which I have fine-tuned several other genomic language models).
I have tried two ways to do this, but I have not yet been successful with either.
Method 1 - Huggingface Trainer:
I tried instantiating the model using "AutoModelForSequenceClassification".
TIME: Start: = 2024-04-24 09:36:24
tokenizer_config.json: 100%|██████████| 1.48k/1.48k [00:00<00:00, 526kB/s]
tokenization_caduceus.py: 100%|██████████| 4.97k/4.97k [00:00<00:00, 2.02MB/s]
A new version of the following files was downloaded from https://huggingface.co/kuleshov-group/caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3:
- tokenization_caduceus.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
special_tokens_map.json: 100%|██████████| 173/173 [00:00<00:00, 199kB/s]
WARNING:root:Perform single sequence classification...
Traceback (most recent call last):
File "train.py", line 304, in <module>
train()
File "train.py", line 244, in train
train_dataset = SupervisedDataset(tokenizer=tokenizer,
File "train.py", line 157, in __init__
self.attention_mask = output["attention_mask"]
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/CADUCEUS_3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
return self.data[item]
KeyError: 'attention_mask'
TIME: End: = 2024-04-24 09:54:14
Perhaps the tokenizer is not returning all of the information the huggingface model is expecting?
This method has worked on many different huggingface models for me, so I am not sure what is different in this case.
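In case it is useful, here is a minimal sketch of the workaround I am considering: deriving the attention_mask from input_ids when the tokenizer does not return one. The pad token id of 4 and the helper name are my own assumptions, not part of the Caduceus tokenizer's documented behavior.

```python
def add_attention_mask(encoding, pad_token_id):
    """If the tokenizer did not return an attention_mask, derive one from
    input_ids: 1 for real tokens, 0 for padding (assumed pad semantics)."""
    if "attention_mask" not in encoding:
        encoding["attention_mask"] = [
            [1 if tok != pad_token_id else 0 for tok in ids]
            for ids in encoding["input_ids"]
        ]
    return encoding

# Hypothetical batch: three real tokens followed by two pads (pad id 4).
enc = {"input_ids": [[7, 9, 11, 4, 4]]}
enc = add_attention_mask(enc, pad_token_id=4)
# enc["attention_mask"] == [[1, 1, 1, 0, 0]]
```

If this is roughly the right idea, I could patch it into my SupervisedDataset before the line that raised the KeyError.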
Method 2 - Using the dataloader from the Genomic Benchmark code
In this case, I modified your code to add my task as a new Genomic Benchmarks task, formatted my input data to match the Genomic Benchmarks layout, and added it as a new directory.
Here is the error that I get (I have also attached the logfile):
RuntimeError: Trying to resize storage that is not resizable
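My guess is that this error comes from the DataLoader's default collation trying to stack variable-length sequences into one tensor. If so, a custom collate_fn that pads each batch to its longest sequence might fix it; this is only a sketch of that idea in plain Python, and the pad id of 4 is an assumption.

```python
def pad_collate(batch, pad_id=4):
    """Collate (input_ids, label) pairs, right-padding input_ids to the
    longest sequence in the batch so they can be stacked uniformly."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return input_ids, labels

# Usage (hypothetical): DataLoader(dataset, batch_size=8, collate_fn=pad_collate)
```

Does something like this belong in the Genomic Benchmarks dataloader path, or is padding expected to happen earlier?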
Perhaps there is an easier way? I would appreciate any guidance you can provide. Thank you.
LeAnn
logfile.txt