I am currently working on an "Autism gene classifier" project, which is a binary-classification system. I have a gene dataset with two columns: gene-symbol and syndromic (0 or 1).

The model I am using is GPT-2, and when I run my code on Google Colab I get the error IndexError: index out of range in self.

This is my code:
Install required libraries:

```python
!pip install torch
!pip install transformers
!pip install pandas
!pip install scikit-learn
```
Import libraries:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch
```
Load the gene data (a CSV file with 'gene-symbol' and 'syndromic' columns):

```python
data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')
```
Split the data into training and testing sets:

```python
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
```
Initialize the GPT-2 tokenizer and model:

```python
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
```
Add a new pad token (GPT-2 has no pad token by default):

```python
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
```
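I suspect this step may be the problem: adding a new token grows the tokenizer's vocabulary to 50258, while the pretrained GPT-2 embedding matrix only has 50257 rows, so the new [PAD] id would have no row to look up. From the transformers docs I understand the model's embeddings have to be resized after adding tokens; a minimal sketch of what I think is meant:

```python
# Sketch: after registering the new [PAD] token, grow the model's input
# embedding matrix so the new token id has a row to look up. Without this,
# embedding the pad id raises "IndexError: index out of range in self".
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))  # len(tokenizer) now counts [PAD]
```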
Tokenize and encode the gene symbols with a maximum length of 1024 tokens:

```python
train_tokens = tokenizer(train_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
test_tokens = tokenizer(test_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
```
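To narrow down where the index goes out of range, I assume one can compare the largest token id the tokenizer produces against the size of the model's embedding table, something like:

```python
# Sketch: if max_id >= vocab_rows, the embedding lookup is guaranteed to
# raise "IndexError: index out of range in self".
max_id = train_tokens['input_ids'].max().item()
vocab_rows = model.get_input_embeddings().num_embeddings
print(max_id, vocab_rows)  # e.g. 50257 vs 50257 would confirm an out-of-range id
```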
Extract embeddings from the GPT-2 model:

```python
model.eval()
with torch.no_grad():
    # Unpack the tokenizer output into input_ids/attention_mask keyword
    # arguments, then mean-pool the last hidden states over the sequence.
    train_embeddings = model(**train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(**test_tokens).last_hidden_state.mean(dim=1)
```
Flatten the embeddings to be used as input to logistic regression:

```python
# After mean-pooling these are already of shape (batch, hidden), so this
# reshape is effectively a no-op, but it is kept for safety.
train_embeddings = train_embeddings.view(train_embeddings.size(0), -1)
test_embeddings = test_embeddings.view(test_embeddings.size(0), -1)
```
Train the logistic regression classifier:

```python
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_data['syndromic'])
```
Evaluate the logistic regression classifier:

```python
train_predictions = clf.predict(train_embeddings)
train_accuracy = accuracy_score(train_data['syndromic'], train_predictions)
print("Training accuracy:", train_accuracy)
```
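For completeness, I intend to score the held-out split the same way once the error is resolved; presumably just:

```python
# Sketch: evaluate the same classifier on the 20% test split.
test_predictions = clf.predict(test_embeddings)
test_accuracy = accuracy_score(test_data['syndromic'], test_predictions)
print("Test accuracy:", test_accuracy)
```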
The IndexError is raised at the step where I extract embeddings from the GPT-2 model. I also tried setting max_length to 512 and to 1024, but that did not help. How can I resolve this error?
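I also wondered whether reusing GPT-2's existing end-of-text token as the pad token, instead of adding a brand-new [PAD] token, would sidestep the problem, since no new embedding row would be needed. Something like:

```python
# Sketch: reuse the existing eos token as the pad token, so padded positions
# map to an id (50256) that is already inside the embedding table.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
```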