huggingface / course

The Hugging Face course on Transformers
https://huggingface.co/course
Apache License 2.0

IndexError: index out of range in self #686

Open · Santhu489 opened this issue 6 months ago

Santhu489 commented 6 months ago

I am currently working on an "Autism gene classifier" project, which is a binary classification system. I have a gene dataset with the columns gene-symbol and syndromic (0 and 1).

The model I am using is GPT-2, and when I run my code on Google Colab I get the error IndexError: index out of range in self.

This is my code:

Install required libraries

!pip install torch
!pip install transformers
!pip install pandas
!pip install scikit-learn

Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch

Load your gene data (assuming you have a CSV file with 'gene-symbol' and 'syndromic' columns)

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

Split the data into training and testing sets

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

Initialize GPT-2 tokenizer and model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

Add a new pad token

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

Tokenize and encode the gene symbols with a maximum length of 1024 tokens

train_tokens = tokenizer(train_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')
test_tokens = tokenizer(test_data['gene-symbol'].tolist(), padding=True, truncation=True, max_length=1024, return_tensors='pt')

Extract embeddings from GPT-2 model

model.eval()
with torch.no_grad():
    train_embeddings = model(train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(test_tokens).last_hidden_state.mean(dim=1)

Flatten the embeddings to be used as input to logistic regression

train_embeddings = train_embeddings.view(train_embeddings.size(0), -1)
test_embeddings = test_embeddings.view(test_embeddings.size(0), -1)

Train logistic regression classifier

clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_data['syndromic'])

Evaluate logistic regression classifier

train_predictions = clf.predict(train_embeddings)
train_accuracy = accuracy_score(train_data['syndromic'], train_predictions)
print("Training accuracy:", train_accuracy)

When I try to extract the embeddings from the GPT-2 model, this index error appears. I also tried setting the maximum length to 512 and to 1024, but neither works for me. How can I resolve this error? Please help.
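
One commonly reported cause of IndexError: index out of range in self in this setup is that add_special_tokens enlarges the tokenizer's vocabulary (the new '[PAD]' gets id 50257) while the gpt2 embedding matrix still only covers ids 0-50256, so the lookup fails inside nn.Embedding. The following is a minimal sketch of the usual adjustment, reusing the tokenizer, model, and token batches defined above; it has not been tested on the SFARI data, so treat it as an assumption rather than a confirmed fix for this issue.

# Grow the embedding matrix so the newly added [PAD] id has a row to look up
model.resize_token_embeddings(len(tokenizer))

model.eval()
with torch.no_grad():
    # Unpack the BatchEncoding so input_ids and attention_mask are passed as keyword arguments
    train_embeddings = model(**train_tokens).last_hidden_state.mean(dim=1)
    test_embeddings = model(**test_tokens).last_hidden_state.mean(dim=1)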

p0lyMth commented 1 month ago

@Santhu489, if you want help, you need to provide a small dataset sample of sfari_genes.csv for reproducibility. The code snippet

data = pd.read_csv('drive/MyDrive/Gene/sfari_genes.csv')

is not enough.
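
Even a synthetic stand-in with the two columns described in the issue would make the failure reproducible. For example (hypothetical placeholder symbols and labels, not real SFARI annotations):

import pandas as pd

# Hypothetical stand-in for sfari_genes.csv: the two columns the issue describes,
# with made-up gene symbols and 0/1 syndromic labels, purely for reproducibility
data = pd.DataFrame({
    'gene-symbol': ['GENE1', 'GENE2', 'GENE3', 'GENE4', 'GENE5', 'GENE6', 'GENE7', 'GENE8'],
    'syndromic':   [1, 0, 1, 0, 1, 1, 0, 0],
})
data.to_csv('sfari_genes_sample.csv', index=False)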