Simple classifications:
import numpy as np
import torch
from scipy.io import wavfile as wav
from torch.nn import CrossEntropyLoss
from transformers import pipeline

pipe = pipeline("audio-classification", "mtg/maest...")

file_paths = np.load("../data/file_paths.npy")

# Discogs parent genres, pulled from the "Genre---Style" label strings
# (discogs_labels: list of the model's label strings, loaded elsewhere)
dlabel = list(set(row.split("---")[0] for row in discogs_labels))
glabel = set(np.load("labels.npy"))  # GTZAN genres

dlabel_to_idx = {
    label: idx
    for (idx, label) in enumerate(dlabel)
}
glabel_to_dlabel = {
    # manually map
}

_, audio = wav.read(file_paths[0])  # wav.read returns (sample_rate, data)
outputs = pipe(audio)
# [{"score": 0.123, "label": "Electronic---Noise"}, ...]

# keep the highest score seen for each Discogs parent genre
predictions = np.zeros(len(dlabel_to_idx))
for output in outputs:
    idx = dlabel_to_idx[output["label"].split("---")[0]]
    if output["score"] > predictions[idx]:
        predictions[idx] = output["score"]

criterion = CrossEntropyLoss()  # the optimizer, not the loss, takes model params
loss = criterion(torch.tensor(predictions).unsqueeze(0), y)  # y: target class index
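One catch with the snippet above: the pipeline returns plain floats, so no gradients reach the model and that loss can't actually train anything. A rough sketch of what a real training step might look like instead (assuming the pipeline exposes its feature extractor, and with genre as a placeholder for the clip's GTZAN label):

import torch
from torch.optim import Adam

optimizer = Adam(pipe.model.parameters(), lr=0.001)

# featurize one clip; 16 kHz is an assumption about the checkpoint's input rate
inputs = pipe.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
# map the clip's GTZAN genre to one of the model's own "Genre---Style" labels
# (this assumes glabel_to_dlabel maps to full Discogs label strings)
target = torch.tensor([pipe.model.config.label2id[glabel_to_dlabel[genre]]])

logits = pipe.model(**inputs).logits  # gradients flow through the model here
loss = criterion(logits, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()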
After some experimenting with the above, I think it may be best to build a class around an ASTForAudioClassification
instance loaded from pretrained weights. This is the same architecture MAEST is built on.
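For reference, loading that backbone with a fresh head could look like this (the AudioSet checkpoint name is just one option; the MAEST weights themselves would work too):

from transformers import ASTForAudioClassification

model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",  # one possible AST checkpoint
    num_labels=10,                 # e.g. the ten GTZAN genres
    ignore_mismatched_sizes=True,  # swap in a freshly initialized head
)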
By wrapping the training in a model class, we can easily handle the data transformation steps needed to convert GTZAN labels to Discogs label outputs (a first stab at that mapping is sketched below). This will let us compare the result against the other models we've tested.
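The mapping itself would have to be curated by hand; a hypothetical start, where every name is a guess that would need checking against the actual Discogs taxonomy:

glabel_to_dlabel = {
    "rock": "Rock",
    "jazz": "Jazz",
    "hiphop": "Hip Hop",
    "classical": "Classical",
    "reggae": "Reggae",
    # ... remaining GTZAN genres mapped the same way
}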
It's recommended to freeze the pretrained layers and only update the new layers during backpropagation, but I'll have to experiment a bit and see what gives the best results.
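Freezing could be as simple as flipping requires_grad on the backbone before building the optimizer (a sketch, reusing the model loaded above):

import torch

# freeze the pretrained encoder; only the new classification head stays trainable
for param in model.base_model.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)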
The code will look something like this (though not exactly, since the example below wraps everything in an sklearn-style estimator):
import numpy as np
import torch
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import f1_score
from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor


class AudioClassificationPipeline(BaseEstimator, ClassifierMixin):
    def __init__(self, model_name="facebook/wav2vec2-base-960h", num_labels=8):
        # the feature extractor replaces the deprecated Wav2Vec2Tokenizer and
        # handles normalization, so no separate StandardScaler is needed
        self.extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        self.encoder = AutoModelForAudioClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    def _logits(self, X):
        # raw 1-D waveforms -> padded, normalized model inputs -> class logits
        inputs = self.extractor(
            list(X), sampling_rate=16000, return_tensors="pt", padding=True
        )
        return self.encoder(**inputs).logits

    def fit(self, X, y):
        # fine-tuning loop (optimizer, epochs, frozen layers) goes here
        return self

    def predict(self, X):
        with torch.no_grad():
            return np.argmax(self._logits(X).numpy(), axis=1)

    def score(self, X, y):
        return f1_score(y, self.predict(X), average="macro")
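Usage would then follow the usual sklearn pattern (X_train etc. are placeholders; the 16 kHz mono assumption comes from the wav2vec2 checkpoint):

clf = AudioClassificationPipeline()
clf.fit(X_train, y_train)          # X: list of 1-D float32 waveforms at 16 kHz
preds = clf.predict(X_test)
print(clf.score(X_test, y_test))   # macro F1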
To round off the classifier experiments, we want to test a pretrained music classifier and compare its architecture against our simpler models.
Steps