pmhalvor / music-genre-classifier

An experiment aimed to compare a range of ML-based music classifiers

Pretrained classifier #1

Open pmhalvor opened 8 months ago

pmhalvor commented 8 months ago

To round off the classifier experiments, we want to test a pretrained music classifier and compare its architecture against our simpler models.

Steps

pmhalvor commented 8 months ago

Simple classifications:

import numpy as np
from scipy.io import wavfile as wav
from torch import tensor
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from transformers import pipeline

# pretrained MAEST classifier (model id truncated here)
pipe = pipeline("audio-classification", "mtg/maest...")

file_paths = np.load("../data/file_paths.npy")

# discogs_labels: the full Discogs-style label set (e.g. "Electronic---Noise"),
# assumed loaded elsewhere; keep only the unique top-level genres
dlabel = list(set([row.split("---")[0] for row in discogs_labels]))
# GTZAN genre labels
glabel = set(np.load("labels.npy"))

dlabel_to_idx = {
    label: idx
    for (idx, label) in enumerate(dlabel)
}
glabel_to_dlabel = {
    # manually map GTZAN genres to Discogs genres
}

_, audio = wav.read(file_paths[0])
audio = audio.astype(np.float32)  # the pipeline expects a float waveform

outputs = pipe(audio)
# [{"score": 0.123, "label": "Electronic---Noise"}, ...]

# keep the highest score seen for each top-level Discogs genre
predictions = np.zeros(len(dlabel_to_idx))

for output in outputs:
    idx = dlabel_to_idx[output["label"].split("---")[0]]
    if output["score"] > predictions[idx]:
        predictions[idx] = output["score"]

optimizer = Adam(pipe.model.parameters(), lr=0.001)
criterion = CrossEntropyLoss()

# y: integer index of the target Discogs genre (mapped from the GTZAN label)
loss = criterion(tensor(predictions).unsqueeze(0), tensor([y]))

pmhalvor commented 8 months ago

After some experimenting with the above, I think it may be best to build a class around an ASTForAudioClassification instance loaded from a pretrained checkpoint. This is the same architecture MAEST is built on.
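
For example, loading the pretrained instance could look like this (a minimal sketch; the checkpoint id below is just a placeholder, and the MAEST weights or any other AST-based checkpoint would slot in the same way):

from transformers import ASTFeatureExtractor, ASTForAudioClassification

checkpoint = "MIT/ast-finetuned-audioset-10-10-0.4593"  # placeholder AST checkpoint
feature_extractor = ASTFeatureExtractor.from_pretrained(checkpoint)
model = ASTForAudioClassification.from_pretrained(checkpoint)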

By wrapping the training in a model class, we can easily handle the data transformation steps necessary to convert GTZAN labels to Discogs label outputs. This will let us compare against the other models we've tested.
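
For reference, the label-mapping step inside that wrapper could look roughly like this (the dictionary entries are illustrative guesses, not the final mapping; dlabel_to_idx is the same index dict as in the snippet above):

# illustrative GTZAN -> Discogs top-level genre mapping (not the final mapping)
glabel_to_dlabel = {
    "hiphop": "Hip Hop",
    "rock": "Rock",
    "jazz": "Jazz",
    # ... remaining GTZAN genres mapped by hand
}

def gtzan_to_target_idx(gtzan_label):
    # convert a GTZAN genre string into the Discogs class index the model predicts
    return dlabel_to_idx[glabel_to_dlabel[gtzan_label]]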

It's generally recommended to freeze the pretrained layers and only update the new layers during backpropagation, but I'll have to experiment a bit and see what gives the best results.
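
For example, freezing could be as simple as toggling requires_grad (a sketch; model here is the ASTForAudioClassification instance loaded above, which keeps its new head under a classifier attribute):

import torch

# freeze everything except the (new) classification head
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

# the optimizer then only sees the trainable parameters
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)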

The code will look something like this (though not exactly, because the example below uses a Pipeline base):

import numpy as np
import torch
import torch.nn as nn
from sklearn.pipeline import Pipeline
from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor

class AudioClassificationPipeline(Pipeline):
    def __init__(self, model_name="facebook/wav2vec2-base-960h", num_labels=8):
        # placeholder step keeps the sklearn Pipeline base happy
        super().__init__(steps=[("model", "passthrough")])
        # feature extractor normalizes and pads the raw waveforms
        self.extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        # pretrained encoder with a fresh classification head
        self.encoder = AutoModelForAudioClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        # extra trainable layer mapping the encoder logits to our genre labels
        self.fc = nn.Linear(num_labels, num_labels)

    def _logits(self, X):
        inputs = self.extractor(
            list(X), sampling_rate=16_000, return_tensors="pt", padding=True
        )
        return self.fc(self.encoder(**inputs).logits)

    def fit(self, X, y, epochs=3, lr=1e-3):
        # only the new fc layer is passed to the optimizer,
        # effectively freezing the pretrained weights
        optimizer = torch.optim.Adam(self.fc.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = criterion(self._logits(X), torch.as_tensor(y))
            loss.backward()
            optimizer.step()
        return self

    def predict(self, X):
        with torch.no_grad():
            return np.argmax(self._logits(X).numpy(), axis=1)
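
Usage would then mirror the other models (hypothetical example; X_train/X_test are raw waveform arrays and y_train integer genre indices):

clf = AudioClassificationPipeline()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)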