[CLAP] Acc drop after converting "HTSAT-base" type from origin model to huggingface model

happylittlecat2333 commented 11 months ago

System Info

Question Description

I want to use CLAP in huggingface model style but only find "laion/clap-htsat-unfused" and "laion/clap-htsat-fused" in huggingface Models. However, I wish to use the music CLAP model, which are recently updated in https://github.com/LAION-AI/CLAP, such as music_speech_epoch_15_esc_89.25.pt, so I find convert_clap_original_pytorch_to_hf.py to convert the clap model. But I find that the newly update model(like music_speech_audioset_epoch_15_esc_89.98.pt) are based on HTSAT-base model, the hidden_size and patch_embeds_hidden_size are different. So I revise the convert_clap_original_pytorch_to_hf.py to below. But after test three model( including HTSAT-base and HTSAT-tiny based model), I find Acc drop for HTSAT-base model, can you please help me find out the problem, and maybe upload huggingface version of CLAP model for newly updated CLAP model in original repo, and maybe give a new PR to be compatible with both HTSAT-base and HTSAT-tiny?

Who can help?

@ArthurZucker @younesbelkada

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

My revised `convert_clap_original_pytorch_to_hf.py`

import argparse
import re

import torch
# from CLAP import create_model
from laion_clap.clap_module import create_model

from transformers import AutoFeatureExtractor, ClapConfig, ClapModel, ClapAudioConfig, ClapProcessor

KEYS_TO_MODIFY_MAPPING = {
    "text_branch": "text_model",
    "audio_branch": "audio_model.audio_encoder",
    "attn": "attention.self",
    "self.proj": "output.dense",
    "attention.self_mask": "attn_mask",
    "mlp.fc1": "intermediate.dense",
    "mlp.fc2": "output.dense",
    "norm1": "layernorm_before",
    "norm2": "layernorm_after",
    "bn0": "batch_norm",
}

processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused", truncation="rand_trunc")

# ADDED
CLAP_AUDIO_CONFIG_DICT = {
    "HTSAT-tiny": {},
    "HTSAT-base": {
        "hidden_size": 1024,
        "patch_embeds_hidden_size": 128,
    }
}

def init_clap(checkpoint_path, amodel="HTSAT-tiny", enable_fusion=False):
    model, model_cfg = create_model(
        amodel,
        "roberta",
        checkpoint_path,
        precision="fp32",
        device="cuda:0" if torch.cuda.is_available() else "cpu",
        enable_fusion=enable_fusion,
        fusion_type="aff_2d" if enable_fusion else None,
    )
    return model, model_cfg

def rename_state_dict(state_dict):
    model_state_dict = {}

    sequential_layers_pattern = r".*sequential.(\d+).*"
    text_projection_pattern = r".*_projection.(\d+).*"

    for key, value in state_dict.items():
        # check if any key needs to be modified
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)

        if re.match(sequential_layers_pattern, key):
            # replace sequential layers with list
            sequential_layer = re.match(sequential_layers_pattern, key).group(1)

            key = key.replace(f"sequential.{sequential_layer}.", f"layers.{int(sequential_layer)//3}.linear.")
        elif re.match(text_projection_pattern, key):
            projecton_layer = int(re.match(text_projection_pattern, key).group(1))

            # Because in CLAP they use `nn.Sequential`...
            transformers_projection_layer = 1 if projecton_layer == 0 else 2

            key = key.replace(f"_projection.{projecton_layer}.", f"_projection.linear{transformers_projection_layer}.")

        if "audio" and "qkv" in key:
            # split qkv into query key and value
            mixed_qkv = value
            qkv_dim = mixed_qkv.size(0) // 3

            query_layer = mixed_qkv[:qkv_dim]
            key_layer = mixed_qkv[qkv_dim : qkv_dim * 2]
            value_layer = mixed_qkv[qkv_dim * 2 :]

            model_state_dict[key.replace("qkv", "query")] = query_layer
            model_state_dict[key.replace("qkv", "key")] = key_layer
            model_state_dict[key.replace("qkv", "value")] = value_layer
        else:
            model_state_dict[key] = value

    return model_state_dict

def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, amodel, enable_fusion=False):
    clap_model, clap_model_cfg = init_clap(checkpoint_path, amodel=amodel, enable_fusion=enable_fusion)

    clap_model.eval()
    state_dict = clap_model.state_dict()
    state_dict = rename_state_dict(state_dict)

    # ADDED
    clap_audio_config = CLAP_AUDIO_CONFIG_DICT[amodel]

    transformers_config = ClapConfig(audio_config=clap_audio_config)
    transformers_config.audio_config.enable_fusion = enable_fusion
    model = ClapModel(transformers_config)

    # ignore the spectrogram embedding layer
    model.load_state_dict(state_dict, strict=False)

    model.save_pretrained(pytorch_dump_folder_path)
    transformers_config.save_pretrained(pytorch_dump_folder_path)
    processor.save_pretrained(pytorch_dump_folder_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
    parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
    parser.add_argument("--amodel", default="HTSAT-tiny", type=str, help="Whether to enable fusion or not")
    parser.add_argument("--enable_fusion", action="store_true", help="Whether to enable fusion or not")
    args = parser.parse_args()

    convert_clap_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.amodel, args.enable_fusion)

convert script:

python convert_clap_original_pytorch_to_hf.py \
    --pytorch_dump_folder_path ./clap-htsat-base-unfused-music-audioset \
    --checkpoint_path ./pretrained_model/music_audioset_epoch_15_esc_90.14.pt \
    --config_path ./clap-htsat-base-unfused-music-audioset/config.json \
    --amodel HTSAT-base

python convert_clap_original_pytorch_to_hf.py \
    --pytorch_dump_folder_path ./clap-htsat-base-unfused-music-speech-audioset \
    --checkpoint_path ./pretrained_model/music_speech_audioset_epoch_15_esc_89.98.pt \
    --config_path ./clap-htsat-base-unfused-music-speech-audioset/config.json \
    --amodel HTSAT-base 

python convert_clap_original_pytorch_to_hf.py \
    --pytorch_dump_folder_path ./630k-audioset-best \
    --checkpoint_path ./pretrained_model/630k-audioset-best.pt \
    --config_path ./630k-audioset-best/config.json \
    --amodel HTSAT-tiny

My evalute on ESC50 adopted by clap eval code in original repo esc50_api.py

import glob
import json
import torch
import numpy as np
from transformers import ClapModel, ClapProcessor
import librosa

device = torch.device('cuda:0')

# download https://drive.google.com/drive/folders/1scyH43eQAcrBz-5fAw44C6RNBhC3ejvX?usp=sharing and extract ./ESC50_1/test/0.tar to ./ESC50_1/test/
esc50_test_dir = './ESC50_1/test/*/'
class_index_dict_path = './class_labels/ESC50_class_labels_indices_space.json'

# Load the model (for different converted model)
pretrained_model_path = "./clap-htsat-base-unfused-music-speech-audioset"
# pretrained_model_path = "./clap-htsat-base-unfused-music-audioset"
# pretrained_model_path = "./630k-audioset-best"
# pretrained_model_path = "laion/clap-htsat-unfused"
processor = ClapProcessor.from_pretrained(pretrained_model_path)
model = ClapModel.from_pretrained(pretrained_model_path)

# Get the class index dict
class_index_dict = {v: k for v, k in json.load(open(class_index_dict_path)).items()}

# Get all the data
audio_files = sorted(glob.glob(esc50_test_dir + '**/*.flac', recursive=True))
json_files = sorted(glob.glob(esc50_test_dir + '**/*.json', recursive=True))

print("audio_files: ", len(audio_files))
print("json_files: ", len(json_files))

ground_truth_idx = [class_index_dict[json.load(open(jf))['tag'][0]] for jf in json_files]

with torch.no_grad():
    ground_truth = torch.tensor(ground_truth_idx).view(-1, 1)

    # Get text features
    all_texts = ["This is a sound of " + t for t in class_index_dict.keys()]

    inputs = processor(text=all_texts, return_tensors="pt", padding=True)
    text_embed = model.get_text_features(**inputs)
    print("text_embed: ", text_embed.shape)

    audio_input = []
    for audio_file in audio_files:
        audio_waveform, _ = librosa.load(audio_file, sr=48000)
        audio_input.append(audio_waveform)

    inputs = processor(audios=audio_input, return_tensors="pt", padding=True, sampling_rate=48000)
    audio_embed = model.get_audio_features(**inputs)

    print("audio_embed: ", audio_embed.shape)

    # audio_embed = model.get_audio_embedding_from_filelist(x=audio_files)

    ranking = torch.argsort(torch.tensor(audio_embed) @ torch.tensor(text_embed).t(), descending=True)
    preds = torch.where(ranking == ground_truth)[1]
    preds = preds.cpu().numpy()

    metrics = {}
    metrics[f"mean_rank"] = preds.mean() + 1
    metrics[f"median_rank"] = np.floor(np.median(preds)) + 1
    for k in [1, 5, 10]:
        metrics[f"R@{k}"] = np.mean(preds < k)
    # map@10
    metrics[f"mAP@10"] = np.mean(np.where(preds < 10, 1 / (preds + 1), 0.0))

    print(
        f"Zeroshot Classification Results: "
        + "\t".join([f"{k}: {round(v, 4):.4f}" for k, v in metrics.items()])
    )

Expected behavior

Evaluate Result

630k-audioset-best (before convert, HTSAT-tiny type)

Zeroshot Classification Results: mean_rank: 1.1450 median_rank: 1.0000 R@1: 0.9275 R@5: 0.9975 R@10: 1.0000 mAP@10: 0.9556
630k-audioset-best (after convert, HTSAT-tiny type)

Zeroshot Classification Results: mean_rank: 1.1850 median_rank: 1.0000 R@1: 0.9000 R@5: 0.9975 R@10: 1.0000 mAP@10: 0.9400
music_audioset_epoch_15_esc_90.14 (before convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.1850 median_rank: 1.0000 R@1: 0.9175 R@5: 0.9950 R@10: 0.9975 mAP@10: 0.9513
music_audioset_epoch_15_esc_90.14 (after convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 3.2800 median_rank: 2.0000 R@1: 0.4700 R@5: 0.8425 R@10: 0.9300 mAP@10: 0.6312
music_speech_audioset_epoch_15_esc_89.98 (before convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.1450 median_rank: 1.0000 R@1: 0.9275 R@5: 0.9900 R@10: 1.0000 mAP@10: 0.9568
music_speech_audioset_epoch_15_esc_89.98 (after convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 3.3575 median_rank: 1.0000 R@1: 0.5200 R@5: 0.8375 R@10: 0.9325 mAP@10: 0.6491

Therefore, we can see that HTSAT-base type have Acc drop after converting to huggingface type, could you please help us figure out this bug, and maybe upload huggingface version of CLAP model for music_speech_epoch_15_esc_89.25.pt, music_speech_audioset_epoch_15_esc_89.98.pt, and and maybe give a new PR to be compatible with both HTSAT-base and HTSAT-tiny? Thanks!

LysandreJik commented 11 months ago

cc @younesbelkada

ylacombe commented 10 months ago

Hi @happylittlecat2333, thanks for the very thorough analysis here!

I've opened a PR (#27153) to convert the weights from the new clap checkpoints. I believe that you missed some parameters when you converted the weights!

You can find the converted weights (here, here and here - yet to be moved to laion organization). Would you mind running your benchmark on it again ? Thanks!

happylittlecat2333 commented 10 months ago

Great Job!!! The converted models have the similar results with the new clap checkpoints!

Below is my result after converting the models.

Evaluate Result

630k-audioset-best before convert, HTSAT-tiny type

Zeroshot Classification Results: mean_rank: 1.1450 median_rank: 1.0000 R@1: 0.9275 R@5: 0.9975 R@10: 1.0000 mAP@10: 0.9556

630k-audioset-best (after convert, HTSAT-tiny type)

Zeroshot Classification Results: mean_rank: 1.1850 median_rank: 1.0000 R@1: 0.9000 R@5: 0.9975 R@10: 1.0000 mAP@10: 0.9400

music_audioset_epoch_15_esc_90.14 (before convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.1850 median_rank: 1.0000 R@1: 0.9175 R@5: 0.9950 R@10: 0.9975 mAP@10: 0.9513

music_audioset_epoch_15_esc_90.14 (after convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.2325 median_rank: 1.0000 R@1: 0.9100 R@5: 0.9900 R@10: 0.9950 mAP@10: 0.9467

music_speech_audioset_epoch_15_esc_89.98 (before convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.1450 median_rank: 1.0000 R@1: 0.9275 R@5: 0.9900 R@10: 1.0000 mAP@10: 0.9568

music_speech_audioset_epoch_15_esc_89.98 (after convert, HTSAT-base type)

Zeroshot Classification Results: mean_rank: 1.1100 median_rank: 1.0000 R@1: 0.9350 R@5: 0.9975 R@10: 1.0000 mAP@10: 0.9622

PS: I converted the models using PR (https://github.com/huggingface/transformers/pull/27153), and the converted model work great! But I find preprocessor config and tokenizer config are not saved, including preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json and vocab.json. It will be perfect if the converting code incude the whole saving process!

Thanks for your wonderful work!!

ylacombe commented 10 months ago

Hey @happylittlecat2333, many thanks for running the benchmark so promptly! Happy to see that it fixed the benchmark ! I will merge the PR asap!

PS: I converted the models using PR (https://github.com/huggingface/transformers/pull/27153), and the converted model work great! But I find preprocessor config and tokenizer config are not saved, including preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json and vocab.json. It will be perfect if the converting code incude the whole saving process!

I've manually added the processor (feature extractor and tokenizer) to the repos, as it was the same than the previous checkpoints! For now, I'll leave the PR as it is, but I keep that in mind if the issue appears again!

BTW, you can now find the weights (including the processor configs) in the LAION organization on the hub - here, here and here. Feel free to use these checkpoints if you use them again!

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ScottishFold007 commented 4 months ago

Hey @happylittlecat2333, many thanks for running the benchmark so promptly! Happy to see that it fixed the benchmark ! I will merge the PR asap!

PS: I converted the models using PR (#27153), and the converted model work great! But I find preprocessor config and tokenizer config are not saved, including preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json and vocab.json. It will be perfect if the converting code incude the whole saving process!

I've manually added the processor (feature extractor and tokenizer) to the repos, as it was the same than the previous checkpoints! For now, I'll leave the PR as it is, but I keep that in mind if the issue appears again!

BTW, you can now find the weights (including the processor configs) in the LAION organization on the hub - here, here and here. Feel free to use these checkpoints if you use them again!

Hi, can you help convert the Microsoft mclap model (https://huggingface.co/microsoft/msclap/tree/main)? This model has been trained on a huge number of audio-text pairs and actually works better than the original clap, but the architecture of the model is different from the previous one.

huggingface / transformers