nyrahealth / CrisperWhisper

Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

no dysfluencies #10

Open bisserai opened 1 month ago

bisserai commented 1 month ago

Hello,

First of all, thanks for developing this tool and making it available! I'm trying to use CrisperWhisper to annotate a naturalistic language production experiment in German. The files are 1 min each, with a single speaker per recording answering open-ended questions. I'm running the code below on CPU and get one dysfluency at best (for example, in a 17 s excerpt that contains 4 "ehm"s), and I can't find a way to improve this. My colleague tried running it on a server: it is much faster, but it does not detect more dysfluencies.

Thanks for your help!
bissera

I have a MacBook Pro with a 2.3 GHz quad-core Intel Core i7 processor and 32 GB of RAM, running macOS Sonoma 14.4.1. My IDE is VS Code 1.94.1.

The code in my Jupyter notebook:

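import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "nyrahealth/CrisperWhisper"

# Load the model weights and the matching processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Build the ASR pipeline with word-level timestamps, forcing German transcription.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps="word",
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "<|de|>", "task": "transcribe"},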
)

LaurinmyReha commented 3 weeks ago

Hello,

Well, thank you for using it. Your code looks good. The problem is that for German we had far fewer disfluencies in the training data. We worked largely with synthetic data and hoped that the ability to detect disfluencies would transfer over from English. We have also observed that disfluency detection for German is not satisfactory yet. However, we have now constructed a dataset containing over 100,000 annotated fillers for German and will retrain CrisperWhisper soon, so I hope the updated version will be more helpful for German. In the meantime, I have seen some improvements for German by increasing the beam size and looking at beams that are a bit less probable. In your case, taking the beam that contains the most fillers could be a simple heuristic that improves what you are looking for.
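A minimal sketch of that "most fillers" heuristic, under several assumptions that are not confirmed in this thread: the filler inventory in `FILLER_PATTERN` (bracketed `[UH]`/`[UM]` tags plus spelled-out forms such as "äh", "ähm", "ehm", "hm") is a guess and should be adapted to what the model actually emits for your data. The helper names `count_fillers`, `pick_most_fillers`, and `transcribe_candidates` are hypothetical. Whether `num_return_sequences` greater than 1 works with Whisper's `generate` may depend on the installed `transformers` version. This path also returns plain text only, so word timestamps would need a separate pass.

```python
import re

# ASSUMPTION: filler inventory. Matches bracketed tags ([UH], [UM]) and common
# spelled-out German fillers (äh, ähm, ehm, hm). Adjust to your annotation scheme.
FILLER_PATTERN = re.compile(r"\[(?:UH|UM)\]|\b(?:ä+h+m*|e+h+m+|h+m+)\b", re.IGNORECASE)


def count_fillers(text):
    """Count filler tokens in one candidate transcript."""
    return len(FILLER_PATTERN.findall(text))


def pick_most_fillers(candidates):
    """Return the candidate with the most fillers.

    max() keeps the first maximal element, so if candidates are ordered from
    most to least probable, ties resolve to the more probable beam.
    """
    return max(candidates, key=count_fillers)


def transcribe_candidates(audio, sampling_rate, model, processor, num_beams=5):
    """Hypothetical helper: return one decoded transcript per beam.

    `model` and `processor` are the objects loaded from "nyrahealth/CrisperWhisper";
    `audio` is a 1-D float array of at most 30 s at `sampling_rate` Hz.
    """
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    features = inputs.input_features.to(model.device, dtype=model.dtype)
    ids = model.generate(
        features,
        num_beams=num_beams,
        num_return_sequences=num_beams,  # keep every beam, not only the best one
        language="<|de|>",
        task="transcribe",
    )
    return processor.batch_decode(ids, skip_special_tokens=True)
```

With candidates in hand, `pick_most_fillers(transcribe_candidates(audio, 16000, model, processor))` would select the transcript to keep. Preferring filler-rich beams trades some transcription probability for recall on fillers, so it is worth spot-checking the chosen beams against the audio.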