simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

`audio/wave` `.wav` files not supported #603

Closed by NightMachinery 1 week ago

NightMachinery commented 2 weeks ago

I'm recording audio from my microphone using sox and saving the recordings as .wav files. When I try to attach these files to the gemini-1.5-flash-8b-latest model, I receive this error:

Error: This model does not support attachments of type 'audio/wave', only application/pdf, image/png, image/jpeg, image/webp, image/heic, image/heif, audio/wav, audio/mp3, audio/aiff, audio/aac, audio/ogg, audio/flac, audio/mpeg, video/mp4, video/mpeg, video/mov, video/avi, video/x-flv, video/mpg, video/webm, video/wmv, video/3gpp

I suspect the issue is simply that llm doesn't recognize that audio/wave and audio/wav are actually the same MIME type. Is this correct?

simonw commented 1 week ago

Yup, that's a bug - thanks. You can work around it with the --at option, which lets you specify the type directly:

 llm -m gemini-1.5-flash-latest --at output.wav audio/wav transcribe

Thanks for the tip about sox by the way, this worked for me on macOS:

brew install sox
sox -d output.wav
# Hit Ctrl+C when done

simonw commented 1 week ago

It looks like audio/wav is indeed the correct content type here. It's not clear where audio/wave came from. The library I'm using for content type detection - https://pypi.org/project/puremagic/ - apparently supports both wave https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L103 and wav https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L1118 and it looks like it prefers audio/wave for some reason.

simonw commented 1 week ago

puremagic uses data from https://www.garykessler.net/library/file_sigs.html - it lists two byte sequences for WAV:

[Screenshot: the two WAV entries on file_sigs.html]

The first of those matches the puremagic definition of audio/wave, the second matches its audio/wav.

simonw commented 1 week ago

Interesting, the output.wav file I created using sox looks like this:

hexdump -C output.wav | head -n 4
00000000  52 49 46 46 48 e0 02 00  57 41 56 45 66 6d 74 20  |RIFFH...WAVEfmt |
00000010  28 00 00 00 fe ff 01 00  44 ac 00 00 10 b1 02 00  |(.......D.......|
00000020  04 00 20 00 16 00 20 00  04 00 00 00 01 00 00 00  |.. ... .........|
00000030  00 00 10 00 80 00 00 aa  00 38 9b 71 66 61 63 74  |.........8.qfact|

Which contains BOTH of the byte sequences listed on file_sigs.html, so maybe I misinterpreted that and there is only one WAV file format and it's that?

In which case, why does puremagic have those two sequences listed separately in their magic_data.json file?
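
A minimal sketch of the header check being discussed, assuming the usual RIFF layout where bytes 0-3 are `RIFF`, bytes 4-7 are a size field, and bytes 8-11 name the form type. The function name `riff_form_type` is made up for illustration and is not part of puremagic or llm; the sample bytes are the first row of the hexdump above.

```python
def riff_form_type(header: bytes):
    """Return the 4-byte RIFF form type (e.g. b'WAVE'), or None if not RIFF."""
    # A RIFF container needs at least 12 bytes: tag, size, form type.
    if len(header) < 12 or header[:4] != b"RIFF":
        return None
    # Bytes 4-7 are the chunk size and are skipped here.
    return header[8:12]


# First 16 bytes of output.wav, copied from the hexdump above.
sample = bytes.fromhex("52494646 48e00200 57415645 666d7420")
print(riff_form_type(sample))  # b'WAVE'
```

Read this way, `RIFF` at offset 0 and `WAVE` at offset 8 are two parts of one header rather than two competing signatures, which is consistent with the single file matching both entries.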

simonw commented 1 week ago

This file in the puremagic tests has the same header: https://github.com/cdgriffith/puremagic/blob/master/test/resources/audio/test.wav

That's one of four audio files in the tests https://github.com/cdgriffith/puremagic/tree/master/test/resources/audio - and the only assertion it runs is that the file extension .wav is correctly determined: https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/test/test_common_extensions.py#L43-L49

simonw commented 1 week ago

Filed an issue here:

But seeing as IANA doesn't list either audio/wav or audio/wave on https://www.iana.org/assignments/media-types/media-types.xhtml#audio it's not clear that there IS a correct answer here!

simonw commented 1 week ago

Also relevant:

python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav
[PureMagicWithConfidence(byte_match=b'RIFFH\xe0\x02\x00WAVE', offset=8, extension='.wav', mime_type='audio/wave', name='Waveform Audio File Format', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'WAVEfmt ', offset=8, extension='.wav', mime_type='audio/x-wav', name='Windows audio file ', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'WAVE', offset=8, extension='.wav', mime_type='audio/x-wav', name='WAV audio', confidence=0.4)]

simonw commented 1 week ago

For the moment I'm going to take the position that audio/wav is correct and have LLM treat audio/wave as audio/wav in core. I'll change that if it turns out to be a mistake in the future.
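
The idea of treating one type as another can be sketched as a small alias lookup applied to the detected type. This is only an illustration of the approach, not the actual change made in LLM: `MIME_ALIASES` and `normalize_mime_type` are hypothetical names, and the only mapping taken from this thread is audio/wave to audio/wav.

```python
# Hypothetical alias table: detected type -> type the models advertise.
# Only the audio/wave entry comes from this thread.
MIME_ALIASES = {
    "audio/wave": "audio/wav",
}


def normalize_mime_type(mime_type: str) -> str:
    """Map a detected MIME type to its preferred spelling, if one is known."""
    return MIME_ALIASES.get(mime_type, mime_type)


print(normalize_mime_type("audio/wave"))  # audio/wav
print(normalize_mime_type("image/png"))   # image/png
```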

simonw commented 1 week ago

This works:

llm -m gemini-1.5-flash-latest -a output.wav transcribe

This is a quick test that I'm doing

NightMachinery commented 1 week ago

Thanks! ❤️ So llm detects the MIME type and hardcodes it for the API call? How does llm know whether the API accepts a given MIME type?

simonw commented 1 week ago

Each plugin defines the list of accepted MIME types like this:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/default_plugins/openai_models.py#L315-L333

Full docs here: https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#attachments-for-multi-modal-models
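
As a rough illustration of that mechanism, here is a standalone sketch that does not import llm. It assumes a model class exposes a set of accepted types as an `attachment_types` attribute, as the linked docs describe; `ExampleModel` and `check_attachment` are hypothetical names, and the error text only imitates the message quoted at the top of this issue.

```python
class ExampleModel:
    # Hypothetical model: a real plugin would declare this on its model class.
    attachment_types = {"image/png", "audio/wav", "audio/mp3"}


def check_attachment(model, mime_type: str) -> None:
    """Raise ValueError if the model does not list this MIME type."""
    if mime_type not in model.attachment_types:
        supported = ", ".join(sorted(model.attachment_types))
        raise ValueError(
            f"This model does not support attachments of type "
            f"'{mime_type}', only {supported}"
        )


model = ExampleModel()
check_attachment(model, "audio/wav")  # accepted, no error

try:
    check_attachment(model, "audio/wave")
except ValueError as error:
    print(error)
```

Because the check is a plain set membership test, a detected type of audio/wave fails even though the model lists audio/wav, which is the behaviour reported in this issue.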