Closed NightMachinery closed 1 week ago
Yup, that's a bug - thanks. You can work around it with the --at option, which lets you specify the type directly:
llm -m gemini-1.5-flash-latest --at output.wav audio/wav transcribe
Thanks for the tip about sox, by the way. This worked for me on macOS:
brew install sox
sox -d output.wav
# Hit Ctrl+C when done
It looks like audio/wav is indeed the correct content type here. It's not clear where audio/wave came from, but the library I'm using for content type detection - https://pypi.org/project/puremagic/ - apparently supports both wave (https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L103) and wav (https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L1118), and it looks like it detects audio/wave in preference for some reason.
puremagic uses data from https://www.garykessler.net/library/file_sigs.html - it lists two byte sequences for WAV. The first of those matches the puremagic definition of audio/wave, the second matches its audio/wav.
Interesting, the output.wav file I created using sox looks like this:
hexdump -C output.wav | head -n 4
00000000 52 49 46 46 48 e0 02 00 57 41 56 45 66 6d 74 20 |RIFFH...WAVEfmt |
00000010 28 00 00 00 fe ff 01 00 44 ac 00 00 10 b1 02 00 |(.......D.......|
00000020 04 00 20 00 16 00 20 00 04 00 00 00 01 00 00 00 |.. ... .........|
00000030 00 00 10 00 80 00 00 aa 00 38 9b 71 66 61 63 74 |.........8.qfact|
Which is BOTH of the lines in the file_sigs.html thing, so maybe I misinterpreted that and there is only one audio/wave file format and it's that?
In which case, why does puremagic have those two sequences listed separately in their magic_data.json file?
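One way to read the hexdump above is that both byte sequences are parts of a single RIFF header: bytes 0-3 are "RIFF", bytes 4-7 are a little-endian size, and bytes 8-11 are the form type "WAVE". Here is a minimal sketch of that reading, assuming the standard RIFF layout. The describe_riff_header function is a hypothetical helper, not part of puremagic or LLM.

```python
import struct


def describe_riff_header(header: bytes) -> dict:
    """Split the first 12 bytes of a RIFF file into its three fields.

    Hypothetical helper for illustration only.
    """
    if len(header) < 12:
        raise ValueError("need at least 12 bytes to read a RIFF header")
    # "<4sI4s": 4-byte chunk id, little-endian uint32 size, 4-byte form type
    chunk_id, chunk_size, form_type = struct.unpack("<4sI4s", header[:12])
    return {
        "chunk_id": chunk_id,
        "chunk_size": chunk_size,
        "form_type": form_type,
        "is_wav": chunk_id == b"RIFF" and form_type == b"WAVE",
    }


# First 16 bytes of output.wav, copied from the hexdump above
sample = bytes.fromhex("52494646 48e00200 57415645 666d7420")
info = describe_riff_header(sample)
print(info)
```

Under this reading, a signature anchored at offset 0 ("RIFF") and one anchored at offset 8 ("WAVE" or "WAVEfmt ") would both match the same file, which may be why several entries fire at once.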
This file in the puremagic tests has the same header: https://github.com/cdgriffith/puremagic/blob/master/test/resources/audio/test.wav
That's one of four audio files in the tests (https://github.com/cdgriffith/puremagic/tree/master/test/resources/audio), and the only assertion it runs is that the file extension .wav is correctly determined: https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/test/test_common_extensions.py#L43-L49
Filed an issue here:
But seeing as IANA doesn't list either audio/wav or audio/wave on https://www.iana.org/assignments/media-types/media-types.xhtml#audio, it's not clear that there IS a correct answer here!
Also relevant:
python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav
[PureMagicWithConfidence(byte_match=b'RIFFH\xe0\x02\x00WAVE', offset=8, extension='.wav', mime_type='audio/wave', name='Waveform Audio File Format', confidence=0.8),
PureMagicWithConfidence(byte_match=b'WAVEfmt ', offset=8, extension='.wav', mime_type='audio/x-wav', name='Windows audio file ', confidence=0.8),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4),
PureMagicWithConfidence(byte_match=b'WAVE', offset=8, extension='.wav', mime_type='audio/x-wav', name='WAV audio', confidence=0.4)]
For the moment I'm going to take the opinion that audio/wav is correct and have LLM treat audio/wave as audio/wav in core. I'll change that if it turns out to be a mistake in the future.
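A rough sketch of what treating audio/wave as audio/wav could look like: a small alias table applied to whatever type the detector reports. The names MIME_TYPE_ALIASES and normalize_mime_type are assumptions made for illustration and are not taken from the LLM source.

```python
# Hypothetical alias table: keys are detected types, values are the
# spelling we prefer to pass on. Only the pair from this thread is listed.
MIME_TYPE_ALIASES = {
    "audio/wave": "audio/wav",
}


def normalize_mime_type(detected: str) -> str:
    """Return the preferred spelling for a detected MIME type."""
    cleaned = detected.strip().lower()
    # Types without an alias pass through unchanged
    return MIME_TYPE_ALIASES.get(cleaned, cleaned)


print(normalize_mime_type("audio/wave"))
print(normalize_mime_type("image/png"))
```

Keeping the mapping in a table means that reversing the decision later, if audio/wave turns out to be the better choice, is a one-line change.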
This works:
llm -m gemini-1.5-flash-latest -a output.wav transcribe
This is a quick test that I'm doing
Thanks! ❤️ So llm detects the MIME type and hardcodes it for the API call? How does llm know whether the API accepts a given MIME type or not?
Each plugin defines the list of accepted MIME types like this:
Full docs here: https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#attachments-for-multi-modal-models
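As a rough illustration of the idea, the sketch below uses a plain class with an attachment_types set, which is the attribute name used in the linked docs. ExampleAudioModel and accepts are hypothetical stand-ins so that the example runs without LLM installed. This is not the real plugin base class, and the listed types are assumptions.

```python
class ExampleAudioModel:
    """Hypothetical stand-in for a plugin model class (not llm.Model)."""

    model_id = "example-audio-model"
    # MIME types this imaginary model is willing to receive as attachments
    attachment_types = {
        "audio/wav",
        "audio/mp3",
        "image/png",
    }


def accepts(model, mime_type: str) -> bool:
    """Check a detected MIME type against the model's declared set."""
    return mime_type in model.attachment_types


print(accepts(ExampleAudioModel, "audio/wav"))
print(accepts(ExampleAudioModel, "audio/wave"))
```

With a plain membership check like this, a file detected as audio/wave is rejected even though the model lists audio/wav, which matches the behaviour reported in this issue.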
I'm recording audio from my microphone using sox and saving the recordings as .wav files. When I try to attach these files to the gemini-1.5-flash-8b-latest model, I receive this error:
I suspect the issue is simply that llm doesn't recognize that audio/wave and audio/wav are actually the same MIME type. Is this correct?