sindresorhus / file-type

Detect the file type of a file, stream, or data
MIT License

Support for WebVTT files (text/vtt) #657

Closed AleksandrHovhannisyan closed 2 months ago

AleksandrHovhannisyan commented 3 months ago

First, thanks so much for this package! We've been using it at work to validate files uploaded by users and it works as expected for the majority of our use cases. There is one edge case where it doesn't currently validate WebVTT files (MIME type text/vtt, for captions shown in a video element's <track>).

According to the W3C document WebVTT: The Web Video Text Tracks Format, the magic numbers for VTT files are as follows:

WebVTT files all begin with one of the following byte sequences (where "EOF" means the end of the file):

EF BB BF 57 45 42 56 54 54 0A
EF BB BF 57 45 42 56 54 54 0D
EF BB BF 57 45 42 56 54 54 20
EF BB BF 57 45 42 56 54 54 09
EF BB BF 57 45 42 56 54 54 EOF
57 45 42 56 54 54 0A
57 45 42 56 54 54 0D
57 45 42 56 54 54 20
57 45 42 56 54 54 09
57 45 42 56 54 54 EOF

(An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file.)

Would it be possible to support this? If so, I'd be happy to help or put in a PR.

Borewit commented 3 months ago

That is in my opinion in scope.

Please note that we got the BOM covered in a generic way:

https://github.com/sindresorhus/file-type/blob/988bf4bc9f9bc98e8f3360da4dfa36e5caa455b3/core.js#L251-L255

So ignore the magic numbers with the BOM field (EF BB BF), those will be automatically covered.

I suggest triggering on WEBVTT and possibly also matching the character that follows it.
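
For illustration, the check described above could look roughly like the sketch below. It is standalone and does not use file-type's internals: `isWebVtt` is a hypothetical helper name, and the BOM is skipped inline here only so the example runs on its own (in file-type that step is already handled generically, per the link above). It also assumes the byte array holds the start of the file, so "EOF" is approximated as "the buffer ends right after WEBVTT".

```javascript
// Hypothetical helper for illustration; not part of file-type's API.
const WEBVTT_MAGIC = [0x57, 0x45, 0x42, 0x56, 0x54, 0x54]; // ASCII "WEBVTT"
const TERMINATORS = new Set([0x0A, 0x0D, 0x20, 0x09]); // LF, CR, space, tab

function isWebVtt(bytes) {
	// Skip an optional UTF-8 BOM (EF BB BF).
	const hasBom = bytes.length >= 3
		&& bytes[0] === 0xEF
		&& bytes[1] === 0xBB
		&& bytes[2] === 0xBF;
	const start = hasBom ? 3 : 0;
	const end = start + WEBVTT_MAGIC.length;

	// Not enough bytes to hold the signature.
	if (bytes.length < end) {
		return false;
	}

	// Compare the six signature bytes.
	for (let i = 0; i < WEBVTT_MAGIC.length; i++) {
		if (bytes[start + i] !== WEBVTT_MAGIC[i]) {
			return false;
		}
	}

	// Assumption: running out of bytes here means end of file.
	// Otherwise the next byte must be one of the allowed terminators.
	return end === bytes.length || TERMINATORS.has(bytes[end]);
}

// Usage: "WEBVTT" followed by a line feed, without and with a BOM.
const plain = Uint8Array.from([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0A]);
const withBom = Uint8Array.from([0xEF, 0xBB, 0xBF, ...plain]);
console.log(isWebVtt(plain), isWebVtt(withBom));
```

Checking the byte after WEBVTT is what separates a real signature from text that merely starts with the same letters, such as "WEBVTTX". One caveat: if the detector only sees a truncated sample that happens to end right after the sixth byte, the EOF branch would accept it, so a real implementation would need to know whether the sample reached the actual end of the file.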

AleksandrHovhannisyan commented 3 months ago

Thanks! That makes sense. I'll work on this and put up a PR.