mpv-player / mpv

🎥 Command line video player
https://mpv.io
Other
28.67k stars 2.93k forks source link

.ass file using UTF-16LE encoding with a BOM cannot be detected #11482

Open TieBaMma opened 1 year ago

TieBaMma commented 1 year ago

The following is an example file:

[WLGO&PPX][Itazura_na_Kiss][DVDRip][05][x264_AAC][880x480][FB7AD9EF].gb.zip

After saving as a file using e.g. UTF-8 encoding, the problem disappears, but it'll be more convenient if UTF-16LE with a BOM is supported.

dyphire commented 1 year ago

Can be reproduced, it seems that the file decoding failed. The UTF-8 encoding with a BOM will also fail.

Update: When I try to load the subtitle using ffplay, it also fails. so it's most likely an ffmpeg issue and should be reported there.

dyphire commented 1 year ago

If use --demuxer-lavf-format=ass to forcing the ffmpeg ass subtitle demuxer, mpv will successfully open this subtitle. The issue is that FFmpeg cannot detect the subtitle format correctly.

This appears to be an issue with FFmpeg’s ass_probe, which rejects all files that don’t start with optionally preceded.

dyphire commented 1 year ago

After that, I tried a dirty patch to fix the issue in FFmpeg, which can also fix another related issue https://github.com/mpv-player/mpv/issues/2846

From 4e88f228ec52d84fa84cfb892da51911b100150e Mon Sep 17 00:00:00 2001
From: dyphire <qimoge@gmail.com>
Date: Wed, 27 Sep 2023 22:16:48 +0800
Subject: [PATCH] avformat/assdec: more lenient probing

---
 libavformat/assdec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavformat/assdec.c b/libavformat/assdec.c
index 6b37af15e8..e2ad3f9a34 100644
--- a/libavformat/assdec.c
+++ b/libavformat/assdec.c
@@ -34,7 +34,7 @@ typedef struct ASSContext {

 static int ass_probe(const AVProbeData *p)
 {
-    char buf[13];
+    char buf[1000];
     FFTextReader tr;
     ff_text_init_buf(&tr, p->buf, p->buf_size);

@@ -43,7 +43,7 @@ static int ass_probe(const AVProbeData *p)

     ff_text_read(&tr, buf, sizeof(buf));

-    if (!memcmp(buf, "[Script Info]", 13))
+    if (strstr(buf, "[Script Info]") || strstr(buf, "[V4 Styles]") || strstr(buf, "[V4+ Styles]"))
         return AVPROBE_SCORE_MAX;

     return 0;
-- 

test build for windows: https://github.com/dyphire/mpv-winbuild/actions/runs/4493477864

I'm not satisfied with the way this patch works, but it's surprising that this issue has been around for over 7 years and still hasn't been fixed in FFmpeg.

llyyr commented 1 year ago

I reported this to trac https://trac.ffmpeg.org/ticket/10583

While the above diff should work, ffmpeg already has code for stripping bom marks so that patch is unlikely to be accepted upstream since it doesn't address the root cause of the issue.

llyyr commented 1 year ago

There are no bugs, this file just has a double BOM. I've verified that the first 4 bytes are fffe fffe, removing the second BOM allows the file to be loaded as utf16le correctly.

There are no bugs, however it might be possible to hack lavf to load these types of files, but I'd prefer not to submit such a patch. Are there a ton of files like this?

The UTF-8 encoding with a BOM will also fail.

This is not true, I tested utf-8, utf16be and utf16le BOMs they all work correctly.

You can also work around it like so

mpv [video] --sub-file=lavf://subfile,,start,2,end,0,,:[video]
dyphire commented 1 year ago

There are no bugs, however it might be possible to hack lavf to load these types of files, but I'd prefer not to submit such a patch. Are there a ton of files like this?

In fact, there are many subtitles files with this issue, and I have also seen feedback elsewhere. For this reason, I have written a python script to handle subtitle encoding conversion. see https://github.com/dyphire/subtitle-convert

However, what I am more concerned about is the issue https://github.com/mpv-player/mpv/issues/2846 There are much more subtitle files missing this header of [Script Info], and the users who are troubled by this are more widespread. As mentioned there https://github.com/libass/libass/issues/325#issuecomment-452033155, it is a more reasonable behavior to additionally detect the information of [V4+ Styles] for judgment, and no functioning ASS subtitles will be missing it.

llyyr commented 1 year ago

There are much more subtitle files missing this header of [Script Info],

Even if you make ffmpeg probe such broken files correctly, it's pointless because libass doesn't know how to parse PlayResX/PlayResY without the section header. So your subtitles will appear broken anyway.

I could make such a change to lavf if libass became dumber in parsing like VSFilter, but currently there's not much point.

In fact, there are many subtitles files with this issue, and I have also seen feedback elsewhere.

Hopefully it gets merged lol https://ffmpeg.org/pipermail/ffmpeg-devel/2023-September/314767.html

Also if you have a sample of a broken UTF-8 BOM file it'd be helpful

/cc @TheOneric

dyphire commented 1 year ago

Also if you have a sample of a broken UTF-8 BOM file it'd be helpful

The broken UTF-8 BOM file was only encoded from the example file, so it should be the same bug.

Even if you make ffmpeg probe such broken files correctly, it's pointless because libass doesn't know how to parse PlayResX/PlayResY without the section header. So your subtitles will appear broken anyway.

I could make such a change to lavf if libass became dumber in parsing like VSFilter, but currently there's not much point.

As mentioned in the comments above https://github.com/libass/libass/issues/325#issuecomment-452033155, libass’ own whole-file parser handles files without [Script Info] just fine (In the past, mpv was work with use it), as vsfilter did (mpc-hc/be).

This is an ffmpeg quirk rather than a libass issue. There are thousands or more of these subtitle files on the internet (I suspect they were generated by similar scripts), and it seems foolish to not attempt to fix this compatibility issue.

llyyr commented 1 year ago

libass’ own whole-file parser handles files without [Script Info] just fine, as vsfilter did (mpc-hc/be).

It can detect that it's an ass file, and playback dialogue correctly. However it won't read options without a header, unlike VSFilter.

You just get this message after fixing the ffmpeg bug, making the whole thing kind of pointless

[sub/ass] Neither PlayResX nor PlayResY defined. Assuming 384x288

dyphire commented 1 year ago

libass’ own whole-file parser handles files without [Script Info] just fine, as vsfilter did (mpc-hc/be).

It can detect that it's an ass file, and playback dialogue correctly. However it won't read options without a header, unlike VSFilter.

You just get this message after fixing the ffmpeg bug, making the whole thing kind of pointless

[sub/ass] Neither PlayResX nor PlayResY defined. Assuming 384x288

I don't think this is completely meaningless. If the ass tag is not used in these subtitles (It is likely that the srt subtitles were converted using the wrong script), then it has almost no impact. Of course, perhaps missing parsing should be fixed in libass.

I have ASS subtitles downloaded from various internet sites that do not have [Script Info] section completely (the first line is [V4+ Styles]).

In addition, there are extreme examples here https://github.com/libass/libass/issues/325#issue-396578735, these ass files completely lack PlayResX and PlayResY. libass and vsfilter will do the same thing. Anyway, it's better than failing to parse without rendering.

llyyr commented 1 year ago

I'll look into this in a few days, but it'd be better if libass attempted to find "PlayResX/Y:" even if it couldn't find the "[Script Info]" section header.

Of course, if there's a valid reason to not do this, then that's fair as well.

dyphire commented 1 year ago

I'll look into this in a few days, but it'd be better if libass attempted to find "PlayResX/Y:" even if it couldn't find the "[Script Info]" section header.

Of course, if there's a valid reason to not do this, then that's fair as well.

I noticed that only a simple libass patch is needed to parse missing "PlayResX/Y:" when the "[Script Info]" header does not exist.

From 4cfbbe2f70c02e156785de9bc12d124b6914bd29 Mon Sep 17 00:00:00 2001
From: dyphire <qimoge@gmail.com>
Date: Wed, 27 Sep 2023 22:48:04 +0800
Subject: [PATCH] Try to parse script properties even if missing "[Script Info]"

---
 libass/ass.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libass/ass.c b/libass/ass.c
index 0679483..c0734d2 100644
--- a/libass/ass.c
+++ b/libass/ass.c
@@ -1143,7 +1143,7 @@ mem_fail:
 static int process_line(ASS_Track *track, char *str)
 {
     skip_spaces(&str);
-    if (!ass_strncasecmp(str, "[Script Info]", 13)) {
+    if (!ass_strncasecmp(str, "[Script Info]", 13) || !ass_strncasecmp(str, "ScriptType:", 11)) {
         track->parser_priv->state = PST_INFO;
     } else if (!ass_strncasecmp(str, "[V4 Styles]", 11)) {
         track->parser_priv->state = PST_STYLES;
--