Open chapmanjacobd opened 2 years ago
yep. when I run with a fake subtitles file it is instant loading (300 ms)
touch /tmp/sub.srt
strace catt -d "Xylo and Orchestra" cast '/mnt/d/81_New_Music/unsorted/Alvars_Orkester_Power_Electronics_eIDxDKUBZxI_.oga' -s /tmp/sub.srt
I could see two options for fixing this:
Not urgent for me since I have already adopted the holy /tmp/sub.srt into my life ✝️
Hmm, ten minutes? How many files do you have in that folder? We're just checking the filenames, which should be pretty fast, though maybe we could glob to make it faster.
yeah I think glob("*.vtt") + glob("*.srt") would be a lot faster... but also if you could add --no-auto-subs that would be nice
Yeah, I'm thinking of something like that as well... We could even glob on the original file's stem, to avoid looking through everything.
yeah I think the iterdir() is maybe the thing that's making it slow
Yeah, sounds like it. Unfortunately, we're doing case-insensitive matching, and any sort of globbing would break that... I don't have a folder big enough to test performance improvements, would you be able to test a regex-based alternative?
yeah feel free to put some functions here and I'll test it from ipython
Can you try this patch, actually?
diff --git a/catt/util.py b/catt/util.py
index ecc1086..d079c9e 100644
--- a/catt/util.py
+++ b/catt/util.py
@@ -1,5 +1,6 @@
 import ipaddress
 import json
+import re
 import socket
 import tempfile
 import time
@@ -63,11 +64,9 @@ def hunt_subtitles(video):
     """Searches for subtitles in the current folder"""
     video_path = Path(video)
-    video_path_stem_lower = video_path.stem.lower()
+    regex = re.compile(video_path.stem() + ".(vtt|srt)", re.I)
     for entry_path in video_path.parent.iterdir():
-        if entry_path.is_dir():
-            continue
-        if entry_path.stem.lower().startswith(video_path_stem_lower) and entry_path.suffix.lower() in [".vtt", ".srt"]:
+        if regex.match(entry_path):
             return str(entry_path.resolve())
     return None
it should probably be video_path.stem not stem()
on a small directory it already seems faster. I'll try the big one next
import re
from pathlib import Path

video = "/home/xk/d/83_ClassicalComposers/01-Allegro_Di_Molto.opus"


def oldfunc():
    video_path = Path(video)
    video_path_stem_lower = video_path.stem.lower()
    for entry_path in video_path.parent.iterdir():
        if entry_path.is_dir():
            continue
        if entry_path.stem.lower().startswith(video_path_stem_lower) and entry_path.suffix.lower() in [".vtt", ".srt"]:
            return str(entry_path.resolve())


def newfunc():
    video_path = Path(video)
    regex = re.compile(video_path.stem + ".(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())


# oldfunc()
newfunc()
░███ /m/d 🥨 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
Time (mean ± σ): 76.4 ms ± 25.1 ms [User: 49.4 ms, System: 8.5 ms]
Range (min … max): 52.9 ms … 182.8 ms 44 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
▒▓▓▓ 3.902s /m/d 🐻 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
Time (mean ± σ): 62.9 ms ± 23.2 ms [User: 42.6 ms, System: 6.1 ms]
Range (min … max): 37.7 ms … 129.4 ms 56 runs
Hm, that doesn't seem fast enough... It'll still take minutes, I was hoping for something that took milliseconds. Hm.
that is ms. It is quite a bit faster. The 3.902 s is from my shell: the time it took for hyperfine to exit (after a benchmark of 44 runs)
here is the big folder:
OLD:
it's still running......
NEW:
Benchmark 1: python /tmp/test.py
Time (mean ± σ): 1.872 s ± 1.384 s [User: 1.183 s, System: 0.091 s]
Range (min … max): 1.291 s … 5.800 s 10 runs
Warning: The first benchmarking run for this command was significantly slower than the rest (5.800 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
1.291 s … 5.800 seconds is definitely okay! it's a big folder
that is ms. It is quite a bit faster.
I know, but the first one is the old function and the second one is the new, no? It's around 18% faster (76.4 ms vs. 62.9 ms mean), if so.
1.291 s … 5.800 seconds is definitely okay! it's a big folder
That's great, though now I'm confused because the initial benchmark doesn't seem enough to justify this improvement.
By the way, can you try patching catt to see if it works as intended? I didn't actually try whether it finds the file, I wouldn't want it to be broken...
Also, keep in mind that this might break early just because the file you picked was found soon, and it didn't go through the entire folder. If you can run both the old and the new code on the large folder, to make sure it actually takes a long time with the old code, I'd be grateful.
big improvement! the old code is taking 474.056 s to run. hyperfine wants to run it a bunch of times to get a statistically significant result but that's gonna take an hour... so I'm just gonna do testing on the small folder (which is still quite large: 50,000 files)
hmmm yeah it prints None when I do
video = "/home/xk/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].webm"
...
print(newfunc())
but the subtitle files are named like this
Lookism (外貌至上主义) EP.34 - Eng Sub (Chinese Drama) [OupHlRLoZqs].en-GB.vtt
the old code is able to find it
Ah ok, can you change the regex to:
regex = re.compile(video_path.stem + r".*\.(vtt|srt)", re.I)
hmm same result
my newfunc2 is a tiny bit faster but it's not working either
def newfunc():
    video_path = Path(video)
    regex = re.compile(video_path.stem + r".*\.(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())


def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))
    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(video_path.stem + r".*")
        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return m.group(1)


print(oldfunc())
print(newfunc())
print(newfunc2())
python /tmp/test.py
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
None
None
That's very odd, can you print the stem and the video file so we can be sure that the regex will match? Also, can you add re.escape()?
regex = re.compile(re.escape(video_path.stem) + r".*\.(vtt|srt)", re.I)
I added re.escape and mine is working but not yours lol :/
def newfunc():
    video_path = Path(video)
    regex = re.compile(re.escape(video_path.stem) + r".*\.(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())


def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))
    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)
        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return Path(subtitle).resolve()


print(oldfunc())
print(newfunc())
print(newfunc2())
▓███ /m/d 🍙 python /tmp/test.py
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
None
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
Hmm, can you print the filenames and regexes? I want to see what mine looks like, it's odd.
ohhh maybe the . doesn't need to be escaped
import re
from pathlib import Path

video = "/home/xk/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].webm"


def newfunc():
    video_path = Path(video)
    regex = re.compile(re.escape(video_path.stem) + r".*.(vtt|srt)", re.I)
    print(regex)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())


def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))
    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)
        print(EXP_VIDEO_FILE)
        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return Path(subtitle).resolve()


print(newfunc())
print(newfunc2())
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*\\.(vtt|srt)', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
it's still not working with r".*.(vtt|srt)"
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*.(vtt|srt)', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
I don't think python supports | with regex or globs or something
Hm no, it definitely should... Does mine work if you remove the \.(vtt|srt) bit?
ohhh weird yours doesn't work even with the same regex
░░▓█ /m/d 🍞 python /tmp/test.py
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
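One plausible explanation for the None, offered as a guess since the thread never confirms it: re.match only succeeds at the very start of the string, and newfunc feeds it str(entry_path), which begins with the directory ("/mnt/d/...") rather than with the stem. newfunc2 matches against subtitle.stem, which does start with the video's stem. A minimal sketch using the filename from the output above (PurePosixPath is used only so that nothing has to exist on disk):

```python
import re
from pathlib import PurePosixPath

video = PurePosixPath(
    "/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].webm"
)
# Hypothetical sibling subtitle file, named like the one oldfunc() found.
subtitle = video.with_name(video.stem + ".en-GB.vtt")

regex = re.compile(re.escape(video.stem) + r".*\.(vtt|srt)", re.I)

# The full path starts with "/mnt/d/...", so a match anchored at position 0 fails.
print(regex.match(str(subtitle)))        # None
# The bare filename starts with the stem, so the same regex matches.
print(bool(regex.match(subtitle.name)))  # True
```

If that is indeed the cause, matching against entry_path.name instead of str(entry_path) would be the smallest change to newfunc.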
my version is even faster on the big folder :)
▒▒▓▓ /m/d 🧸 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
Time (mean ± σ): 1.323 s ± 0.269 s [User: 0.582 s, System: 0.201 s]
Range (min … max): 0.987 s … 1.807 s 10 runs
but feel free to keep playing around with it. I need to focus on something else
Yes but your version isn't case insensitive, sadly...
yes it is
subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))
if len(subtitles) > 0:
    EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)
Try searching for a file that ends in .VTT.
well you could always do this
subtitles = (
    list(video_path.parent.glob("*.vtt"))
    + list(video_path.parent.glob("*.srt"))
    + list(video_path.parent.glob("*.VTT"))
    + list(video_path.parent.glob("*.SRT"))
)
True, but then you wouldn't get files called .Srt or something (which is weird, but I'd rather not change the way it works). I'll try to figure out why my version doesn't work, thanks for your help!
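A possible middle ground for the mixed-case extension problem, sketched under assumptions: glob patterns accept character classes, so a pattern like *.[sS][rR][tT] picks up .srt, .SRT and .Srt alike. The helper names below are made up for illustration and are not part of catt, and the stem would still need its own case-insensitive check afterwards.

```python
from pathlib import Path


def case_insensitive_ext_pattern(ext):
    """Turn 'srt' into '*.[sS][rR][tT]' so the glob accepts any capitalisation."""
    classes = "".join(
        f"[{c.lower()}{c.upper()}]" if c.isalpha() else c for c in ext
    )
    return f"*.{classes}"


def glob_subtitles(folder, extensions=("vtt", "srt")):
    """Collect subtitle candidates in `folder` regardless of extension case."""
    found = []
    for ext in extensions:
        found.extend(Path(folder).glob(case_insensitive_ext_pattern(ext)))
    return found
```

Whether this is actually faster than a single iterdir() pass with a regex would need to be measured on the big folder; each glob call still has to list the directory.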
Another idea I had is to limit the total number of files scanned:
import os

MAX_PATHS_SCANNED = 10_000
PATHS_SCANNED = 0
# path_dir is a placeholder for the video's parent folder
with os.scandir(path_dir) as entries:
    for entry in entries:
        if entry.is_file():
            entry.path  # do glob / regex test here
        PATHS_SCANNED += 1
        if PATHS_SCANNED >= MAX_PATHS_SCANNED:
            break
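Putting the ideas from this thread together, a rough sketch could look like the following. It is not catt's implementation: hunt_subtitles_sketch and max_scanned are hypothetical names, and it assumes the desired behaviour is the old one (name starts with the video's stem, extension is .vtt or .srt in any case). It matches on entry.name rather than the full path and uses fullmatch so that something like name.vtt.bak is rejected.

```python
import os
import re
from typing import Optional


def hunt_subtitles_sketch(video: str, max_scanned: int = 10_000) -> Optional[str]:
    """Return the first .vtt/.srt file whose name starts with the video's stem."""
    folder, filename = os.path.split(os.path.abspath(video))
    stem = os.path.splitext(filename)[0]
    # Escape the stem: names like "[bSA5udcf7mk]" contain regex metacharacters.
    pattern = re.compile(re.escape(stem) + r".*\.(vtt|srt)", re.IGNORECASE)
    with os.scandir(folder) as entries:
        for scanned, entry in enumerate(entries, start=1):
            # Cheap string test first; is_file() is only called on candidates.
            if pattern.fullmatch(entry.name) and entry.is_file():
                return os.path.realpath(entry.path)
            # Give up after max_scanned entries so huge folders stay bounded.
            if scanned >= max_scanned:
                break
    return None
```

The cap trades completeness for a bounded scan: a subtitle that happens to be listed after the first max_scanned entries would be missed, so the limit would probably need to be configurable.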
I think the hunting for subtitles code is problematic
https://github.com/skorokithakis/catt/blob/03f1bfc769df97bb3423dbff9b4e5563dd81daac/catt/util.py#L62
It takes a really, really long time (10 minutes+) in a big folder