Scanning big directory - Githubissues

chapmanjacobd commented 2 years ago

I think the hunting for subtitles code is problematic

https://github.com/skorokithakis/catt/blob/03f1bfc769df97bb3423dbff9b4e5563dd81daac/catt/util.py#L62

It takes a really, really long time (10 minutes+) in a big folder

strace catt -d "Xylo and Orchestra" cast '/mnt/d/81_New_Music/unsorted/Alvars_Orkester_Power_Electronics_eIDxDKUBZxI_.oga'
...
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_18,_1928_Sonny_Til,_Crying_in_the_Chapel_76KRtx-jpxA_n.opus", {st_mode=S_IFREG|0644, st_size=2946692, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_22,_1934_John_Chowning,_Turenas_mjH7mjPGG4w_n.opus", {st_mode=S_IFREG|0644, st_size=3059163, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_23,_1931_Barbara_Eden,_Spinning_Wheel_Lpy0JaVzie0_n.opus", {st_mode=S_IFREG|0644, st_size=3601511, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_17,_1893_Mae_West,_I_m_No_Angel_TTMtHKa_SqA_n.opus", {st_mode=S_IFREG|0644, st_size=3405927, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_16,_1906_Irene_Taylor,_Willow_Weep_For_Me_Ann_Ronell_lKNh7UZa13w_n.opus", {st_mode=S_IFREG|0644, st_size=3361821, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_15,_1958_Abbey_Lincoln,_It_s_magic_rN1dLUNoJQ_n.opus", {st_mode=S_IFREG|0644, st_size=3882576, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_13,_1952_Hound_Dog,_Big_Mama_Thornton-JKmaXBdVAM_n.opus", {st_mode=S_IFREG|0644, st_size=2727226, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_12,_1918_Trudy_Erwin,_The_Coffee_Song_r5Mt9EaNfbM_n.opus", {st_mode=S_IFREG|0644, st_size=2118336, ...}, 0) = 0
newfstatat(AT_FDCWD, "/mnt/d/81_New_Music/unsorted/August_16,_2018_RIP_Aretha_Franklin,_I_say_a_little_prayer_3ZTJJlzZvIw_n.opus", ^Cstrace: Process 4398 detached
 <detached ...>

Aborted!

chapmanjacobd commented 2 years ago

yep. when I run with a fake subtitles file it is instant loading (300 ms)

touch /tmp/sub.srt
strace catt -d "Xylo and Orchestra" cast '/mnt/d/81_New_Music/unsorted/Alvars_Orkester_Power_Electronics_eIDxDKUBZxI_.oga' -s /tmp/sub.srt

chapmanjacobd commented 2 years ago

I could see two options for fixing this:

Improve the speed of searching a lot (but it might not be enough unless you only look for subtitles that are the same basename as the file)
Add a flag to disable automatic subtitle searching (I would probably prefer this fix)

Not urgent for me since I have already adopted the holy /tmp/sub.srt into my life ✝️

skorokithakis commented 2 years ago

Hmm, ten minutes? How many files do you have in that folder? We're just checking the filenames, which should be pretty fast, though maybe we could glob to make it faster.

chapmanjacobd commented 2 years ago

yeah I think

glob("*.vtt") + glob("*.srt")

would be a lot faster... but also if you could add --no-auto-subs that would be nice

skorokithakis commented 2 years ago

Yeah, I'm thinking of something like that as well... We could even glob on the original file's stem, to avoid looking through everything.

chapmanjacobd commented 2 years ago

yeah I think the iterdir() is maybe the thing that's making it slow

skorokithakis commented 2 years ago

Yeah, sounds like it. Unfortunately, we're doing case-insensitive matching, and any sort of globbing would break that... I don't have a folder big enough to test performance improvements, would you be able to test a regex-based alternative?

chapmanjacobd commented 2 years ago

yeah feel free to put some functions here and I'll test it from ipython

skorokithakis commented 2 years ago

Can you try this patch, actually?

diff --git a/catt/util.py b/catt/util.py
index ecc1086..d079c9e 100644
--- a/catt/util.py
+++ b/catt/util.py
@@ -1,5 +1,6 @@
 import ipaddress
 import json
+import re
 import socket
 import tempfile
 import time
@@ -63,11 +64,9 @@ def hunt_subtitles(video):
     """Searches for subtitles in the current folder"""

     video_path = Path(video)
-    video_path_stem_lower = video_path.stem.lower()
+    regex = re.compile(video_path.stem() + ".(vtt|srt)", re.I)
     for entry_path in video_path.parent.iterdir():
-        if entry_path.is_dir():
-            continue
-        if entry_path.stem.lower().startswith(video_path_stem_lower) and entry_path.suffix.lower() in [".vtt", ".srt"]:
+        if regex.match(entry_path):
             return str(entry_path.resolve())
     return None

chapmanjacobd commented 2 years ago

it should probably be video_path.stem not stem()

on a small directory it already seems faster. I'll try the big one next

import re
from pathlib import Path

video = "/home/xk/d/83_ClassicalComposers/01-Allegro_Di_Molto.opus"

def oldfunc():
    video_path = Path(video)
    video_path_stem_lower = video_path.stem.lower()
    for entry_path in video_path.parent.iterdir():
        if entry_path.is_dir():
            continue
        if entry_path.stem.lower().startswith(video_path_stem_lower) and entry_path.suffix.lower() in [".vtt", ".srt"]:
            return str(entry_path.resolve())

def newfunc():
    video_path = Path(video)
    regex = re.compile(video_path.stem + ".(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())

# oldfunc()
newfunc()

░███ /m/d 🥨 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
  Time (mean ± σ):      76.4 ms ±  25.1 ms    [User: 49.4 ms, System: 8.5 ms]
  Range (min … max):    52.9 ms … 182.8 ms    44 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

▒▓▓▓ 3.902s /m/d 🐻 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
  Time (mean ± σ):      62.9 ms ±  23.2 ms    [User: 42.6 ms, System: 6.1 ms]
  Range (min … max):    37.7 ms … 129.4 ms    56 runs

skorokithakis commented 2 years ago

Hm, that doesn't seem faster enough... It'll still take minutes, I was hoping for something that took milliseconds. Hm.

chapmanjacobd commented 2 years ago

that is ms. It is quite a bit faster. The 3.902s seconds is from my shell; the time it took for hyperfine to exit (after running a test where it ran 44 times)

here is the big folder:

OLD:
it's still running......

NEW:
Benchmark 1: python /tmp/test.py
  Time (mean ± σ):      1.872 s ±  1.384 s    [User: 1.183 s, System: 0.091 s]
  Range (min … max):    1.291 s …  5.800 s    10 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (5.800 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

1.291 s … 5.800 seconds is definitely okay! it's a big folder

skorokithakis commented 2 years ago

that is ms. It is quite a bit faster.

I know, but the first one is the old function and the second one is the new, no? It's around 8% faster, if so.

1.291 s … 5.800 seconds is definitely okay! it's a big folder

That's great, though now I'm confused because the initial benchmark doesn't seem enough to justify this improvement.

By the way, can you try patching catt to see if it works as intended? I didn't actually try whether it finds the file, I wouldn't want it to be broken...

Also, keep in mind that this might break early just because the file you picked was found soon, and it didn't go through the entire folder. If you can run both the old and the new code on the large folder, to make sure it actually takes a long time with the old code, I'd be grateful.

chapmanjacobd commented 2 years ago

big improvement! the old code is taking 474.056 s to run. hyperfine wants to run it a bunch of times to get a statistically significant result but that's gonna take an hour... so I'm just gonna do testing on the small folder (which is still quite large 50,000 files)

chapmanjacobd commented 2 years ago

hmmm yeah it prints None when I do

video = "/home/xk/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].webm"
...
print(newfunc())

but the subtitle files are named like this

Lookism (外貌至上主义) EP.34 - Eng Sub (Chinese Drama) [OupHlRLoZqs].en-GB.vtt

the old code is able to find it

skorokithakis commented 2 years ago

Ah ok, can you change the regex to:

regex = re.compile(video_path.stem + ".*\.(vtt|srt)", re.I)

chapmanjacobd commented 2 years ago

hmm same result

my newfunc2 is a tiny bit faster but it's not working either


def newfunc():
    video_path = Path(video)
    regex = re.compile(video_path.stem + r".*\.(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())

def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))

    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(video_path.stem + r".*")

        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return m.group(1)

print(oldfunc())
print(newfunc())
print(newfunc2())

 python /tmp/test.py
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
None
None

skorokithakis commented 2 years ago

That's very odd, can you print the stem and the video file so we can be sure that the regex will match? Also, can you add re.escape()?

    regex = re.compile(re.escape(video_path.stem) + r".*\.(vtt|srt)", re.I)

chapmanjacobd commented 2 years ago

I added re.escape and mine is working but not yours lol :/


def newfunc():
    video_path = Path(video)
    regex = re.compile(re.escape(video_path.stem) + r".*\.(vtt|srt)", re.I)
    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())

def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))

    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)

        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return Path(subtitle).resolve()

print(oldfunc())
print(newfunc())
print(newfunc2())

▓███ /m/d 🍙 python /tmp/test.py
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt
None
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt

skorokithakis commented 2 years ago

Hmm, can you print the filenames and regexes? I want to see what mine looks like, it's odd.

chapmanjacobd commented 2 years ago

ohhh maybe the . doesn't need to be escaped


import re
from pathlib import Path

video = "/home/xk/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].webm"

def newfunc():
    video_path = Path(video)
    regex = re.compile(re.escape(video_path.stem) + r".*.(vtt|srt)", re.I)
    print(regex)

    for entry_path in video_path.parent.iterdir():
        if regex.match(str(entry_path)):
            return str(entry_path.resolve())

def newfunc2():
    video_path = Path(video)
    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))

    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)
        print(EXP_VIDEO_FILE)

        for subtitle in subtitles:
            if m := EXP_VIDEO_FILE.match(subtitle.stem):
                return Path(subtitle).resolve()

print(newfunc())
print(newfunc2())

re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*\\.(vtt|srt)', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt

chapmanjacobd commented 2 years ago

it's still not working with r".*.(vtt|srt)"

re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*.(vtt|srt)', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt

I don't think python supports | with regex or globs or something

skorokithakis commented 2 years ago

Hm no, it definitely should... Does mine work if you remove the \.(vtt|srt) bit?

chapmanjacobd commented 2 years ago

ohhh weird yours doesn't work even with the same regex

░░▓█ /m/d 🍞 python /tmp/test.py
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
None
re.compile('Lookism\\ \\(外貌至上主义\\)\\ EP\\.30\\ \\-\\ Eng\\ Sub\\ \\(Chinese\\ Drama\\)\\ \\[bSA5udcf7mk\\].*', re.IGNORECASE)
/mnt/d/75_MovieQueue/Lookism (外貌至上主义) EP.30 - Eng Sub (Chinese Drama) [bSA5udcf7mk].en-GB.vtt

chapmanjacobd commented 2 years ago

my version is even faster on the big folder :)

▒▒▓▓ /m/d 🧸 hyperfine 'python /tmp/test.py'
Benchmark 1: python /tmp/test.py
  Time (mean ± σ):      1.323 s ±  0.269 s    [User: 0.582 s, System: 0.201 s]
  Range (min … max):    0.987 s …  1.807 s    10 runs

but feel free to keep playing around with it. I need to focus on something else

skorokithakis commented 2 years ago

Yes but your version isn't case insensitive, sadly...

chapmanjacobd commented 2 years ago

yes it is

    subtitles = list(video_path.parent.glob("*.vtt")) + list(video_path.parent.glob("*.srt"))

    if len(subtitles) > 0:
        EXP_VIDEO_FILE = re.compile(re.escape(video_path.stem) + r".*", re.I)

skorokithakis commented 2 years ago

Try searching for a file that ends in .VTT.

chapmanjacobd commented 2 years ago

well you could always do this

    subtitles = (
        list(video_path.parent.glob("*.vtt"))
        + list(video_path.parent.glob("*.srt"))
        + list(video_path.parent.glob("*.VTT"))
        + list(video_path.parent.glob("*.SRT"))
    )

skorokithakis commented 2 years ago

True, but then you wouldn't get files called .Srt or something (which is weird, but I'd rather not change the way it works). I'll try to figure out why my version doesn't work, thanks for your help!

chapmanjacobd commented 6 months ago

Another idea that I had is that you could try is to limit the total number of files scanned:

    MAX_PATHS_SCANNED = 10_000
    PATHS_SCANNED = 0
    with os.scandir(path_dir) as entries:
        for entry in entries:
            if entry.is_file():
                entry.path  # do glob / regex test here

            PATHS_SCANNED += 1
            if PATHS_SCANNED >= MAX_PATHS_SCANNED:
                break

skorokithakis / catt

Scanning big directory #393