pinterf / masktools

MaskTools v2 fork

Speed question #11

Closed TbtBI closed 4 years ago

TbtBI commented 4 years ago

Hi, I tested the speed of this script: Avs+ 3.5, masktools 2.2.21

1080p input 8-bit
debandmask(60,16,4,4,1) # 659.3 fps
1080p input 8-bit
ConvertBits(16)
debandmask(60,16,4,4,1) # 209.5 fps

If I replace use_expr=1 with use_expr=0 in the script, the speed drops to 2-3 fps. Is such a speed drop normal?

VS r48

clip = 1080p input 8-bit
clip = debandmask(clip, 60,16,4,4,1) # 748.9 fps
clip = 1080p input 8-bit
clip = core.resize.Spline36(clip, format=vs.YUV420P16)
clip = debandmask(clip, 15360,4096,1024,1024,1) # 648.0 fps

I expected the high bit depth speed to be similar between avs+ and vs. Why is the avs+ (masktools) speed at >8-bit so low?

pinterf commented 4 years ago

I don't know what debandmask looks like inside (it is not a masktools function), but use_expr=1 calls the avs+ Expr filter, passing the lut expression on to Avisynth+. There is use_expr=1 and even 2; look at the documentation. I think they mean: 1 = call Expr only if mt_lut cannot use lookup tables (for memory reasons, which is the case for a 16-bit lut_xy, for example); 2 = always call Expr.

Expr in Avisynth+, and in VapourSynth, is a just-in-time compiled machine-code version of the expression evaluation. The two have more or less the same speed for similar expressions. In masktools, the non-lookup version is just painfully slow C code.

Other reasons for speed differences -> ask the script writers :) Either they are not the very same procedures, and/or the avs version is too generic (I'd write it using only Expr without involving mt_lut, but then it wouldn't run on Avisynth 2.6 and earlier). Or the script does not use optimizations, and/or other filters used in the Avisynth version are older and not well optimized.

Anyway, if the difference comes from a masktools function alone, then I could investigate it, but I think most other masktools functions are good enough; most of them even have AVX2.
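
To make the lookup-table versus per-pixel distinction above concrete, here is a minimal plain-Python sketch. It is not masktools code: the helper names (`build_lut_xy`, `apply_lut_xy`, `apply_direct`) are made up for illustration, and clamping the result to 0..255 is an assumption about how an 8-bit filter would treat out-of-range values.

```python
def clamp8(value):
    """Clamp a value to the 8-bit range 0..255 (assumed output behaviour)."""
    return max(0, min(255, value))


def build_lut_xy(func):
    """Precompute func(x, y) for every 8-bit pair: 256 * 256 = 65536 entries."""
    return [[clamp8(func(x, y)) for y in range(256)] for x in range(256)]


def apply_lut_xy(lut, plane_x, plane_y):
    """Lookup path: one table read per pixel, no expression evaluation."""
    return [lut[x][y] for x, y in zip(plane_x, plane_y)]


def apply_direct(func, plane_x, plane_y):
    """Non-lookup path: evaluate the expression again for every pixel."""
    return [clamp8(func(x, y)) for x, y in zip(plane_x, plane_y)]


def diff(x, y):
    """The "x y -" expression used for rmask in the thread."""
    return x - y


lut = build_lut_xy(diff)
ma = [200, 50, 128, 255]   # hypothetical pixels from the expanded clip
mi = [180, 60, 128, 0]     # hypothetical pixels from the inpanded clip
print(apply_lut_xy(lut, ma, mi))
print(apply_direct(diff, ma, mi))
```

Both paths give the same pixels; the difference is only where the work happens. The table costs 65536 evaluations up front, after which each pixel is a single read, while the direct path pays for the expression on every pixel of every frame.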

TbtBI commented 4 years ago

I tagged "this script" with a link to the script in the first message - https://pastebin.com/Uii7yWqF The link has both the AVS+ and the VS script. The script contains mainly mt_expand, mt_inpand and mt_lutxy. Passing the mt_lutxy calculations to Expr with use_expr=1/2/3 improves the speed with >8-bit video by 50-60x (~4 fps vs ~210 fps). It seems that this is the place where masktools has only painfully slow C code? Also, 210 fps (Avs+ Expr or mt_lutxy with use_expr>0) is ~3x slower than VS Expr (650 fps). I think the calculations in VS and AVS+ in that script are the same; that's why I expected similar speed between AVS+ and VS.

pinterf commented 4 years ago

This is definitely different: the Avisynth version uses rmask = Dither_build_gf3_range_mask(s, mrad)

TbtBI commented 4 years ago
Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = Expr(ma, mi, "x y -")

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 255 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 255 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 255 0 ? ? ?"

    Expr(s, rmask, mexpr, scale_inputs="all")
}
def mt_expand_multi(src, mode='rectangle', planes=None, sw=1, sh=1):
    if not isinstance(src, vs.VideoNode):
        raise vs.Error('mt_expand_multi: This is not a clip')

    if sw > 0 and sh > 0:
        mode_m = [0, 1, 0, 1, 1, 0, 1, 0] if mode == 'losange' or (mode == 'ellipse' and (sw % 3) != 1) else [1, 1, 1, 1, 1, 1, 1, 1]
    elif sw > 0:
        mode_m = [0, 0, 0, 1, 1, 0, 0, 0]
    elif sh > 0:
        mode_m = [0, 1, 0, 0, 0, 0, 1, 0]
    else:
        mode_m = None

    if mode_m is not None:
        return mt_expand_multi(core.std.Maximum(src, planes=planes, coordinates=mode_m), mode=mode, planes=planes, sw=sw - 1, sh=sh - 1)
    else:
        return src

def mt_inpand_multi(src, mode='rectangle', planes=None, sw=1, sh=1):
    if not isinstance(src, vs.VideoNode):
        raise vs.Error('mt_inpand_multi: This is not a clip')

    if sw > 0 and sh > 0:
        mode_m = [0, 1, 0, 1, 1, 0, 1, 0] if mode == 'losange' or (mode == 'ellipse' and (sw % 3) != 1) else [1, 1, 1, 1, 1, 1, 1, 1]
    elif sw > 0:
        mode_m = [0, 0, 0, 1, 1, 0, 0, 0]
    elif sh > 0:
        mode_m = [0, 1, 0, 0, 0, 0, 1, 0]
    else:
        mode_m = None

    if mode_m is not None:
        return mt_inpand_multi(core.std.Minimum(src, planes=planes, coordinates=mode_m), mode=mode, planes=planes, sw=sw - 1, sh=sh - 1)
    else:
        return src
Function mt_expand_multi (clip src, int "thY", int "thC", string "mode",
\   int "offx", int "offy", int "w", int "h", int "y", int "u", int "v",
\   string "chroma", int "sw", int "sh")
{
    sw   = Default (sw, 1)
    sh   = Default (sh, 1)
    mode = Default (mode, "rectangle")

    mode_m =
\     (sw > 0 && sh > 0) ? (
\         (mode == "losange" || (mode == "ellipse" && (sw % 3) != 1))
\       ? "both" : "square"
\                          )
\   : (sw > 0          ) ? "horizontal"
\   : (          sh > 0) ? "vertical"
\   :                      ""

    (mode_m != "") ? src.mt_expand (
\       thY=thY, thC=thC, mode=mode_m,
\       offx=offx, offy=offy, w=w, h=h, y=y, u=u, v=v, chroma=chroma
\   ).mt_expand_multi (
\       thY=thY, thC=thC, mode=mode,
\       offx=offx, offy=offy, w=w, h=h, y=y, u=u, v=v, chroma=chroma,
\       sw=sw-1, sh=sh-1
\   ) : src
}

Function mt_inpand_multi (clip src, int "thY", int "thC", string "mode",
\   int "offx", int "offy", int "w", int "h", int "y", int "u", int "v",
\   string "chroma", int "sw", int "sh")
{
    sw   = Default (sw, 1)
    sh   = Default (sh, 1)
    mode = Default (mode, "rectangle")

    mode_m =
\     (sw > 0 && sh > 0) ? (
\         (mode == "losange" || (mode == "ellipse" && (sw % 3) != 1))
\       ? "both" : "square"
\                          )
\   : (sw > 0          ) ? "horizontal"
\   : (          sh > 0) ? "vertical"
\   :                      ""

    (mode_m != "") ? src.mt_inpand (
\       thY=thY, thC=thC, mode=mode_m,
\       offx=offx, offy=offy, w=w, h=h, y=y, u=u, v=v, chroma=chroma
\   ).mt_inpand_multi (
\       thY=thY, thC=thC, mode=mode,
\       offx=offx, offy=offy, w=w, h=h, y=y, u=u, v=v, chroma=chroma,
\       sw=sw-1, sh=sh-1
\   ) : src
}

Now 8-bit has ~290 fps and >8-bit has ~215 fps (same speed). mt_expand/inpand and core.std.Maximum/Minimum have identical output and ~same speed.
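
As a rough illustration of what the expand / inpand / subtract chain above computes for mrad=1, here is a small pure-Python sketch of a 3x3 range mask. It is a sketch under assumptions, not the plugin code: the function names are hypothetical, the full 3x3 window corresponds to the all-ones coordinate list selected for sw=sh=1, and neighbours outside the frame are simply ignored, which may differ from how the real filters treat borders.

```python
def neighbourhood(plane, row, col):
    """Values in the 3x3 window around (row, col), clipped at the borders."""
    height, width = len(plane), len(plane[0])
    return [plane[r][c]
            for r in range(max(0, row - 1), min(height, row + 2))
            for c in range(max(0, col - 1), min(width, col + 2))]


def expand(plane):
    """Local maximum, the role of mt_expand / std.Maximum in the scripts."""
    return [[max(neighbourhood(plane, r, c)) for c in range(len(plane[0]))]
            for r in range(len(plane))]


def inpand(plane):
    """Local minimum, the role of mt_inpand / std.Minimum in the scripts."""
    return [[min(neighbourhood(plane, r, c)) for c in range(len(plane[0]))]
            for r in range(len(plane))]


def range_mask(plane):
    """Per-pixel "x y -" applied to the expanded and inpanded planes."""
    ma, mi = expand(plane), inpand(plane)
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(ma, mi)]


flat = [[10, 10, 10] for _ in range(3)]   # no detail anywhere
edge = [[10, 10, 90] for _ in range(3)]   # vertical edge in the last column
print(range_mask(flat))
print(range_mask(edge))
```

Flat areas give a range of zero, while pixels whose window touches the edge pick up the full step of 80. That local range is what the second expression in debandmask then thresholds.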

pinterf commented 4 years ago

So the 8-bit version has become slower after being made identical to the VS version?

TbtBI commented 4 years ago
Function Dither_build_gf3_range_mask (clip src, int radius)
{
    src
    ma  = (radius >  1) ? mt_expand_multi (sw=radius, sh=radius, mode="ellipse") : last
    mi  = (radius >  1) ? mt_inpand_multi (sw=radius, sh=radius, mode="ellipse") : last

    (radius >  1) ? mt_lutxy (ma, mi, "x y -")
\                 : mt_edge (mode="min/max", thY1=0, thY2=255)
}

So debandmask(60,16,4,4,1) didn't use mt_expand/inpand_multi; that's why 8-bit was so fast. But...

Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = Expr(ma, mi, "x y -")

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 255 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 255 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 255 0 ? ? ?"

    Expr(s, rmask, mexpr, scale_inputs="all")
}

debandmask(60,16,4,4,1)

8-bit: 278.3; 16-bit: 213.2

Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = mt_lutxy(ma, mi, "x y -")

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 255 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 255 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 255 0 ? ? ?"

    mt_lutxy(s, rmask, mexpr, scale_inputs="all")
}

debandmask(60,16,4,4,1)

8-bit: 439.1; 16-bit: 4.1 fps

Btw, VS and AVS+ mt_expand/inpand_multi have identical output and the same speed.

pinterf commented 4 years ago

Then is it Expr? scale_inputs='all' converts non-8-bit data into the 8-bit range behind the scenes, so that an original expression written many years ago can be left as-is. This makes life much easier for whoever ports the scripts, but it can be slower. You'll notice that in all the VS expressions it is the parameters that get scaled, not the pixels themselves. After the calculation is done, the result is scaled back to the 16-bit range. These operations amount to an additional multiplication by 1/256 for each input pixel and a multiplication by 256.0 at the end of the expression. I wonder what happens if you use the VS expression and pass 16-bit-scale parameters, thus cutting the 'convenience' overhead.
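
The two approaches described above (scaling the pixels versus scaling the parameters) can be compared with a small Python model. This is a simplified sketch, not the Expr implementation: `mask_value` is a hand-written infix reading of the thread's mexpr, the function names are hypothetical, and the 1/256 and 256.0 factors are taken from the description above.

```python
def mask_value(x, y, lo, hi, lothr, hithr, peak):
    """Infix form of the thread's mexpr for one pixel pair (assumes hi != lo)."""
    if x < lo:
        return peak if y >= lothr else 0
    if x > hi:
        return peak if y >= hithr else 0
    threshold = (x - lo) / (hi - lo) * (hithr - lothr) + lothr
    return peak if y >= threshold else 0


def via_scale_inputs(x16, y16):
    """Model of scale_inputs: pixels scaled down, 8-bit parameters, result scaled up."""
    return mask_value(x16 / 256.0, y16 / 256.0, 60, 16, 4, 4, 255) * 256.0


def via_scaled_params(x16, y16):
    """Pixels untouched; parameters pre-scaled to 16 bits, as in the VS call."""
    return mask_value(x16, y16, 15360, 4096, 1024, 1024, 65535)


for x16, y16 in [(1000, 2000), (1000, 500), (40000, 1024)]:
    print(x16, y16, via_scale_inputs(x16, y16), via_scaled_params(x16, y16))
```

In this model both variants mark exactly the same pixels; only the "on" value differs (255 * 256 = 65280 versus 65535), and the first variant spends two extra multiplications per input pixel pair plus one on the result.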

TbtBI commented 4 years ago
Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    #Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    #Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = Expr(ma, mi, "x y -")

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 65535 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 65535 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 65535 0 ? ? ?"

    Expr(s, rmask, mexpr)
}
debandmask(15360,4096,1024,1024,1)

16-bit: 240.1

pinterf commented 4 years ago

I don't know without seeing it. Is your input ColorBars or BlankClip? I'd try omitting the last Expr: return rmask? I'd simplify the whole script until I see the key difference. A 2x-4x difference is a lot! And since the basics are the same, there must be a reason for this deviation.

TbtBI commented 4 years ago

Replacing the last line, Expr(s, rmask, mexpr), with rmask gives 16-bit: 456.7 fps. Here is the sample I tested with - https://gofile.io/?c=AOjlJC

pinterf commented 4 years ago

I give up. I had to implement AVX2 for mt_expand/inpand, but the gain is not much in this script. It's not bad either: mt_expand/mt_inpand got a 1.5x speed boost when measured alone with BlankClip, so other scripts can probably benefit.

For benchmarking, AVSMeter64 2.9.9.1 and the VSEdit benchmark were used.

Processing BlankClip 640x480 YUV420P16 with debandmask: avs+: 1080 (SSE4.1), 1140 (AVX2 in mt_expand/inpand); vs: 910

Processing ffms2 video with debandmask: avs+: 113 (SSE4.1), 116 (AVX2 in mt_expand/inpand); vs: 196

ffms2 + 16-bit convert only: avs+: 234 fps; vs: 205 fps

ffms2 + 16-bit convert + Invert (just trying an extra simple filter): avs+: 219 fps; vs: 192 fps

I can only imagine that ffms2 is receiving more linear frame requests in VapourSynth for this specific script, thus is quicker.

TbtBI commented 4 years ago

I see.

Thanks for your time.

One last thing... Is there any chance of getting decent speed with mt_lutxy at 10-16 bits? For 8-bit it's better to use mt_lutxy instead of Expr (for that script):

**Expr**
AVSMeter 2.9.9.1 (x64), 2012-2020, (c) Groucho2004
AviSynth+ 3.5 (r3132, master, x86_64) (3.5.0.0)

Number of frames:                     4131
Length (hh:mm:ss.ms):         00:02:52.125
Frame width:                          1920
Frame height:                          804
Framerate:                          24.000 (24/1)
Colorspace:                             Y8

Frames processed:                   4131 (0 - 4130)
FPS (min | max | average):          77.08 | 320.0 | 304.4
Process memory usage (max):         174 MiB
Thread count:                       34
CPU usage (average):                25.5%

Time (elapsed):                     00:00:13.570

AVSMeter 2.9.9.1 (x64), 2012-2020, (c) Groucho2004
AviSynth+ 3.5 (r3132, master, x86_64) (3.5.0.0)

Number of frames:                     4131
Length (hh:mm:ss.ms):         00:02:52.125
Frame width:                          1920
Frame height:                          804
Framerate:                          24.000 (24/1)
Colorspace:                             Y8

Frames processed:                   4131 (0 - 4130)
FPS (min | max | average):          126.1 | 318.7 | 305.2
Process memory usage (max):         174 MiB
Thread count:                       34
CPU usage (average):                25.0%

Time (elapsed):                     00:00:13.533
**mt_lutxy**
AVSMeter 2.9.9.1 (x64), 2012-2020, (c) Groucho2004
AviSynth+ 3.5 (r3132, master, x86_64) (3.5.0.0)

Number of frames:                     4131
Length (hh:mm:ss.ms):         00:02:52.125
Frame width:                          1920
Frame height:                          804
Framerate:                          24.000 (24/1)
Colorspace:                             Y8

Frames processed:                   4131 (0 - 4130)
FPS (min | max | average):          108.6 | 483.2 | 450.4
Process memory usage (max):         173 MiB
Thread count:                       34
CPU usage (average):                32.3%

Time (elapsed):                     00:00:09.172

AVSMeter 2.9.9.1 (x64), 2012-2020, (c) Groucho2004
AviSynth+ 3.5 (r3132, master, x86_64) (3.5.0.0)

Number of frames:                     4131
Length (hh:mm:ss.ms):         00:02:52.125
Frame width:                          1920
Frame height:                          804
Framerate:                          24.000 (24/1)
Colorspace:                             Y8

Frames processed:                   4131 (0 - 4130)
FPS (min | max | average):          114.8 | 484.2 | 453.4
Process memory usage (max):         174 MiB
Thread count:                       34
CPU usage (average):                32.3%

Time (elapsed):                     00:00:09.112

mt_lutxy is faster and more efficient. Maybe something similar could be achieved for high bit depth too?

pinterf commented 4 years ago

It depends on the memory size. A 16-bit lutxy requires 65536 x 65536 x 2 bytes (8 GB), and filling this initial table is probably quite slow. So at 16 bits: no.

For 10-bit data a 2D LUT requires only 1024 x 1024 x 2 bytes (2 MB) and has to be initialized with only about 1 million expression evaluations. I think at 10 bits masktools would still use its own lutxy when use_expr=1.

If you have a lot of memory and like adventures, you can force 16-bit mt_lutxy to use a real LUT (lookup table) with realcalc=false (if I remember the parameter name correctly). I tried it when I developed realCalc. Of course, you need an x64 system.
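
The table sizes quoted above follow from simple arithmetic, sketched here in Python. The helper name `lut_xy_bytes` is made up for illustration, and the storage assumption (one byte per entry at 8 bits, two bytes per entry above that) is inferred from the "x 2 bytes" figures in this comment.

```python
def lut_xy_bytes(bits):
    """Bytes needed for a full two-input lookup table at the given bit depth."""
    entries = (1 << bits) ** 2             # one entry per (x, y) pair
    bytes_per_entry = 1 if bits <= 8 else 2
    return entries * bytes_per_entry


for bits in (8, 10, 16):
    size = lut_xy_bytes(bits)
    print(f"{bits:2d}-bit: {(1 << bits) ** 2} entries, {size / 2**20:g} MiB")
```

The entry count grows with the square of the number of pixel values, so going from 10 to 16 bits multiplies the table by 4096, from 2 MiB to 8 GiB.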

pinterf commented 4 years ago

For really simple expressions and with AVX2, I can imagine that Expr can be faster than lutxy. I don't know your test expression, nor whether you have an AVX2 CPU.

TbtBI commented 4 years ago

I tried realtime=true (16-bit). I waited 8 min. and canceled AVSMeter.

I used these scripts for the results in my previous comment + the sample I shared 3 days ago:

FFVideoSource("ra.mkv")
Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = Expr(ma, mi, "x y -")

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 255 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 255 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 255 0 ? ? ?"

    Expr(s, rmask, mexpr, scale_inputs="all")
}

debandmask(60,16,4,4,1)
FFVideoSource("ra.mkv")
Function debandmask(clip c, int "lo", int "hi", int "lothr", int "hithr", int "mrad", bool "stack16_in", bool "stack16_out")
{
    lo = Default(lo, 60)
    hi = Default(hi, 16)
    lothr = Default(lothr, 4)
    hithr = Default(hithr, 4)
    mrad = Default(mrad, 1)
    stack16_in = Default(stack16_in, false)
    stack16_out = Default(stack16_out, false)

    Assert(lo >= 0 && hi >= 0 && lothr >= 0 && hithr >= 0,         "debandmask: lo/hi/lothr/hithr must be >= 0")
    Assert(lo <= 255 && hi <= 255 && lothr <= 255 && hithr <= 255, "debandmask: lo/hi/lothr/hithr must be <= 255")

    stack16_in == false ? c : ConvertFromStacked(c)
    s = ExtractY()

    ma = mt_expand_multi(s, mode="ellipse", sw=mrad, sh=mrad)
    mi = mt_inpand_multi(s, mode="ellipse", sw=mrad, sh=mrad)

    rmask = mt_lutxy(ma, mi, "x y -", realtime=false)

    mexpr = "x "+String(lo)+" < y "+String(lothr)+" >= 255 0 ? x "+String(hi)+" > y "+String(hithr)+" >= 255 0 ? y x "+String(lo)+" - "+String(hi)+" "+String(lo)+" - / "+String(hithr)+" "+String(lothr)+" - * "+String(lothr)+" + >= 255 0 ? ? ?"

    mt_lutxy(s, rmask, mexpr, scale_inputs="all", realtime=false)
}

debandmask(60,16,4,4,1)

Here my setup info:

[OS/Hardware info]
Operating system:           Windows 10 (x64) (Build 17763)

CPU:                        Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz / Skylake-X (Core i9)
                            MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, FMA3, AVX, AVX2, F16C, AVX512F, AVX512DQ, AVX512CD, AVX512BW, AVX512VL, BMI1, BMI2
                            16 physical cores / 16 logical cores

[Avisynth info]
VersionString:              AviSynth+ 3.5 (r3132, master, x86_64)
VersionNumber:              2.60
File / Product version:     3.5.0.0 / 3.5.0.0
Interface Version:          6
Multi-threading support:    Yes
Avisynth.dll location:      C:\Windows\SYSTEM32\avisynth.dll
Avisynth.dll time stamp:    2020-04-02, 13:15:15 (UTC)
PluginDir2_5 (HKLM, x64):   C:\Program Files (x86)\AviSynth+\plugins64
PluginDir+   (HKLM, x64):   C:\Program Files (x86)\AviSynth+\plugins64+

pinterf commented 4 years ago

> I tried realtime=true (16-bit). I waited 8 min. and canceled AVSMeter.

Yes, it's not easy: first it has to complete 2^32 expression evaluations to fill the LUT slots.
If you try a simple "x y -" alone and nothing else, you will probably get results within minutes. Only _then_ is it fast. You can extrapolate the time needed for a 16-bit lutxy from a 10-bit case with realcalc=true by multiplying the time by 64x64=4096.
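
The extrapolation suggested above can be written down as a short Python sketch. It assumes the fill time is proportional to the number of table entries; the function names are hypothetical and the 0.5 second figure is an invented placeholder, not a measurement from this thread.

```python
def lut_fill_scale(from_bits, to_bits):
    """Ratio of two-input LUT entry counts between two bit depths."""
    return (1 << to_bits) ** 2 // (1 << from_bits) ** 2


def extrapolate_seconds(measured_seconds, from_bits=10, to_bits=16):
    """Estimate the fill time at to_bits from a timing taken at from_bits."""
    return measured_seconds * lut_fill_scale(from_bits, to_bits)


print(lut_fill_scale(10, 16))          # 64 * 64
hypothetical_10bit_seconds = 0.5       # placeholder value, not measured
print(extrapolate_seconds(hypothetical_10bit_seconds) / 60, "minutes")
```
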
TbtBI commented 4 years ago

Thanks for the info.