pinterf / mvtools

mvtools plugin for avisynth
155 stars 17 forks source link

Added different predictors proc and new AVX2 SAD #45

Open DTL2020 opened 3 years ago

DTL2020 commented 3 years ago

Added different predictors proc and new AVX2 SAD for ExhaustiveSearch

pinterf commented 3 years ago

Does this second commit replace the previous one, so are they doing the same just the new one is quicker/uses less temporary storage? Really needs human readable C implementation of these black boxes without any tricks inside.

DTL2020 commented 3 years ago

Greetings !

Todays commint it is my long awaiting first release of idea about full Esa search inside SIMD register file (at least it only reloads ref lines from cache at V-steps and not produce immediate write traffic - only 1 final output write result). It possibly perspective for future multicores chips because it creates much less write traffic (that cause cache coherence protocol transactions between cores I think). Though it uses a bit more SIMD AVX2 instructions. Also it uses much less intermetiate memory for immediate SADs of all search positions array and it is good step to the future of 'array of blocks vs ref plane' search (need only 1 SAD and 1 x,y memory output per block). Also it skips simple but may be still slow 2 loops search inside 5x5 (or 9x8 for r=4) immediate array for minsad.

About final output - I use m256 write instruction because do not found AVX2 for writing 64bits only. As I remember it may be penalty fo using SSE instruction for writing half of SSE-128 register after AVX2-256 - may I wrong.

I make it this morning and also debug with SDE - I go to a day work this evening and will have a time at work to test many already comittet tests tomorrow. It was an idea to write you a message about new Esa search functions update after more tests with real 6core AVX2 CPU for both speed and output MDegrainN results and encoding results. I will try to write more results tomorrow till end of day (in eupore).

About C-reference - as it is very simple I still not dot it as promise. I hope tomorrow. It all about: Old processing for(dy = -r; dy < +r; dy++) for(dx = -r; dx < +r; dx++) ArraySADs[dx, dy] = CalcSAD(dx, dy); } }

+make search for minsad and save minsad_x and minsad_y in C-loops

New method: minsad=max_sad_value; xcurrent=0, ycurrent=0; for(dy = -r; dy < +r; dy++) for(dx = -r; dx < +r; dx++) sad = CalcSAD(dx, dy); if (sad < minsad) { minsad=sad; x_minsad=xcurrent; y_minsad=xcurrent; } xcurrent+=1; } ycurrent+=1; }

save ymm_reg with x_minsad, y_minsad, minsad and parse.

On 10/19/21, Ferenc Pintér @.***> wrote:

Does this second commit replace the previous one, so are they doing the same just the new one is quicker/uses less temporary storage? Really needs human readable C implementation of these black boxes without any tricks inside.

-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/pinterf/mvtools/pull/45#issuecomment-946771703

pinterf commented 3 years ago

(Note: Until I upload an .editorconfig file (same folder as .gitignore), can you set your editors to use spaces (not tabs), indent size=2?) Or you can create your own with the following content, and Visual Studio will read and use it for the project:

root = true

# NOTE: editorconfig-core-c has a hardcoded limitation of 50 chars per .ini section name

[*.{cpp,h,hpp,c,asm}]
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
end_of_line = crlf
insert_final_newline = true

[*.md]
indent_style = space
trim_trailing_whitespace = false
pinterf commented 3 years ago

Integrated by cherry pick directly from your repo. Usually this is the way for codes which are not ready-to-merge. Howto: I'm using VS IDE. From developer command prompt:

git remote add DTLmvtools https://github.com/DTL2020/mvtools.git
git fetch DTLmvtools

then cherry picked from IDE and continued with the integration.

You'll need optsearchoption = 1 for MAnalyze and MRecalculate My test script resulted in 13,72 fps with, 12.99 fps without the option.

clip=ffms2(film)

ov=4
blksize = 8
blksizev = 8
optSearchOption = 0
chroma=false

sup = MSuper(clip, pel=2)
fv1 = MAnalyse(sup, isb=False, delta=1, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
bv1 = MAnalyse(sup, isb=True, delta=1, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
fv2 = MAnalyse(sup, isb=False, delta=2, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
bv2 = MAnalyse(sup, isb=True, delta=2, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
fv3 = MAnalyse(sup, isb=False, delta=3, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
bv3 = MAnalyse(sup, isb=True, delta=3, overlap=ov, blksize = blksize, blksizev = blksizev, optSearchOption = optSearchOption, chroma=chroma )
clip = MDegrain3(clip, sup, bv1, fv1, bv2, fv2, bv3, fv3, thSAD=1000)
clip
DTL2020 commented 3 years ago

I uploaded new commint with C-ref functions and some fixes to _avx2. Do it correct or I need to take (fork) latest repository and add new files to it and create new pull request ? About new full-SIMD minSAD search - unfortunately it still shows worst speed (in compare with ArrayofSADs and C-search of minsad). But I found the minSAD search is also some time-critical part of function so I try to made currently half-SIMD and half-C version. I commit it to separate branch and it need debug. So in the last commit to main branch these new functions removed. Addition: shame on me - I finally found already existed 'levels' param on MAnalyse() so will try to see how it is need to set to get nLevelsMax to 1 and 2.

DTL2020 commented 3 years ago

I delete and fork my repository again to have new copy with a places where to put C-ref functions and corrected _avx2. So need to open new pull-request with new commited changes or it will adds to this ?

DJATOM commented 3 years ago

You can reset your commits in SourceTree as shown on the screenshot (or via git cli, but I don't remember exact command) изображение Then force-push to your fork Then pull changes from main repo (via github interface) Then add your new changes I suggest to make new branches for pull requests, since it's not convenient to pullback your unaccepted commits every time.

DTL2020 commented 3 years ago

Oh - I see pinterf already merge my latest commit. The only comment (it looks it not works in web-interface of github ?) - line 803 of PlaneofBlocks_avx2.cpp - https://github.com/pinterf/mvtools/blob/0361d4bbd747f643a584323f8101a884786bebcd/Sources/PlaneOfBlocks_avx2.cpp#L802 It looks I or you forgot to add prefetch line: _mm_prefetch(const_cast<const CHAR>(reinterpret_cast<const CHAR>(pucRef + nRefPitch[0] * (i + 8))), _MM_HINT_NTA); // prefetch next Ref row

(same as now in line 114 in sp4 function).

I have more ideas for optimize of Esa_avx2 search (more compacting storage of current and ref data for sad calculation and using of 2 _mm256_sad_epu8() instructions instead of 4 for 8x8 block and may be more attempts to use SIMD minsad instruction instead of 25-points C-loop search). Will try to make tests if it help to speed to the end of Saturday 23.10.2021.

pinterf commented 3 years ago

The merge was full manual, since the commits and changes involved different files and coding conventions. Seems I missed that line. Now that big changes happen almost daily, I'm gonna leave a longer time until merge to have things tested and settle down.

DTL2020 commented 3 years ago

(Note: Until I upload an .editorconfig file (same folder as .gitignore), can you set your editors to use spaces (not tabs), indent size=2?) Or you can create your own with the following content, and Visual Studio will read and use it for the project:

root = true

# NOTE: editorconfig-core-c has a hardcoded limitation of 50 chars per .ini section name

[*.{cpp,h,hpp,c,asm}]
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
end_of_line = crlf
insert_final_newline = true

[*.md]
indent_style = space
trim_trailing_whitespace = false

I load all new commits and this new file. But after I commit new edits to separate branch - https://github.com/DTL2020/mvtools/commit/f23f587782d4e03d5b55cfbc113ede2e90014f14 It strangely changes PlaneOfBlocks.cpp file with almost all old deletions and all new additions. Is it correct and as expected ? I use VS2017 at home.

DTL2020 commented 3 years ago

Line 206 of PlaneofBlocks.cpp - used and instead of && https://github.com/pinterf/mvtools/blob/bbc2ac766abdf716d7191518383afeb448f1fc96/Sources/PlaneOfBlocks.cpp#L206 It confuses Intel C++ 19.1 compiler. But looks like works with VS2019.

Also found the main issue of the new _avx2 search functions - they work correctly only with nPel=1. Because Pel=2 uses different x,y of vector to memory pointer of start address of each ref block conversion (not contiguous even in Horizontal x-direction). In old functions ExpandingSearch (and in c_ref for _avx2) it is done by calling of GetRefBlock() for each ref block in the search. And current _avx2 functions load all columns of ref plane expecting it is contigous in memory.

So need 1 more check for use. For nPel=2 it looks also possible to make separate set of functions but it need completely different 'gathering' of input data or re-design pel=2 plane storage to be contigous in memory (?).

pinterf commented 3 years ago

aaaaa, only noticed for fifth re-read. So yes, there is 'and' instead of && :)
Shame on me, I'm parallelly working on a Delphi project :)

pinterf commented 3 years ago

I load all new commits and this new file. But after I commit new edits to separate branch - DTL2020@f23f587 It strangely changes PlaneOfBlocks.cpp file with almost all old deletions and all new additions. Is it correct and as expected ? I use VS2017 at home.

Dunno. Line endings??? CRLF vs LF?

DTL2020 commented 3 years ago

I make new commit to main branch - https://github.com/DTL2020/mvtools/commit/0f3e9d7eb4c2d9575f5b8ee871ff0c4a5e6fcc43 . Now the sp2 going to be faster and also for nPel=1 added sp1 using latest method of calculating sad and searching for minimum. Sp4 still old and sp3 need to be added to have all sp 1,2,3,4 filled for nPel=1 at least.

It looks like this commit not listed here. Do I need to close this and open new pull-request ? Also it looks editor again change something in all lines of files and github list almost all lines as changed - do not know what happens.

pinterf commented 3 years ago

I'm gonna then cherry-pick again. I've still got your repo as remote so I can easily access your commits.

DTL2020 commented 3 years ago

I load all new commits and this new file. But after I commit new edits to separate branch - DTL2020@f23f587 It strangely changes PlaneOfBlocks.cpp file with almost all old deletions and all new additions. Is it correct and as expected ? I use VS2017 at home.

Dunno. Line endings??? CRLF vs LF?

As I see it mostly about line beginnings - almost every line is shifted (to the left or right). May be it re-format all document to indent_style = space indent_size = 2 ?

pinterf commented 3 years ago

I hope that was the last time when I needed to cherry-pick. Commited a lot, including (hopefully) all your changes, uniformizing spaces and line endings. It was easy when I commited alone but now there must be order. I think you'll need to re-pull the repo.

DTL2020 commented 3 years ago

Yes - it looks most new is integrated.

Some known inconsistency between C-refs and _avx2 Esa search functions: If there are more 1 equal minsad value the C-ref return last found x,y pair and _avx2 looks like returns first (as the instruction _mm_minpos_epu16() internal logic). So if more equal result required - may be make C-ref reverse order scan (from botom right to top left ) ? Though it is rare enough situation and may be not any danger for MDegrainN because if 2 or more blocks have equal sad they possibly can be blended without distortions. But the some 'intelectual logic' may treat as different motion vectors. Also the old Expanding search may be make different result because of different scan order (from center of area to the edges) and _avx2 uses scan from top-left corner. So it also creates different result if 'workarea' contain >1 blocks with completely equal sad value. Also the Expanding Search from center uses 'cost of new minsad' compare at each search step (CheckMV()) and new _avx2 only at the end of full scan of area.

pinterf commented 3 years ago

Once I had a week investigating why different runs sometimes yield different output. (Debugging nightmare: we get a slightly different vectors for once in ten thousand frames, happening once in 10 runs) Well, when avstp is used, some search 'exit' logic depends on which thread (which group of lines of blocks) reaches a specific threshold sooner. The result is not 100% deterministic and cannot be used as a ground truth when I'd compare changes in the code. (I we'd had such unit test it would definitely fail sometimes). Look at my // MAnalyze mt-inconsistency reason comments in PlaneOfBlocks.cpp