wang-bin / mdk-braw

blackmagic raw plugin for libmdk https://github.com/wang-bin/mdk-sdk used by gyroflow
MIT License
1 stars 0 forks source link

Fix OpenCL pipeline #3

Closed AdrianEddy closed 10 months ago

AdrianEddy commented 10 months ago

This unfortunately requires 2 copies (GPU->GPU then GPU->CPU) but this is the only way I was able to find to make it work.

Additionally, when the decoding fails and seeking is in progress, it should still emit seekComplete, because ProcessComplete will never be called.

I recommend to enable "Hide whitespace" in the diff view to make it easier to review the changes

wang-bin commented 10 months ago

What about this patch(processedResCpu_ is not released)

diff --git a/BRawReader.cpp b/BRawReader.cpp
index b94fe57..358d858 100644
--- a/BRawReader.cpp
+++ b/BRawReader.cpp
@@ -89,7 +89,8 @@ private:
     ComPtr<IBlackmagicRawPipelineDevice> dev_;
     ComPtr<IBlackmagicRawResourceManager> resMgr_;
     ComPtr<IBlackmagicRawClip> clip_;
-    void* processedRes_ = nullptr; // cpu readable(gpu writable?) in copy mode
+    void* processedRes_ = nullptr; // cpu readable gpu writable in copy mode for UMA/NUMA
+    void* processedResCpu_ = nullptr; // cpu readable cpu writable in copy mode for NUMA
     BlackmagicRawResourceType processedType_ = 0;
     BlackmagicRawPipeline pipeline_ = blackmagicRawPipelineCPU; // set by user only
     BlackmagicRawInterop interop_ = blackmagicRawInteropNone;
@@ -595,21 +596,25 @@ void BRawReader::ProcessComplete(IBlackmagicRawJob* procJob, HRESULT result, IBl
             // iOS: -[MTLToolsResource validateCPUWriteable]:135: failed assertion `resourceOptions (0x20) specify MTLResourceStorageModePrivate, which is not CPU accessible.'
             MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, res, type, (void**)&imageData[0])); // metal can get host ptr?
 #endif
-            if (!imageData[0]) { // cuda, ocl
-                if (!processedRes_) {
+            if (!imageData[0]) { // cuda, ocl. it's a device local memory, not cpu readable
+                if (!processedRes_) { // Try to use UMA, create a cpu readable gpu writable resource, then direct access via cpu
                     processedType_ = type;
                     clog << "try CPU readable GPU writable memory for " << FOURCC_name(type) << endl;
                     MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteGPU, &processedRes_));
                     MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0])); // why host ptr is null?
-                    if (!imageData[0]) {
+                    if (!imageData[0]) { // NUMA, requires a stage buffer, which is gpu writable cpu readable
                         clog << "try CPU readable CPU writable memory for " << FOURCC_name(type) << endl;
-                        MS_WARN(resMgr_->ReleaseResource(context, cmdQueue, processedRes_, processedType_));
-                        MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteCPU, &processedRes_)); // processed image is on cpu readable memory?
+                        MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteCPU, &processedResCpu_)); // processed image is on cpu readable memory?
+                        MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedResCpu_, type, (void**)&imageData[0]));
                     }
                 }
-                MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0]));
-                if (imageData[0])
+                if (processedResCpu_) {
                     MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, processedRes_, type, processedResCpu_, blackmagicRawResourceTypeBufferCPU, sizeBytes, false));
+                } else {
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                }
+            } else { // UMA, resource is already cpu readable
             }
     // TODO: less copy via [MTLBuffer newBufferWithBytesNoCopy:length:options:deallocator:] from VideoFrame.buffer(0)
             //clog << FOURCC_name(type) << " processedRes_ " << processedRes_ << " res " << res << " imageData: " << (void*)imageData[0] << endl;

I guess for iGPU, we can use UMA optimization, get host pointer from gpu writable cpu readable memory. For dGPU, 2 steps are required, like d3d stage buffer for uploading/downloading a resource, this will also work for UMA, but higher gpu load. So you have to test all these cases if you have both iGPU and dGPU in your OS, via BRAW:device=deviceName, available device names can be seen in the log , device:

AdrianEddy commented 10 months ago

Did you test this? This doesn't seem to work for me with OpenCL pipeline. seems like imageData[0] is not updated when processedRes_ is already a valid pointer?

AdrianEddy commented 10 months ago
diff --git a/BRawReader.cpp b/BRawReader.cpp
index b94fe57..0069907 100644
--- a/BRawReader.cpp
+++ b/BRawReader.cpp
@@ -90,6 +90,7 @@ private:
     ComPtr<IBlackmagicRawResourceManager> resMgr_;
     ComPtr<IBlackmagicRawClip> clip_;
     void* processedRes_ = nullptr; // cpu readable(gpu writable?) in copy mode
+    void* processedResCpu_ = nullptr; // cpu buffer for OpenCL copy
     BlackmagicRawResourceType processedType_ = 0;
     BlackmagicRawPipeline pipeline_ = blackmagicRawPipelineCPU; // set by user only
     BlackmagicRawInterop interop_ = blackmagicRawInteropNone;
@@ -603,13 +604,16 @@ void BRawReader::ProcessComplete(IBlackmagicRawJob* procJob, HRESULT result, IBl
                     MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0])); // why host ptr is null?
                     if (!imageData[0]) {
                         clog << "try CPU readable CPU writable memory for " << FOURCC_name(type) << endl;
-                        MS_WARN(resMgr_->ReleaseResource(context, cmdQueue, processedRes_, processedType_));
-                        MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteCPU, &processedRes_)); // processed image is on cpu readable memory?
+                        processedResCpu_ = (void*) new uint8_t[sizeBytes];
                     }
                 }
-                MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0]));
-                if (imageData[0])
+                if (processedResCpu_) {
+                    imageData[0] = (const uint8_t*)processedResCpu_;
                     MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, processedRes_, type, processedResCpu_, blackmagicRawResourceTypeBufferCPU, sizeBytes, false));
+                } else {
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                }
             }
     // TODO: less copy via [MTLBuffer newBufferWithBytesNoCopy:length:options:deallocator:] from VideoFrame.buffer(0)
             //clog << FOURCC_name(type) << " processedRes_ " << processedRes_ << " res " << res << " imageData: " << (void*)imageData[0] << endl;

This works for me, but it's basically equivalent to what's in this PR

wang-bin commented 10 months ago

Not tested. Here is a new patch I've tested on macOS

diff --git a/BRawReader.cpp b/BRawReader.cpp
index b94fe57..75c84ab 100644
--- a/BRawReader.cpp
+++ b/BRawReader.cpp
@@ -89,7 +89,8 @@ private:
     ComPtr<IBlackmagicRawPipelineDevice> dev_;
     ComPtr<IBlackmagicRawResourceManager> resMgr_;
     ComPtr<IBlackmagicRawClip> clip_;
-    void* processedRes_ = nullptr; // cpu readable(gpu writable?) in copy mode
+    void* processedRes_ = nullptr; // cpu readable gpu writable in copy mode for UMA/NUMA
+    void* processedResCpu_ = nullptr; // cpu readable cpu writable in copy mode for NUMA
     BlackmagicRawResourceType processedType_ = 0;
     BlackmagicRawPipeline pipeline_ = blackmagicRawPipelineCPU; // set by user only
     BlackmagicRawInterop interop_ = blackmagicRawInteropNone;
@@ -447,7 +448,10 @@ bool BRawReader::unload()
         void* cmdQueue = nullptr;
         MS_WARN(dev_->GetPipeline(&pipeline, &context, &cmdQueue));
         MS_WARN(resMgr_->ReleaseResource(context, cmdQueue, processedRes_, processedType_));
+        if (processedResCpu_)
+            MS_WARN(resMgr_->ReleaseResource(context, cmdQueue, processedResCpu_, processedType_));
         processedRes_ = nullptr;
+        processedResCpu_ = nullptr;
     }
     loaded_.reset();
     codec_.Reset();
@@ -595,21 +599,27 @@ void BRawReader::ProcessComplete(IBlackmagicRawJob* procJob, HRESULT result, IBl
             // iOS: -[MTLToolsResource validateCPUWriteable]:135: failed assertion `resourceOptions (0x20) specify MTLResourceStorageModePrivate, which is not CPU accessible.'
             MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, res, type, (void**)&imageData[0])); // metal can get host ptr?
 #endif
-            if (!imageData[0]) { // cuda, ocl
-                if (!processedRes_) {
+            if (!imageData[0]) { // cuda, ocl. it's a device local memory, not cpu readable
+                if (!processedRes_) { // Try to use UMA, create a cpu readable gpu writable resource, then direct access via cpu
                     processedType_ = type;
                     clog << "try CPU readable GPU writable memory for " << FOURCC_name(type) << endl;
                     MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteGPU, &processedRes_));
                     MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0])); // why host ptr is null?
-                    if (!imageData[0]) {
+                    if (!imageData[0]) { // NUMA, requires a stage buffer, which is gpu writable cpu readable
                         clog << "try CPU readable CPU writable memory for " << FOURCC_name(type) << endl;
-                        MS_WARN(resMgr_->ReleaseResource(context, cmdQueue, processedRes_, processedType_));
-                        MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteCPU, &processedRes_)); // processed image is on cpu readable memory?
+                        MS_ENSURE(resMgr_->CreateResource(context, cmdQueue, sizeBytes, type, blackmagicRawResourceUsageReadCPUWriteCPU, &processedResCpu_)); // processed image is on cpu readable memory?
                     }
                 }
-                MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0]));
-                if (imageData[0])
+                if (processedResCpu_) {
+                    MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedResCpu_, type, (void**)&imageData[0]));
                     MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, processedRes_, type, processedResCpu_, type, sizeBytes, false));
+                } else {
+                    if (!imageData[0])
+                        MS_WARN(resMgr_->GetResourceHostPointer(context, cmdQueue, processedRes_, type, (void**)&imageData[0])); // why host ptr is null?
+                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
+                }
+            } else { // UMA, resource is already cpu readable
             }
     // TODO: less copy via [MTLBuffer newBufferWithBytesNoCopy:length:options:deallocator:] from VideoFrame.buffer(0)
             //clog << FOURCC_name(type) << " processedRes_ " << processedRes_ << " res " << res << " imageData: " << (void*)imageData[0] << endl;
AdrianEddy commented 10 months ago

Unfortunately this doesn't work with OpenCL either. The pointers are not null, but the contents are empty, just like with current code

AdrianEddy commented 10 months ago

I even dumped the pointer content to file after the second CopyResource, and the file is all NULLs

wang-bin commented 10 months ago

You patch works well on my amd, nv gpus, but does not work for intel hd graphics 520. I also tested a similar patch

                    frame.setBuffers(nullptr); // allocate
                    imageData[0] = (uint8_t *)frame.buffer(0)->constData();
                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, res, type, processedRes_, type, sizeBytes, false));
                    MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, processedRes_, type, (void*)imageData[0], type, sizeBytes, false));

reduce 1 copy, but sometimes a frame is blank, weird. broken for hd 520 too.

AdrianEddy commented 10 months ago

This works for me as well on NVIDIA GPU, but I had to use blackmagicRawResourceTypeBufferCPU in the second CopyResource

MS_ENSURE(resMgr_->CopyResource(context, cmdQueue, processedRes_, type, (void*)imageData[0], blackmagicRawResourceTypeBufferCPU, sizeBytes, false));

maybe there some alignment issue in your case when frame is blank? maybe width*height*4 != sizeBytes? I didn't see blank frames on my side

wang-bin commented 10 months ago

This works for me as well on NVIDIA GPU, but I had to use blackmagicRawResourceTypeBufferCPU in the second CopyResource

Yes, it's blackmagicRawResourceTypeBufferCPU

maybe there some alignment issue in your case when frame is blank? maybe widthheight4 != sizeBytes?

not alignment. usually the frame has content. I repeat stepping forward and backward 1 frame, sometimes the frame becomes blank. Also appears when playing without seek. nv gpu is always fine for me. intel and amd gpu is broken

AdrianEddy commented 10 months ago

hmm, weird Maybe some issue in the SDK? Or some threading issue? In any case, it's still better than no OpenCL at all