Closed · 311-code closed this 8 months ago
One other issue: I am using Virtual Desktop and the Quest 3. When I choose the "6dof" option there is no positional tracking. In addition, there is no option to move the model closer in the Z direction that I can see; when I shrink the scale it is pretty far away. I am testing a Marigold-converted Stable Diffusion image.
Edit: It turns out there is VR positional tracking with 6dof checked; it just isn't noticeable because the scale of the mesh object is too large by default for VR, and the WebXR camera in Unity is set to -150 on the Z axis. A workaround is below.
Thanks for letting me know. I'm trying it and it gives me good results. It's interesting that they incorporated Stable Diffusion with this. I think their `run.py` can be modified to fit `depth.py` (à la this one). I will see if it can be done.
Regarding the "6dof" option: it's actually not intended for positional tracking; the button was just left in for debugging purposes after the fixed-position camera was made the default. (dffac4d) Sorry for the misleading term.
> there is no option to move the model closer in the Z direction that I can see,
I guess you're talking about changing the mesh's height. I didn't implement one, since when the camera is fixed (that is, the "6dof" toggle is off) the mesh is always at the center of the camera. (Side note: there remains a `MeshVer` slider on the top right, but it is to be removed since it is incompatible with `ProjRatio`.)
> when I shrink the scale it is pretty far away.
Sorry, I don't understand this one.
Nice, got it working on my end via `depth.py` 👍 It's a lot more detailed than other models, though grainier (which might exacerbate pixel warping): https://github.com/parkchamchi/DepthViewer/issues/6#issuecomment-1801967525 Though is it supposed to take ~20 seconds per frame, even on CUDA?
It didn't work via `ffpymq.py` though: it seems to process the frame, but then gets stuck at `Sending...` and the GUI says `Not connected...`:
By the way, I updated my converter scripts: DepthViewer converter+player shortcuts.zip
On a side note, the Python script prints a warning when `-i` is missing, so would it be possible to automatically treat the input as an image? That'd save me from having to create separate scripts for image and video files. 😅
> Though is it supposed to take ~20 seconds per frame, even on CUDA?

It appears to be really computation-heavy, since it uses Stable Diffusion. Judging by the fact that omitting `--optimize` (that is, not using half-precision floating points) takes an eternity to process, I guess this is a VRAM issue. The performance/accuracy tradeoff parameters seem to be `denoise_steps` and `ensemble_size`, which are 10 each. I made a cmd arg to set them: try `python depth.py -r mari -t Bingxin/Marigold --optimize --aux_args den_s=9,ens_s=8` to set them to `(9, 8)`. (BTW the dummy data passing at the beginning seems not to be needed for `-r mari`.)
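Since the `--aux_args` value is just a comma-separated list of `key=value` pairs, the parsing presumably boils down to something like this sketch (illustrative only, not the actual code in `depth.py`):

```python
def parse_aux_args(aux_args: str) -> dict:
    """Parse a comma-separated key=value string like "den_s=9,ens_s=8" into ints."""
    parsed = {}
    for pair in aux_args.split(","):
        if not pair:
            continue  # tolerate a trailing comma
        key, _, value = pair.partition("=")
        parsed[key.strip()] = int(value)
    return parsed

print(parse_aux_args("den_s=9,ens_s=8"))  # {'den_s': 9, 'ens_s': 8}
```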
> It didn't work via `ffpymq.py` though
It's the timeout/fault-tolerance issue on the Unity side, which I set arbitrarily before. Use `set_zmq_fail_tol 1000000` (just put it in `initcmd.txt`) to keep the program from giving up.
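Conceptually, the fail tolerance just counts consecutive receive failures before the client stops retrying. A hedged Python sketch of that idea (the real logic is on the Unity C# side; the names here are made up):

```python
def recv_with_tolerance(recv, fail_tol=1000000):
    """Keep retrying a receive that can time out, giving up only after
    fail_tol consecutive failures (a hypothetical model of set_zmq_fail_tol)."""
    fails = 0
    while True:
        try:
            return recv()
        except TimeoutError:
            fails += 1
            if fails >= fail_tol:
                raise  # only now does the client give up
```

With `fail_tol` set that high, a slow model like Marigold can take minutes per frame and the client still keeps waiting.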
> automatically treat the input as an image

I've put in the arg `--detect_img_exts`.
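Extension-based detection presumably amounts to something like this (an illustrative sketch; the actual extension set used by the script may differ):

```python
import os

# hypothetical extension set; the script's actual list may differ
IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def is_image_input(path: str) -> bool:
    """Treat the input as an image if its extension is a known image type."""
    return os.path.splitext(path)[1].lower() in IMG_EXTS

print(is_image_input("marigold_test.PNG"))  # True
print(is_image_input("clip.mp4"))           # False
```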
To clarify, I meant moving the object closer or farther away when I said Z axis. For example, if I load in someone's head, the scale is very large, and when I shrink it, it goes about 10 feet away from me (Edit: never mind, it's an illusion; it's already far away, and shrinking just shrinks a building-sized mesh that is actually already 150 meters from the VR camera) and I can't seem to move it closer. `CamDistL` usually turns it black. I just realized the app actually does have 6DoF for the headset when the 6dof setting is on; it's just that things are so large it's hard to tell, and shrinking the scale puts it so far away that it's also hard to tell it has 6DoF head tracking.
For now I have just been using Blender with a 2D plane subdivided by 100 and applying a displacement map to it per this video https://youtu.be/tBLk4roDTCQ?si=9sK_g6cQkVoU6Mu_&t=106, adding the texture, then using Blender's VR mode to view things in VR. I was really hoping to get VR side-by-side depth video going, or even 1-frame-per-10-seconds video with Marigold, lol.
I am running images through a ComfyUI workflow to convert them with Marigold for now, as it's very fast, and I'm also converting original Stable Diffusion images that way. The only downside is that it takes about 3 minutes every time I update an image in ComfyUI and then update it in Blender. I am using kijai's ComfyUI Marigold conversion here: https://github.com/kijai/ComfyUI-Marigold (with this workflow you can adjust the depth map in real time with ComfyUI's auto-queue setting).
I was hoping your app could do this magic in a streamlined way. The positional tracking is pretty huge for immersion, but it's very unnoticeable due to the scale and the inability to move around the object in the space by default.
I just made an AI-generated person a bit ago, put them in Blender with a green background, then put them in my room with the Quest 3 using Virtual Desktop passthrough mode, and it feels like this AI-generated person is in the room. It's very interesting, though only about 160 degrees of them is rendered properly, of course. The idea is to do that in this app instead, to save time.
To take it a step further, it would be amazing if we could have a prompt window for typing in VR that used automatic1111's `--api` flag, so you could make AI images directly from your app.
> It appears to be really computation-heavy, since it uses Stable Diffusion. Judging by the fact that omitting `--optimize` (that is, not using half-precision floating points) takes an eternity to process, I guess this is a VRAM issue.
Oh dang, it does seem to use up all my 11GB of VRAM.
> The performance/accuracy tradeoff parameters seem to be `denoise_steps` and `ensemble_size`, which are 10 each. I made a cmd arg to set them: try `python depth.py -r mari -t Bingxin/Marigold --optimize --aux_args den_s=9,ens_s=8` to set them to `(9, 8)`.
Here's without `--optimize`, with it, and with both `--optimize` and `--aux_args den_s=9,ens_s=8`:
It just improves performance a bit, plus I wouldn't want to lower the denoising any more than it is, so I guess I'll just use Marigold for images and less intensive models for video, unless there's another way to improve performance 🤔
> It's the timeout/fault-tolerance issue on the Unity side, which I set arbitrarily before. Use `set_zmq_fail_tol 1000000` (just put it in `initcmd.txt`) to keep the program from giving up.
I tried that but it still got stuck even after several minutes
> I've put in the arg `--detect_img_exts`.
Thanks, it works like a charm! 👍
> To clarify, I meant moving the object closer or farther away when I said Z axis. I meant if I load in someone's head, for example, the scale is very large, and when I shrink it, it goes about 10 feet away from me and I can't seem to move it closer.
Oh I see. The `ScaleR` was not intended to change the distance from the camera (which is the domain of `CamDist`), but for some reason it also moves the mesh farther from the camera. I didn't notice this because I usually don't touch this slider much. Maybe the origin of the mesh's transform could be the problem. I'll look into this. BTW, it'd be helpful if you could tell me the `ScaleR` value you usually use.
> CamdistL turns it black.
I put in code to make it simply disappear (actually, move it) for a second, since seeing it change hurt my eyes; is that what you mean? (If not, the near plane of the camera could be the issue.)
> I tried that but it still got stuck even after several minutes
What does the server say? The Unity program was not intended for this level of time-consuming inference, so it'll keep printing warnings, but in my case the server delivers in the end.
For anyone reading who wants to check this program out in VR (to convert monoscopic images and video to 3D): there is a workaround for the VR scale being too large, for positional tracking being unnoticeable after turning on the 6dof option (required for VR positional tracking), and for the problem of objects being too far away when you shrink the scale. (The WebXR camera in Unity is at -150 on the Z axis from the mesh, and the mesh is as tall as a building.)
I would recommend anyone new reading this to check out some of these posts by threedeejay here, and the repo main page instructions on downloading other depth models; they are pretty amazing for monoscopic video-to-3D mesh conversion compared to the default MiDaS model. I personally really like the accuracy of BEiT, and it feels a lot like Marigold, though it's a bit slower than DPT Hybrid and MiDaS were. Marigold was pretty slow in this implementation versus creating a depth map in kijai's ComfyUI workflow (10 seconds on a 4090 there), but it's very cool for viewing images for now.
For the dev:
This program seems to need some way to move the mesh forward and backward in actual space coordinates on one more axis, because even on desktop, if you shrink the scale you can't move closer to it. It already has X- and Y-axis movement, but it needs Z movement, and with these new models the rotation fade should also be removed.
I am guessing the VR camera is far away to accommodate the location of the desktop GUI and camera. Both cameras really need to be moved closer to the mesh (and the mesh scale shrunk, maybe to 0.1, I'm guessing?) and the GUI shrunk, but I can't get the project to work due to onnxruntime DLL errors.
Motion controller support for grabbing the object, once it is finally up close and at the correct scale, would be much more ideal, as would adjusting `ScaleR` by pinching the hands in and out (sort of like Blender's VR mode does). The Spacegrab hack is sort of doing this for me for now, luckily.
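For what it's worth, the pinch-to-scale math itself is tiny; a hypothetical sketch (DepthViewer has no such binding today, and the names here are made up):

```python
def pinch_scale(prev_dist: float, cur_dist: float, cur_scale: float) -> float:
    """Scale the mesh proportionally to the change in distance between
    the two controllers, like Blender's VR grab-to-scale gesture."""
    if prev_dist <= 0:
        return cur_scale  # avoid dividing by zero on the first frame
    return cur_scale * (cur_dist / prev_dist)

print(pinch_scale(0.2, 0.4, 1.0))  # hands moved twice as far apart -> scale doubles
```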
Anyways, this program is really cutting-edge IMO, and with a few fixes and additions in Unity I think it could be really useful for greatly enhanced PCVR video and monoscopic-to-VR image viewing, or for viewing AI images in VR with locally installed automatic1111 API access and a built-in prompting GUI, which I am trying to add. Though I'm still having .dll errors. Have fun!
https://github.com/parkchamchi/DepthViewer/assets/23625562/ab7875ab-1772-46e3-81f6-dc5b47cc2549
The app is almost working well for VR / mixed reality. I say almost because of all the tinkering required at the moment. This was made with DepthViewer and kijai's stablezero123 360-degree Stable Diffusion workflow (a random AI-generated person I made real fast, in .mp4 format). I still have to use OpenVR Advanced Settings to move the playspace to get the scale right, like you see here. In addition, it would be nice to choose other background colors in the app, as a black background doesn't really work that great in VD for passthrough; it would be better if I could set a green background color in DepthViewer. I probably should have used the BEiT model; this was DPT.
My idea is to just be able to type something in a text box below the model, seed, settings, etc. and generate something new with the automatic1111 API, then also move the object around or scale it with the controllers. I still can't get the Unity project working though, due to the .dll errors, and I have been at it for many hours now.
> What does the server say? The Unity program was not intended for this level of time-consuming inference, so it'll keep printing warnings, but in my case the server delivers in the end.
@parkchamchi Basically the same as the second log here https://github.com/parkchamchi/DepthViewer/issues/9#issuecomment-1888252761:
https://github.com/parkchamchi/DepthViewer/assets/23625562/828bbe9c-b652-4f8b-ac9c-3feb5ebaef4b
Sorry for the continuous posting, I'm just kind of excited about this program. But just updating people, if anyone is reading: I made another video, this time with the BEiT model and Stable Diffusion video generation of a random AI person I just generated (who doesn't exist; not Dreambooth, but that is next).
You can see at the beginning of the video the offset settings in OpenVR Advanced Settings I have to set to move the scaled-down model back into the room.
I had a ton of stuff running on my PC so it was a bit laggy, but I found BEiT pretty good in general on a 4090 and decently accurate (not Marigold-level for side profiles, though). You can also see through the hair and in other places because of the black background, so an option to choose background colors (blue, green, any color) would be ideal, as I mentioned before. In addition, you can see that when it rotates it auto-fades in the video, which is kind of not what I wanted to happen ;)
Here is the original image I made and converted with DepthViewer. Great program with lots of potential here, btw.
Okay, last one. I finally got Marigold working in the actual GUI instead of the command line, by using threedeejay's updated script he sent me on Discord directly; here is the link:
Convert_to_3D.zip. You run the script, then press 0 for Marigold; when it's done loading, manually launch DepthViewer, then press ~ for the console and type `zmq_id 5556`.
Edit: It also appears there is a `--half_precision` flag you can add after `--optimize` that will help with the VRAM issues, if you are having them with this script. Also, if you edit run.py (or marirunner.py, I forget which I did) you can adjust the batch size at the beginning of the script from 0, which seems to autodetect resources in not a great way; I'm having some luck in the batch size 4-6 range and it doesn't crash anymore. The resolution setting in run.py (default=768, "Maximum resolution of processing. 0 for using input image resolution.") may also help if lowered.
Once DepthViewer starts, in the console: `set_zmq_fail_tol 1000000`. I would still get server failures when loading an image, so I also had to do `set_zmq_timeout 1000000`, then `zmq_id 5556`. Now I am able to load images; the side profile of faces definitely looks better with Marigold than even BEiT for monoscopic conversion. If it doesn't work or says not connected, do `zmq_id 5556` again along with those settings.
Edit: Here is the BEiT side profile for comparison. I had to use different alpha, beta, and depthmultirl settings, because it was very flat at the same settings for some reason. You can also see the fading with rotation here. ;)
TikTok just published Depth Anything; it looks really good (better than Marigold?). They compare it to the "previous best, MiDaS 3.1 BEiT": https://github.com/LiheYoung/Depth-Anything?tab=readme-ov-file
@brentjohnston You beat me to it :) Looks so much better for video coherence.
Trying my best to get the Depth Anything model going. Here are the ONNX files (thanks threedeejay), but in play mode, when I test-load the model in the options menu, Unity gives this error in the console:
**Invalid dimensions for RenderTexture**
UnityEngine.Debug:LogError (object)
OnnxRuntimeDepthModel:.ctor (string,string,string,int,string) (at Assets/Scripts/DepthModels/OnnxRuntimeDepthModelBehavior.cs:96)
DepthModelBehavior:GetDepthModel (string,string,bool) (at Assets/Scripts/DepthModels/DepthModelBehavior.cs:76)
MainBehavior:LoadModel (string,bool) (at Assets/Scripts/MainBehavior.cs:522)
OptionsBehavior:LoadModel () (at Assets/Scripts/UI/OptionsBehavior.cs:130)
UnityEngine.EventSystems.EventSystem:Update () (at Library/PackageCache/com.unity.ugui@1.0.0/Runtime/EventSystem/EventSystem.cs:501)
The other models like BEiT still work fine, of course. I tried 512x512 and 518x518 because I read that somewhere, and also tried resizing the input image. I tried to use Jupyter to compare against the models that work, like BEiT / DPT Hybrid 384, using an edited compare_onnx.ipynb, but had dependency errors in Jupyter I couldn't get past. I'm having trouble finding the model parameters this model requires to make this work, or it could be some other problem.
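One thing worth checking: Depth Anything's ViT encoder works on 14-pixel patches, so (if the ONNX export keeps that constraint) the input width and height must be multiples of 14; 518 = 14 × 37 is its default size, which could explain why 512x512 fails. A small sketch of snapping an arbitrary resolution to valid dimensions:

```python
def fit_to_patch(width: int, height: int, patch: int = 14) -> tuple:
    """Round each dimension to the nearest multiple of the ViT patch size."""
    snap = lambda v: max(patch, round(v / patch) * patch)
    return (snap(width), snap(height))

print(fit_to_patch(512, 512))  # (518, 518)
```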
I attempted modifying the OnnxRuntimeDepthModelBehavior.cs file; below is as far as I got, and it could be all wrong, because I'm not sure if some of the other scripts tie into all of this. (Scroll down to // Adjust dimensions for the depth-anything model.)
#if UNITY_STANDALONE_WIN || UNITY_EDITOR_WIN
//#define USING_ONNX_RUNTIME
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using UnityEngine;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
public class OnnxRuntimeDepthModel : DepthModel {
public string ModelType { get; private set; }
private InferenceSession _infsession;
private int _width, _height;
private string _inputname;
private int _outwidth, _outheight;
private RenderTexture _rt;
private float[] _output;
public OnnxRuntimeDepthModel(string onnxpath, string modelType, string provider=null, int gpuid=0, string settings=null) {
ModelType = modelType;
if (settings == null) settings = "";
if (provider == null) provider = "default";
Debug.Log($"OnnxRuntimeDepthModel(): using the provider {provider}");
SessionOptions sessionOptions = new SessionOptions();
switch (provider.ToLower()) {
case "":
case "default":
Debug.Log("OnnxRuntime may not use GPU. Try other GPU execution provider.");
break;
case "cuda":
Debug.Log($"Using gpuid={gpuid}");
sessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(gpuid);
break;
case "openvino":
Debug.Log($"settings (default empty string): \"{settings}\"");
sessionOptions.AppendExecutionProvider_OpenVINO(settings);
break;
case "directml": //Not tested
Debug.Log($"Using gpuid={gpuid}");
sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
sessionOptions.AppendExecutionProvider_DML(gpuid);
break;
case "tvm": //Not tested
Debug.Log($"settings: \"{settings}\"");
sessionOptions = SessionOptions.MakeSessionOptionWithTvmProvider(settings);
break;
case "rocm": //Not tested
Debug.Log($"Using gpuid={gpuid}");
sessionOptions = SessionOptions.MakeSessionOptionWithRocmProvider(gpuid);
break;
default:
Debug.LogError($"Unknown provider: {provider}");
break;
}
try {
_infsession = new InferenceSession(onnxpath, sessionOptions);
}
catch (OnnxRuntimeException exc) {
Debug.LogWarning($"OnnxRuntimeException, provider: {provider} => {exc}");
throw new InvalidOperationException();
}
foreach (KeyValuePair<string, NodeMetadata> item in _infsession.InputMetadata) {
_inputname = item.Key;
_height = item.Value.Dimensions[2]; // NCHW input: dimensions are [1, 3, H, W]
_width = item.Value.Dimensions[3];
} //only 1
foreach (KeyValuePair<string, NodeMetadata> item in _infsession.OutputMetadata) {
_outwidth = item.Value.Dimensions[1];
_outheight = item.Value.Dimensions[2];
} //only 1
// Depth Anything is exported with dynamic spatial axes, so the dimensions
// read above can be -1, which is what triggers the
// "Invalid dimensions for RenderTexture" error below.
// Its ViT encoder works on 14px patches; 518x518 (14*37) is its default input size.
if (modelType == "depth-anything" && (_width <= 0 || _height <= 0)) {
_width = 518;
_height = 518;
}
// Ensure that the dimensions are valid
if (_width <= 0 || _height <= 0) {
Debug.LogError("Invalid dimensions for RenderTexture");
return;
}
_rt = new RenderTexture(_width, _height, 16);
}
public Depth Run(Texture inputTexture) {
// Check if the RenderTexture is valid...
if (_rt == null || _rt.width <= 0 || _rt.height <= 0) {
Debug.LogError("RenderTexture not initialized or has invalid dimensions");
return null;
}
int w = _width;
int h = _height;
if (w != _rt.width || h != _rt.height) {
_rt.Release();
_rt = new RenderTexture(w, h, 16);
}
int length = w * h;
Graphics.Blit(inputTexture, _rt);
Texture2D tex = new Texture2D(w, h);
RenderTexture.active = _rt;
tex.ReadPixels(new Rect(0, 0, w, h), 0, 0);
RenderTexture.active = null;
var rawdata = tex.GetRawTextureData();
UnityEngine.GameObject.Destroy(tex); // destroy the temp texture only after reading its data
float[] rfloats = new float[length];
float[] gfloats = new float[length];
float[] bfloats = new float[length];
for (int i = 0; i < length; i++) {
int row = h - (i / w) - 1;
int col = (i % w);
int k = row * w + col;
rfloats[k] = (float)rawdata[i * 4 + 0] / 255;
gfloats[k] = (float)rawdata[i * 4 + 1] / 255;
bfloats[k] = (float)rawdata[i * 4 + 2] / 255;
// Alpha channel is ignored
}
var dimensions = new ReadOnlySpan<int>(new []{1, 3, h, w});
var t1 = new DenseTensor<float>(dimensions);
for (var j = 0; j < h; j++) {
for (var i = 0; i < w; i++) {
var index = j * w + i;
t1[0, 0, j, i] = rfloats[index];
t1[0, 1, j, i] = gfloats[index];
t1[0, 2, j, i] = bfloats[index];
}
}
var inputs = new List<NamedOnnxValue>() {
NamedOnnxValue.CreateFromTensor<float>(_inputname, t1)
};
using var results = _infsession?.Run(inputs); // "using" disposes; no explicit Dispose() needed
float[] output = results?.First().AsEnumerable<float>().ToArray();
float max = output.Max();
float min = output.Min();
for (int i = 0; i < output.Length; i++)
output[i] = (output[i] - min) / (max - min);
return new Depth(output, _outwidth, _outheight);
}
public void Dispose() {
_infsession?.Dispose();
_infsession = null;
_rt?.Release();
_rt = null;
}
public void PrintMetadata() {
foreach (var mItem in new Dictionary<string, IReadOnlyDictionary<string, NodeMetadata>> {{"InputMetadata", _infsession.InputMetadata}, {"OutputMetadata", _infsession.OutputMetadata}}) {
Debug.Log($"************************{mItem.Key}");
foreach (KeyValuePair<string, NodeMetadata> item in mItem.Value) {
Debug.Log("+++++" + item.Key + ": ");
var v = item.Value;
Debug.Log($"Dimensions:{v.Dimensions}");
Debug.Log($"Dimensions.Length:{v.Dimensions.Length}");
foreach (var e in v.Dimensions) Debug.Log(e);
Debug.Log($"ElementType:{v.ElementType}");
Debug.Log($"IsTensor:{v.IsTensor}");
Debug.Log($"OnnxValueType:{v.OnnxValueType}");
Debug.Log($"SymbolicDimensions:{v.SymbolicDimensions }");
Debug.Log($"SymbolicDimensions.Length:{v.SymbolicDimensions.Length}");
foreach (var e in v.SymbolicDimensions) Debug.Log(e);
}
}
Debug.Log("************************MODELMETADATA");
var mm = _infsession.ModelMetadata;
foreach (KeyValuePair<string, string> item in mm.CustomMetadataMap)
Debug.Log(item.Key + ": " + item.Value);
Debug.Log($"Description:{mm.Description}");
Debug.Log($"Domain:{mm.Domain}");
Debug.Log($"GraphDescription:{mm.GraphDescription}");
Debug.Log($"GraphName:{mm.GraphName}");
Debug.Log($"ProducerName:{mm.ProducerName}");
Debug.Log($"Version:{mm.Version}");
}
}
#else
public class OnnxRuntimeDepthModel : DepthModel {
public OnnxRuntimeDepthModel(string s1, string s2, int i1) {
Debug.LogError("This should not be shown.");
}
}
#endif
@brentjohnston Did it work? Anyway please continue this at #14.
A new game-changing depth estimation model called Marigold has been released. I have been using it lately and it's very good. It would be awesome to see it implemented into this app for viewing things in VR. `git clone https://huggingface.co/Bingxin/Marigold`; also https://github.com/prs-eth/Marigold