prs-eth / Marigold

[CVPR 2024 - Oral, Best Paper Award Candidate] Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
https://marigoldmonodepth.github.io
Apache License 2.0

noisy inference, any idea why ? #106

Closed YacineDeghaies closed 2 months ago

YacineDeghaies commented 2 months ago

After training the model on my own dataset and replacing the unet/ folder, inference is very slow. Is there a way to speed it up?

For training, I used stable-diffusion-2.

The inferred images also look very noisy. Any idea why?

Thank you!

Inferred image: output_image

Input image: shot_0003_source_0000

Ground-truth depth map: shot_0003_dm_gt_s_0000

The inference setting:


#!/usr/bin/env bash
set -e
set -x

# Use specified checkpoint path, otherwise, default value
ckpt=${1:-"checkpoint/stable-diffusion-2"}
subfolder=${2:-"eval"}
BASE_DATA_DIR="/vol/fob-vol3/mi20/deghaisa/code/"

python infer.py \
    --checkpoint "$ckpt" \
    --seed 1234 \
    --base_data_dir "$BASE_DATA_DIR" \
    --denoise_steps 50 \
    --ensemble_size 10 \
    --processing_res 0 \
    --dataset_config config/dataset/data_dsb_test.yaml \
    --output_dir "output/${subfolder}/dsb_test/prediction"

markkua commented 2 months ago

Hi, you can use a smaller ensemble size and fewer steps, for example 10 steps with ensemble_size = 1.

YacineDeghaies commented 2 months ago

> Hi, you can use a smaller ensemble size and fewer steps, for example 10 steps with ensemble_size = 1.

By fewer steps, do you mean --denoise_steps 10?

markkua commented 2 months ago

yes
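
For reference, the faster configuration suggested above could look roughly like the sketch below. It reuses the flags and paths from the original script and changes only --denoise_steps and --ensemble_size. The `run` helper and the `DRY_RUN` variable are hypothetical additions for illustration: by default the script only prints the command, so it can be checked before launching a real run.

```bash
#!/usr/bin/env bash
set -e

# Same defaults as the original inference script
ckpt=${1:-"checkpoint/stable-diffusion-2"}
subfolder=${2:-"eval"}
BASE_DATA_DIR="/vol/fob-vol3/mi20/deghaisa/code/"

# Reduced settings suggested in this thread (originally 50 and 10)
denoise_steps=10
ensemble_size=1

# Hypothetical helper: print the command unless DRY_RUN=0 is set
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run python infer.py \
    --checkpoint "$ckpt" \
    --seed 1234 \
    --base_data_dir "$BASE_DATA_DIR" \
    --denoise_steps "$denoise_steps" \
    --ensemble_size "$ensemble_size" \
    --processing_res 0 \
    --dataset_config config/dataset/data_dsb_test.yaml \
    --output_dir "output/${subfolder}/dsb_test/prediction"
```

Setting DRY_RUN=0 in the environment executes the command instead of printing it. Compared with the original settings, this performs 10 denoising steps for a single prediction instead of 50 steps for each of 10 ensemble members.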

YacineDeghaies commented 2 months ago

My depth maps are single-channel, 8-bit relative depth maps. Could this also be a factor? I've just uploaded an example of my ground-truth depth maps so you can see the difference. ^^

markkua commented 2 months ago

If you are talking about the noise in your predictions, I recommend trying it with our checkpoint first. Also consider whether your training data is diverse enough and whether the model was trained for enough steps. 8-bit depth maps are less precise than regular depth datasets, since they can only represent 256 distinct depth levels. This could be one reason.
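
To put the 8-bit point in numbers, here is a small sketch comparing the quantization of 8-bit and 16-bit depth maps. It assumes depth is normalized to the range [0, 1], which is an assumption made for illustration and not something stated in the thread. The function names are hypothetical.

```bash
#!/usr/bin/env bash

# Number of distinct values an integer depth map with the given bit depth can store
levels_for_bits() {
    echo $(( 1 << $1 ))
}

# Smallest representable depth difference, assuming depth normalized to [0, 1]
step_for_bits() {
    awk -v n="$(levels_for_bits "$1")" 'BEGIN { printf "%.6f", 1 / (n - 1) }'
}

for bits in 8 16; do
    echo "${bits}-bit: $(levels_for_bits "$bits") levels, step = $(step_for_bits "$bits")"
done
```

An 8-bit map has 256 levels against 65,536 for a 16-bit map, so smooth depth gradients in 8-bit ground truth turn into visible steps that the model may learn to reproduce.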