mattyamonaca / PBRemTools

Precise background remover
MIT License

Explanation of parameters? #18

Open nonnull-ca opened 1 year ago

nonnull-ca commented 1 year ago

Here's what I've grokked from a combination of playing around and reading the source code. Is this roughly accurate?

Segment Anything & CLIP

sam

1. enabled

This is a misnomer. If this is enabled, Segment Anything & CLIP are both enabled.

If this is disabled, it segments the image using anime-seg instead, and doesn't do anything CLIP-like.

2. Model

s == small, l == large, h == huge? I kind of wish they had chosen 'xl' instead, as that's a more common term. If you don't have anything here, you need to download the models and put them in the appropriate folder (see readme).

3. segmentation prompt

Beware that this does not do anything unless both a) Segment Anything & CLIP are enabled, and b) either tile division BG remover or cascadePSP is enabled.

This can be blank, in which case in practice it selects regions of the image that are clearly recognizable (typically foreground). See below.

This can also be a prompt that CLIP can grok. You can flip to img2img and use the 'interrogate CLIP' button on an image to get an idea of the sorts of prompts that CLIP generates, though I've been getting the best results with extremely short prompts ("person", for instance).

4. predicted_iou_threshold

5. stability_score_threshold

6. clip_threshold

These all only affect Segment Anything & CLIP.

It's easier to describe these together. Looking at the code, roughly speaking what happens is:

a. Segment Anything is run to segment the image into a bunch of regions.
b. Segment Anything also returns some metadata about said regions, including a couple of different measures of 'quality' of regions. Among these measures are predicted_iou and stability_score.
c. Any region that has a predicted_iou lower than predicted_iou_threshold, or a stability_score lower than stability_score_threshold, is filtered out and ignored.
d. If there is a segmentation prompt, CLIP is run on all remaining regions, and any region that CLIP thinks has a similarity of less than clip_threshold to the segmentation prompt is filtered out and ignored.
e. The union of all remaining regions is taken as the resulting mask.
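The steps above can be sketched in a few lines of Python. This is only my reading of the flow, not the extension's actual code: `combine_regions` and the `clip_similarity` callback are hypothetical names, and the dictionary keys are assumed to mirror the ones Segment Anything's automatic mask generator returns (`segmentation`, `predicted_iou`, `stability_score`).

```python
import numpy as np

def combine_regions(regions, image_shape, predicted_iou_threshold,
                    stability_score_threshold, clip_threshold,
                    prompt="", clip_similarity=None):
    """Filter regions by quality and (optionally) CLIP similarity, then union them.

    `regions` is a list of dicts with a boolean `segmentation` array plus
    `predicted_iou` and `stability_score` floats. `clip_similarity` is a
    hypothetical callback (region, prompt) -> float standing in for CLIP.
    """
    mask = np.zeros(image_shape, dtype=bool)
    for region in regions:
        # Step c: drop regions below either quality threshold.
        if region["predicted_iou"] < predicted_iou_threshold:
            continue
        if region["stability_score"] < stability_score_threshold:
            continue
        # Step d: only consult CLIP when a prompt was given.
        if prompt and clip_similarity is not None:
            if clip_similarity(region, prompt) < clip_threshold:
                continue
        # Step e: the final mask is the union of the surviving regions.
        mask |= region["segmentation"]
    return mask
```

With an empty prompt the CLIP step is skipped entirely, which matches the observation that a blank prompt simply keeps whatever regions pass the two quality thresholds.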

So all told:

a. First run with no segmentation prompt.
b. Adjust predicted_iou_threshold and/or stability_score_threshold down until you include the entire subject, and then back up until you are excluding as much of the rest of the image as possible.
c. Re-add the segmentation prompt.
d. Adjust clip_threshold down until you are including the entire subject, then back up until you are excluding as much of the rest of the image as possible.
e. Hopefully you're done. If you get bad results, you need to tweak the segmentation prompt and retry from d.

tile division BG Removers

tdbgr

This is an approach to refine an existing mask, using the mask and the input image.

7. enabled

If enabled, td-abg will be run at the end. Yes, this is out of order. If disabled, it will not be run.

The rest of this is covered fairly well here.

General tuning:

  1. The more divisions you have, the more likely it is that you'll get noise just from random chance. But with too few divisions you'll get poor results: if part A of the subject shares similar colors with part B of the background, you don't want them in the same cell.
  2. The more clusters you have, the more likely it is that you'll get noise just from random chance. But with too few clusters you'll get blobs from parts of the subject getting lumped together with parts of the background.
  3. The alpha threshold is the only setting that can shrink the mask. Everything else expands it. So you can start by setting mask content ratio to 0 to effectively disable mask expansion, adjust alpha threshold until nothing outside the subject is in the mask, and then adjust the other settings from there.

cpsp

13. enabled

If enabled, Cascade-PSP will be run after segmentation (and before td-abg, if enabled). If disabled, it will not be run.
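Putting the three 'enabled' checkboxes together, the order of stages as I understand it can be sketched as follows. `stage_order` is a hypothetical helper written for illustration, not a function from the extension, and the stage names are just labels.

```python
def stage_order(sam_enabled, cpsp_enabled, tdbgr_enabled):
    """Return the processing stages in the order they appear to run."""
    # Segmentation always happens first: either Segment Anything + CLIP,
    # or anime-seg when the 'sam' checkbox is off.
    stages = ["Segment Anything + CLIP" if sam_enabled else "anime-seg"]
    # Cascade-PSP refines the mask right after segmentation.
    if cpsp_enabled:
        stages.append("Cascade-PSP")
    # td-abg runs last, even though it is listed before cpsp in the UI.
    if tdbgr_enabled:
        stages.append("td-abg")
    return stages
```

For example, `stage_order(False, False, True)` gives `["anime-seg", "td-abg"]`.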

14. fast

15. Memory usage

It's easier to cover both of these together. 'Memory usage' is a misnomer. A better description might be 'maximum native input resolution'.

The TL;DR is roughly:

a. If the input image is less than 'memory usage' pixels wide/high, you may as well use fast mode.
b. If the input image is more than 'memory usage' pixels wide/high, you have three options:
   b.i. use fast mode, with poor image quality.
   b.ii. use fast mode but increase 'memory usage' to a larger value to accommodate the image.
   b.iii. disable fast mode, with lower memory usage than ii but likely better results.
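The same decision can be written as a small helper. This is a sketch of the rule of thumb above, not anything from the extension: `suggest_cascadepsp_settings` is a hypothetical name, and comparing against the longest side of the image is my assumption about what 'wide/high' should mean in practice.

```python
def suggest_cascadepsp_settings(width, height, memory_usage):
    """List the reasonable fast / 'memory usage' choices for an image size."""
    # Assumption: the longer side is what has to fit within 'memory usage'.
    longest_side = max(width, height)
    if longest_side <= memory_usage:
        # Case a: the image already fits, so fast mode costs nothing.
        return ["fast mode at the current 'memory usage'"]
    # Case b: the image is too large, so there are three trade-offs.
    return [
        "fast mode at the current 'memory usage' (poor image quality)",
        f"fast mode with 'memory usage' raised to at least {longest_side}",
        "fast mode disabled (less memory than raising 'memory usage', likely better results)",
    ]
```

For a 2000x1000 image with 'memory usage' at 900, this returns all three options, with the second one suggesting a value of at least 2000.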

If you have the patience and the GPU memory, ~1300 is the optimum value for memory usage.


If the above is accurate there's a bunch of UI cleanup I could do.

mattyamonaca commented 1 year ago

Approximately, your description of the parameters is correct. Any help in cleaning up the UI would be greatly appreciated!