That per-scale score weighting tweak hinted that perhaps the original scale weights were non-optimal. I've adjusted them, and that gave a decent improvement.
but why does it have .powf(1/2^n)?
My tweak. Initially it was .powf(0.5) (MAD is used in the MDSI metric, and they set it to 0.25; EDIT: I forgot, it's a different parameter in MDSI), but then I found out that when comparing only at the first scale (plain SSIM), the best accuracy is with .powf(1) (almost on par with VIF, try it), so I tested it with each scale separately. It's called mean absolute deviation, but you can also use the median or mode, or your own weighted mean.
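For reference, the per-scale pooling being discussed boils down to something like this minimal NumPy sketch (the function and argument names are mine, not the actual dssim code):

import numpy as np

def mad_pool(ssim_map, n):
    # Anchor point: the map's mean raised to 1/2^n. At n=0 this is the plain
    # mean (classic MAD); at deeper scales the exponent shrinks and the anchor
    # drifts towards 1.0.
    anchor = np.mean(ssim_map) ** (0.5 ** n)
    # Pooled score for this scale: mean absolute deviation from that anchor.
    return np.mean(np.abs(ssim_map - anchor))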
I've adjusted them, and that gave a decent improvement.
I wouldn't change the default scale weights based on only one dataset. Also, the Full accuracy on TID2013 is so high with these weights because of the Exotic group (which is pretty much irrelevant).
Finally, according to the TID2013 paper, SSIM is among the best metrics for "Good quality" (Table 6), so maybe add an -ssim switch to calculate SSIM instead of MS-SSIM.
Afaik, this is how these scale weights are calculated:
cpd represents cycles per degree of visual angle, which is determined by the viewing resolution factor.
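The exact derivation depends on the linked code, but CSF-based scale weights generally look something like this sketch (the Mannos-Sakrison CSF and the 16 cpd starting frequency are illustrative assumptions, not values from that code):

import numpy as np

def csf(f):
    # Mannos-Sakrison contrast sensitivity function, f in cycles per degree
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)

# Hypothetical peak frequency seen by each scale: every 2x downsample halves
# the cycles per degree, and the starting value depends on viewing resolution.
cpd = 16.0 / 2 ** np.arange(5)
weights = csf(cpd) / csf(cpd).sum()
print(weights.round(4))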
Yes, indeed I'm worried about overfitting to TID2013. Can you recommend other similar datasets that I could use to verify accuracy?
The most similar is KADID-10k. Only they made a mistake and placed reference images instead of first-level distorted images for 3 distortion types (1st, 3rd and 8th). The most popular are LIVE and CSIQ. ESPL is for some reason not very popular, but seems good.
With my max+avg pooling method my thinking was:
if there's a small but significant distortion in the image (e.g. a single bad pixel), then it actually affects a larger area than itself, because it's likely spoiling the entire object/area it's in. This is why there's a "max" pooling step.
But an image likely contains multiple objects, and one flaw is not disqualifying the whole image, which is why there's an averaging step too.
That solution is not mathematically elegant, but I like it, because it intuitively makes sense to me.
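Concretely, that intuition could look something like this sketch (the block size and the min-then-mean details are a guess for illustration, not necessarily what dssim actually does):

import numpy as np

def max_avg_pool(ssim_map, block=8):
    h, w = ssim_map.shape
    h, w = h - h % block, w - w % block
    blocks = ssim_map[:h, :w].reshape(h // block, block, w // block, block)
    # "max" pooling of the error: keep the worst (lowest) SSIM in each block,
    # so a single bad pixel drags down its whole neighbourhood...
    worst_per_block = blocks.min(axis=(1, 3))
    # ...then average, so one flawed object doesn't disqualify the whole image.
    return worst_per_block.mean()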
I've read how MAD works, but I'm not quite sure why this particular formula, especially with the "tweak", happens to work for pooling SSIM.
My hypothesis is:
For small n in pow(1/2^n), MAD behaves like standard deviation, so it gives worse scores to images with multiple bad areas, but is insensitive to uniformly bad images.
For bigger n, it makes avg approach 1, which makes MAD behave more like a simple average, which lets the other layers still catch uniformly bad images.
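A quick numeric check of that intuition (just arithmetic, nothing from the repo): for a uniformly bad map, the anchor mean(x)**(0.5**n) drifts towards 1 as n grows, so the deviation score approaches 1 - mean(x), i.e. a plain average error:

import numpy as np

x = np.full(1000, 0.7)  # a uniformly "bad" per-pixel ssim map
for n in range(5):
    anchor = np.mean(x) ** (0.5 ** n)
    print(n, round(anchor, 3), round(np.mean(np.abs(x - anchor)), 3))
# n=0: anchor 0.7, score 0.0    (uniform badness is invisible, std-dev-like)
# n=4: anchor ~0.978, score ~0.278  (close to 1 - 0.7 = 0.3, average-like)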
That makes me wonder whether this tweaked MAD just happens to be another way of mixing "worst" vs "average" error pooling, just not within each scale, but across scales.
And I wonder whether the proportion from pow(1/2^n) is ideal, or maybe it should be based on scale weights. This is something to experiment with.
It doesn't behave like std dev (I edited my comment above about MAD in MDSI) and it always makes avg approach 1 (that's the idea), except at the 1st scale. The 1/2^n in pow(1/2^n) is just shorthand for [1, 0.5, 0.25, 0.125, 0.0625][n]; it has no relation to scale weights, and its impact on accuracy and ssim scores is minimal (vs. for example [1.0, 0.5, 0.5, 0.5, 0.5][n]).
At first, after the arithmetic mean, I tried geometric and harmonic means. They made accuracy worse (harmonic was the worst). From Wikipedia: "For all positive data sets containing at least one pair of nonequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between." So I added that exponent to bias avg towards good quality scores. The bigger the downsampling factor, the higher and less spread out the per-pixel ssim scores are (except for some Exotic distortion types, probably), so the exponent value must be lower to be effective.
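A tiny sanity check of that ordering, and of what the exponent does instead (illustrative numbers only):

import numpy as np
from scipy.stats import gmean, hmean

x = np.array([0.6, 0.9, 0.95, 0.99])  # ssim-like scores in (0, 1]
print(x.mean(), gmean(x), hmean(x))   # arithmetic >= geometric >= harmonic
# The exponent pulls the other way: mean(x)**e with e < 1 biases the anchor
# towards 1, i.e. towards "good quality" scores.
for e in [1, 0.5, 0.25, 0.125]:
    print(e, x.mean() ** e)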
That makes me wonder whether this tweaked MAD just happens to be another way of mixing "worst" vs "average" error pooling, just not within each scale, but across scales.
It's still within each scale, not across scales (cross scale pooling is in this block).
And I wonder whether the proportion from pow(1/2^n) is ideal
Maybe not ideal, but I tried different constants (like 1.5, 1.75, 2.25, 2.5) and also the inverse square law formula (1/n^2).
Tested on KADID-10k using scripts from here (I only modified kadid10k.ROCC.py to show per-type and per-level accuracy).
The 1st column is max+avg, the 2nd is MAD with the non-custom scale weights.
Full SROCC: -0.8525045482307988 -0.8562234657395806
Per type:
1: -0.9482709952684674 -0.9524144401242203
2: -0.888920125560901 -0.9036008877848952
3: -0.9361601718850355 -0.942732720555493
4: -0.9035452008410343 -0.9054443006074244
5: -0.7505439463819739 -0.7721170099865738
6: -0.877401725595138 -0.8779770694488326
7: -0.6083002027988245 -0.5992010191419064
8: -0.9202345152541264 -0.9192436523541861
9: -0.9316652477020476 -0.9389697724114447
10: -0.8725177456252896 -0.8922033461740347
11: -0.8973088936601757 -0.8921109535136115
12: -0.9312734837315821 -0.9300191542963452
13: -0.8081662575202928 -0.8028804197128389
14: -0.940544646168691 -0.9413759571962214
15: -0.8984371485985785 -0.8985430184742703
16: -0.9182612634610633 -0.9197175157873034
17: -0.8913491941768678 -0.8932455426471694
18: -0.7350325589908322 -0.7442556609995762
19: -0.9281146136611702 -0.931734634444742
20: -0.36244064860224723 -0.37055703484238806
21: -0.4742315253111955 -0.48728508844891577
22: -0.8585757413960478 -0.8672451350219674
23: -0.5884619420813891 -0.5820224099228524
24: -0.8857312844251738 -0.8900615092376575
25: -0.75146916282513 -0.7677897943743471
Per level:
1: -0.7594226220260838 -0.7568424979839856
2: -0.7859239032568683 -0.7905589936285138
3: -0.7645387478162078 -0.7726706920108036
4: -0.6711213371339377 -0.6899222144730359
5: -0.598736695180533 -0.6224275615532653
kadid10k.ROCC.py:
import pandas as pd

# dssim1 = max+avg pooling scores, dssim2 = MAD pooling scores
df = pd.read_csv('kadid10k.dssim1.csv')
df2 = pd.read_csv('kadid10k.dssim2.csv')

print('Full SROCC:')
print(df[['dmos', 'DSSIM']].corr('spearman').loc["DSSIM"][0])
print(df2[['dmos', 'DSSIM']].corr('spearman').loc["DSSIM"][0])

print('\nFull KROCC:')
print(df[['dmos', 'DSSIM']].corr('kendall').loc["DSSIM"][0])
print(df2[['dmos', 'DSSIM']].corr('kendall').loc["DSSIM"][0])

print("\nPer type:")
for i in range(25):
    # distortion type is the 2nd "_"-separated field of the distorted image name
    a = df['dist_img'].str.split("_", expand=True)[:][1].astype(int).isin([i+1])
    print("{}:".format(i+1))
    print(df[['dmos', 'DSSIM']][a].corr('spearman').loc["DSSIM"][0])
    print(df2[['dmos', 'DSSIM']][a].corr('spearman').loc["DSSIM"][0])
print("\n------------------")

print("\nPer level:")
for i in range(5):
    # distortion level is the 3rd "_"-separated field, before the file extension
    a = df['dist_img'].str.split("_", expand=True)[:][2].str.split(".", expand=True)[:][0].astype(int).isin([i+1])
    print(df[['dmos', 'DSSIM']][a].corr('spearman').loc["DSSIM"][0])
    print(df2[['dmos', 'DSSIM']][a].corr('spearman').loc["DSSIM"][0])
print("\n------------------")
Btw, here is my implementation of MDSI in Python. I removed it from my repo because I didn't like that metric very much. But maybe it's actually good?
Thanks for the data. It looks good.
It makes DSSIM output scores that have a different magnitude than previously, so I'll probably add some rescaling fudge in order to avoid breaking users too much.
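The kind of rescaling fudge meant here could be as simple as this sketch (the constants are made-up placeholders, not actual dssim values):

# Hypothetical: measure a typical score under the old and the new pooling on
# some corpus, then multiply so existing user thresholds keep roughly working.
OLD_TYPICAL, NEW_TYPICAL = 0.0042, 0.0011  # placeholders only

def rescale(new_score):
    return new_score * (OLD_TYPICAL / NEW_TYPICAL)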
Thinking a 5x5 binomial / Gaussian (std=1) down-sampling filter might work better with MAD than the 2x2 average.
I guess it could help a little. IIRC, the change from a 2x box blur to a proper Gaussian kernel in the SSIM blurring helped a little too.
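For reference, a 5x5 binomial downsample could be tried with something like this sketch (not what the script below uses; that one uses gaussian_filter with sigma 1.08):

import numpy as np
from scipy.ndimage import convolve

def binomial_downsample(img):
    # Separable 5-tap binomial kernel [1, 4, 6, 4, 1] / 16 (std = 1, close to
    # a small Gaussian): low-pass filter, then decimate by 2 in each direction.
    k = np.array([1., 4., 6., 4., 1.]) / 16.0
    return convolve(img, np.outer(k, k), mode='nearest')[::2, ::2]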
Made a Python version of this metric (grayscale only) for testing, with an alternative version of MAD.
import sys
from PIL import Image
import numpy as np
from scipy.ndimage import gaussian_filter

# Per-scale weights; all but the first are commented out to test a single scale (SSIM)
WEIGHTS = [0.0448]#, 0.2856, 0.3001, 0.2363, 0.1333]

def msssim(file1, file2):
    img1 = Image.open(file1).convert('RGB')
    img2 = Image.open(file2).convert('RGB')
    width, height = img1.size
    img1 = np.frombuffer(img1.tobytes(), dtype=np.uint8).reshape(height, width, 3) / 255
    img2 = np.frombuffer(img2.tobytes(), dtype=np.uint8).reshape(height, width, 3) / 255
    # sRGB -> linear light
    img1 = np.where(img1 > 0.04045, np.power((img1 + 0.055) / 1.055, 2.4), img1 / 12.92)
    img2 = np.where(img2 > 0.04045, np.power((img2 + 0.055) / 1.055, 2.4), img2 / 12.92)
    # Rec. 709 luminance (grayscale only)
    img1 = 0.2126 * img1[:,:,0] + 0.7152 * img1[:,:,1] + 0.0722 * img1[:,:,2]
    img2 = 0.2126 * img2[:,:,0] + 0.7152 * img2[:,:,1] + 0.0722 * img2[:,:,2]
    mssim = []
    for i in range(len(WEIGHTS)):
        # Full SSIM at the last scale, contrast/structure map at the other scales,
        # computed on gamma-encoded (1/2.2) luma
        mssim.append(ssim(pow(img1, 1./2.2), pow(img2, 1./2.2), i, i < len(WEIGHTS) - 1))
        # Downsample 2x for the next scale (Gaussian low-pass instead of 2x2 average)
        #img1 = avgpooling(img1, 2)
        #img2 = avgpooling(img2, 2)
        img1 = gaussian_filter(img1, 1.08, truncate=1.5)[::2,::2]
        img2 = gaussian_filter(img2, 1.08, truncate=1.5)[::2,::2]
    return np.sum(np.multiply(np.stack(mssim), WEIGHTS)) / np.sum(WEIGHTS)

def mad(x, l):
    # Mean absolute deviation around mean(x) ** (1/2^l)
    return np.mean(np.absolute(x - np.power(np.mean(x), np.power(.5, l)))) # np.mean(np.absolute(x - np.mean(x if l==0 else np.sort(x, axis=None)[-int(x.size//1.5):])))

def ssim(L1, L2, lvl, cs_map):
    C1 = (0.01)**2
    C2 = (0.03)**2
    sd, t = 1.5, 3 # kernel radius = round(sd * truncate)
    mu1 = gaussian_filter(L1, sd, truncate=t)
    mu2 = gaussian_filter(L2, sd, truncate=t)
    mu1_sq = mu1 * mu1
    mu2_sq = mu2 * mu2
    mu1_mu2 = mu1 * mu2
    sigma1_sq = gaussian_filter(L1 * L1, sd, truncate=t) - mu1_sq
    sigma2_sq = gaussian_filter(L2 * L2, sd, truncate=t) - mu2_sq
    sigma12 = gaussian_filter(L1 * L2, sd, truncate=t) - mu1_mu2
    if cs_map:
        value = (2.0*sigma12 + C2)/(sigma1_sq + sigma2_sq + C2)
    else:
        value = ((2.0*mu1_mu2 + C1)*(2.0*sigma12 + C2))/((mu1_sq + mu2_sq + C1)*
                (sigma1_sq + sigma2_sq + C2))
    return mad(value, lvl)

def avgpooling(mat, ksize):
    m, n = mat.shape[:2]
    ny = m // ksize
    nx = n // ksize
    mat_pad = mat[:ny*ksize, :nx*ksize, ...]
    new_shape = (ny, ksize, nx, ksize) + mat.shape[2:]
    result = np.nanmean(mat_pad.reshape(new_shape), axis=(1,3))
    return result

def main():
    # First argument is the reference image, the rest are distorted images
    for arg in sys.argv[2:]:
        score = msssim(sys.argv[1], arg)
        print(str(score) + "\t" + arg)

if __name__ == '__main__':
    main()
That works great, but why does it have .powf(1/2^n)?