tSchlegl / f-AnoGAN

Code for reproducing f-AnoGAN training and anomaly scoring
MIT License

How to interpret the output CSV file? #10


gpcarnielli commented 4 years ago

Hello. I'm trying to understand the CSV file generated as the output of the anomaly detection step.

There are three columns in that file: the first is related to an internal variable called is_anom, and the other two are distances that presumably express how likely an image is to contain an anomaly.

I have two questions about this information:

  1. How can I tell which line of the file is associated with a fake image (an image with an anomaly)? Initially, I thought the first column indicated that, but in my tests I only got zeros.
  2. How should I make sense of the distance values? In other words, how can I interpret those numbers in order to identify anomalies?

Thanks in advance!

jzkoh commented 4 years ago

I am also facing the same problem. How do we interpret the CSV file? Is there a way to match the results with the file names of the images, so that we can associate each anomaly score with its image?

Also, the paper reports AUC, precision, sensitivity, specificity, and F1 score (Table A.1 in the appendix), as well as a distribution of anomaly scores in Appendix B.3. Does anyone know how to code them?

Thank you!

thiagoribeirodamotta commented 4 years ago

So, I'm not 100% sure on this, but this is my take on anomaly_detection.py:

When running this script, we face the following loop:

    for is_anom, _gen in enumerate([test_gen(), ano_gen()])

which yields is_anom=0 with _gen = test_gen() on the first pass, and then is_anom=1 with _gen = ano_gen() on the second.

So is_anom is not really a prediction of any kind; it is merely a flag indicating which generator was used.

_dist_z seems to be computed with the chosen metric (MSE by default), applied in the latent space produced by the Encoder. I'm guessing it's the formula in Section 2.3 of the paper that comes right after Equation 4 and just before the mention of Fig. 3b.

_img_dist also uses the chosen metric, but it's the MSE between the generated image (recon_image) and the normalized input data (real_data_norm).
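
In case it helps, here is a minimal numpy sketch of that reading of the two distances. The encode() and generate() callables are placeholders for the trained encoder and generator, not names from this repo, and the latent-residual form of _dist_z is my guess:

    import numpy as np

    def mse(a, b):
        # Mean squared error over all elements.
        return np.mean((a - b) ** 2)

    def anomaly_distances(real_data_norm, encode, generate):
        z = encode(real_data_norm)    # latent code E(x)
        recon_image = generate(z)     # reconstruction G(E(x))
        # _img_dist: pixel-space residual between reconstruction and input.
        img_dist = mse(recon_image, real_data_norm)
        # _dist_z: residual in latent space (my guess: E(x) vs E(G(E(x)))).
        dist_z = mse(encode(recon_image), z)
        return img_dist, dist_z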

To interpret the rows, I did the following:

    # Paired lists of file paths: files[0] holds the test (non-anomalous)
    # samples, files[1] the anomalous ones.
    files = lib.img_loader.get_nr_test_samples_files()
    test_set_name = 'test(non-anom)'
    anom_set_name = 'anom'
    # One set label per file, in the order the generators emit them.
    files_set = [test_set_name] * len(files[0]) + [anom_set_name] * len(files[1])
    files = files[0] + files[1]

I did the following before the aforementioned for loop:

    fileIdx = 0
    # Write the CSV header once; rows are appended inside the loop. Note the
    # last header column is never filled by the row-writing code below.
    with open(log_meta_path, "a", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(["Set", "Filename", "IsAnom", "ImgDist", "DistZ", "Anom and Set Match?"])

I modified the writer.writerow call to look like this:

    with open(log_meta_path, "a", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        # One CSV row per image: set label, filename, generator flag, and
        # the two distances for that image.
        for di, dz in zip(_img_dist, _dist_z):
            writer.writerow(
                [files_set[fileIdx],
                 os.path.basename(files[fileIdx]),
                 is_anom, di, dz])
            fileIdx += 1

Now the first column holds the labels 'test(non-anom)' and 'anom' depending on the generator used, the second column is the filename of the input image, and the three remaining columns (IsAnom, ImgDist, DistZ) are as they were.
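
With those changes, each row of the CSV would look roughly like this (filenames and values are made up purely for illustration; note the sixth header column stays empty because each row only contains five values):

    Set,Filename,IsAnom,ImgDist,DistZ,Anom and Set Match?
    test(non-anom),img_0001.png,0,0.0123,0.0045
    anom,img_0501.png,1,0.0892,0.0317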

It is still not clear to me how to use these values to actively label an image as anomalous or normal. What I am still trying to do is generate images composed of the difference between the real data and the generated data, then perhaps calculate the total area of the blobs in those difference images and see whether it correlates with _dist_z and _img_dist.
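
A minimal sketch of that blob-area idea, assuming both images are normalized numpy arrays and using a hand-picked threshold (the 0.1 default is arbitrary, not something from this repo):

    import numpy as np

    def blob_area(real_data_norm, recon_image, threshold=0.1):
        # Absolute pixel-wise residual between input and reconstruction.
        residual = np.abs(real_data_norm - recon_image)
        # Pixels whose residual exceeds the threshold count as anomalous.
        mask = residual > threshold
        # Total anomalous area, in pixels.
        return int(mask.sum())

The per-image area could then be plotted against _img_dist and _dist_z to see whether they correlate.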

AUC, precision, sensitivity, specificity, F1 score, and the other metrics do not seem to be computed anywhere in this code.
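
If it helps, here is a rough sketch of how those metrics could be computed from the modified CSV above with scikit-learn (assuming scikit-learn is installed; the median threshold is just a placeholder, in practice you would tune it):

    import csv
    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score

    log_meta_path = "log_meta.csv"  # placeholder; use the path the script wrote to

    labels, scores = [], []
    with open(log_meta_path) as f:
        for row in csv.DictReader(f):
            labels.append(int(row["IsAnom"]))
            scores.append(float(row["ImgDist"]))  # or DistZ, or a combination

    labels = np.array(labels)
    scores = np.array(scores)

    # AUC is threshold-free: it only needs scores and ground-truth labels.
    auc = roc_auc_score(labels, scores)

    # The remaining metrics need a hard threshold on the scores.
    pred = (scores > np.median(scores)).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # a.k.a. recall / true positive rate
    specificity = tn / (tn + fp)
    f1 = f1_score(labels, pred)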

heidarinejad commented 3 years ago

Are there any new findings on how to interpret the CSV data to label the images that contain anomalies? Even the number of scored lines inside the CSV file, which I assume should equal the number of images inside the anomal and test folders, doesn't match. Any type of info is appreciated.