phillipi / pix2pix

Image-to-image translation with conditional adversarial nets
https://phillipi.github.io/pix2pix/

Evaluating Cityscapes #148

Closed tychovdo closed 6 years ago

tychovdo commented 6 years ago

Hi,

I'm having difficulties reproducing the results from the CycleGAN paper for the Cityscapes evaluation. For the photo->label classification scores I get very similar results, but for the label->photo FCN score experiment I get really bad results. I used the code from the ./scripts/eval_cityscapes folder and trimmed it down a bit to find the error (see code below): I load a single image from the Cityscapes dataset, resize and preprocess it using the code from the repo, and then perform a forward pass through the pretrained Caffe model.

Unfortunately, the caffe model outputs mostly 0s. Do you have any suggestions?

import caffe
import numpy as np
import scipy.misc  # scipy.misc.imresize requires scipy <= 1.0.x
from PIL import Image

caffemodel_dir = 'caffemodel/'
caffe.set_mode_cpu()
net = caffe.Net(caffemodel_dir + 'deploy.prototxt',
                caffemodel_dir + 'fcn-8s-cityscapes.caffemodel',
                caffe.TEST)

def preprocess(im):
    # RGB -> BGR, subtract the Cityscapes channel means, HWC -> CHW
    in_ = np.array(im, dtype=np.float32)
    in_ = in_[:, :, ::-1]
    in_ -= np.array((72.78044, 83.21195, 73.45286), dtype=np.float32)
    in_ = in_.transpose((2, 0, 1))
    return in_

orig = Image.open('../../../pix2pix/scripts/eval_cityscapes/leftImg8bit/train/dusseldorf/dusseldorf_000087_000019_leftImg8bit.png')
resized = scipy.misc.imresize(np.array(orig), (256, 256))
segmented = segrun(net, preprocess(resized))  # segrun comes from the repo's eval_cityscapes utilities

[Images] Left to right: "orig", "resized", and "segmented".

Thanks in advance.

tinghuiz commented 6 years ago

Are you able to reproduce the ground-truth numbers by running the provided script?

tychovdo commented 6 years ago

Not for the FCN scores. The pre-trained caffe model doesn't seem to give correct outputs.

tinghuiz commented 6 years ago

What's the number you are getting?

tychovdo commented 6 years ago

The provided script does not resize the images down to 256x256 (due to a commented-out line). When I run the script on the ground-truth images in "gtFine/val/frankfurt" and look at the images output by the pretrained model, I get:

input: (1024x2048)

segmentation: (1024x2048)

ground-truth: (1024x2048)

Rescaling the images to 256x256 before feeding them to the pretrained model does not seem to help:

input: (256x256)

segmentation: (256x256)

rescaled segmentation: (256x256)

ground-truth: (1024x2048)

Did you get better looking segmentation masks?

tychovdo commented 6 years ago

Evaluating on the first 20 images in "gtFine/val/frankfurt" using 256x256 scaling results in these scores:

Mean pixel accuracy: 0.424817
Mean class accuracy: 0.054131
Mean class IoU: 0.024102
************ Per class numbers below ************
road           : acc = 0.999520, iou = 0.429499
sidewalk       : acc = 0.000478, iou = 0.000376
building       : acc = 0.025424, iou = 0.025024
wall           : acc = 0.000000, iou = 0.000000
fence          : acc = 0.000000, iou = 0.000000
pole           : acc = 0.000097, iou = 0.000095
traffic light  : acc = 0.000000, iou = 0.000000
traffic sign   : acc = 0.000238, iou = 0.000225
vegetation     : acc = 0.000021, iou = 0.000021
terrain        : acc = 0.000000, iou = 0.000000
sky            : acc = 0.002707, iou = 0.002705
person         : acc = 0.000000, iou = 0.000000
rider          : acc = 0.000000, iou = 0.000000
car            : acc = 0.000000, iou = 0.000000
truck          : acc = 0.000000, iou = 0.000000
bus            : acc = 0.000000, iou = 0.000000
train          : acc = 0.000000, iou = 0.000000
motorcycle     : acc = 0.000000, iou = 0.000000
bicycle        : acc = 0.000000, iou = 0.000000

So, pretty bad, but as expected given that the segmentation mask classifies almost everything as "road".

tinghuiz commented 6 years ago

Just to make sure, to get the ground-truth number, did you first construct a folder of original Cityscapes images resized to 256x256 and then run the provided script without modification?

python ./scripts/eval_cityscapes/evaluate.py --cityscapes_dir /path/to/original/cityscapes/dataset/ --result_dir /path/to/resized/images/ --output_dir /path/to/output/directory/
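For anyone following along, a minimal sketch of that preparation step (my own approximation, not code from the repo; the paths and interpolation choice are placeholders):

import os
from glob import glob

from PIL import Image

src_dir = '/path/to/original/cityscapes/leftImg8bit/val/'  # original 1024x2048 frames
dst_dir = '/path/to/resized/images/'                       # what --result_dir will point to
os.makedirs(dst_dir, exist_ok=True)

for path in glob(os.path.join(src_dir, '*', '*_leftImg8bit.png')):
    im = Image.open(path).convert('RGB')
    small = im.resize((256, 256), Image.BICUBIC)  # PIL expects (width, height)
    # Keep the original file names; check evaluate.py for the exact naming it expects.
    small.save(os.path.join(dst_dir, os.path.basename(path)))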

tychovdo commented 6 years ago

The results above were obtained using a modified version of the script. I have now resized the images to 256x256 and run the provided script without modifications, and I get similar results:

Mean pixel accuracy: 0.429819
Mean class accuracy: 0.054688
Mean class IoU: 0.024783
************ Per class numbers below ************
road           : acc = 0.999237, iou = 0.431893
sidewalk       : acc = 0.001446, iou = 0.001225
building       : acc = 0.031437, iou = 0.030817
wall           : acc = 0.000000, iou = 0.000000
fence          : acc = 0.000000, iou = 0.000000
pole           : acc = 0.000000, iou = 0.000000
traffic light  : acc = 0.000000, iou = 0.000000
traffic sign   : acc = 0.000000, iou = 0.000000
vegetation     : acc = 0.000000, iou = 0.000000
terrain        : acc = 0.000000, iou = 0.000000
sky            : acc = 0.006945, iou = 0.006943
person         : acc = 0.000000, iou = 0.000000
rider          : acc = 0.000000, iou = 0.000000
car            : acc = 0.000000, iou = 0.000000
truck          : acc = 0.000000, iou = 0.000000
bus            : acc = 0.000000, iou = 0.000000
train          : acc = 0.000000, iou = 0.000000
motorcycle     : acc = 0.000000, iou = 0.000000
bicycle        : acc = 0.000000, iou = 0.000000

0_input.jpg (256x256):

0_pred.jpg (256x256):

0_gt.jpg (256x256):

tinghuiz commented 6 years ago

These are also numbers from the first 20 images? Is it possible for you to run on the entire test set or does it take too long?

tychovdo commented 6 years ago

What does seem to work is rescaling the images to 256x256 and then resizing them back to the original resolution (1024x2048) before feeding them to the network (as suggested by @FishYuLi).
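For reference, a rough sketch of that down-then-up resizing (my own approximation of the workaround, reusing net, preprocess, and segrun from the snippet at the top of the thread; the file name is a placeholder and an older scipy with imread/imresize is assumed):

import numpy as np
import scipy.misc  # imread/imresize only exist in older scipy (<= 1.0.x)

im = scipy.misc.imread('frankfurt_000000_000294_leftImg8bit.png')  # placeholder path
small = scipy.misc.imresize(im, (256, 256))          # simulate the 256x256 working resolution
back_up = scipy.misc.imresize(small, (1024, 2048))   # scipy.misc.imresize takes (height, width)
pred = segrun(net, preprocess(back_up))              # one forward pass at label resolution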

I get the following segmentations:

And these scores on the frankfurt images:

Mean pixel accuracy: 0.807152
Mean class accuracy: 0.252765
Mean class IoU: 0.204740
************ Per class numbers below ************
road           : acc = 0.921280, iou = 0.883280
sidewalk       : acc = 0.397364, iou = 0.273601
building       : acc = 0.925965, iou = 0.615736
wall           : acc = 0.000053, iou = 0.000051
fence          : acc = 0.000208, iou = 0.000207
pole           : acc = 0.003642, iou = 0.003605
traffic light  : acc = 0.000012, iou = 0.000012
traffic sign   : acc = 0.001757, iou = 0.001735
vegetation     : acc = 0.886809, iou = 0.787818
terrain        : acc = 0.199277, iou = 0.190027
sky            : acc = 0.842872, iou = 0.743765
person         : acc = 0.001945, iou = 0.001859
rider          : acc = 0.000000, iou = 0.000000
car            : acc = 0.621356, iou = 0.388359
truck          : acc = 0.000000, iou = 0.000000
bus            : acc = 0.000000, iou = 0.000000
train          : acc = 0.000000, iou = 0.000000
motorcycle     : acc = 0.000000, iou = 0.000000
bicycle        : acc = 0.000000, iou = 0.000000

tinghuiz commented 6 years ago

Glad that it worked out. But if you have a folder of 256x256 images, this line should do the scaling back to the original resolution for you. Did you need to do an extra scaling before running the code?

tychovdo commented 6 years ago

Yes, thanks.

@tinghuiz that’s right (if you resize the images to 256x256 and keep the labels/ground-truth segmentations in their original higher resolution).

MoeinSorkhei commented 4 years ago

Hi @tychovdo,

I have read the discussion here and the discussion here regarding generating the FCN score. Having followed what you did, I am still unable to get meaningful predictions from the FCN model. I took the original validation images from the Cityscapes dataset (1024x2048), resized them to 256x256, and then resized them back to 1024x2048 before giving them to the model. I am using the resize function from skimage.transform because the scipy.misc.imresize function is deprecated. I am getting the following prediction as an example (the third image being the prediction). Do you have any thoughts on this?

I appreciate your time.

0_input

0_gt 0_pred

junyanz commented 4 years ago

We don't resize the ground truth prediction. Please see this note for more details.

The pre-trained model is not supposed to work on Cityscapes at the original resolution (1024x2048), as it was trained on 256x256 images that are upsampled to 1024x2048. The purpose of the resizing was to 1) keep the label maps untouched at their original high resolution and 2) avoid the need to change the standard FCN training code for Cityscapes. To get the ground-truth numbers in the paper, you need to resize the original Cityscapes images to 256x256 before running the evaluation code.

MoeinSorkhei commented 4 years ago

Thanks for your response. Yes, exactly, I carefully read your updated notes on evaluating on Cityscapes. I am resizing the real images to 256x256 (with the resize function of the PIL package) before running the script and keeping the labels/segmentations untouched. The only changes I made to your script are:

  1. I use the resize function of the PIL package rather than scipy.misc.imresize, since it is deprecated (a sketch of these substitutions follows below).
  2. I save the images with imsave from the skimage.io library, since scipy.misc.imsave is deprecated.
  3. I run Caffe on CPU.
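A sketch of substitutions along those lines (my own approximation, not the script's code). One pitfall worth flagging is the argument order: PIL.Image.resize takes (width, height) while scipy.misc.imresize takes (height, width), which matters for non-square sizes such as 1024x2048.

import numpy as np
from PIL import Image
from skimage.io import imsave

def imresize_like_scipy(arr, size):
    # Drop-in stand-in for scipy.misc.imresize; `size` is (height, width).
    h, w = size
    return np.array(Image.fromarray(arr).resize((w, h), Image.BILINEAR))

def imsave_like_scipy(path, arr):
    # Drop-in stand-in for scipy.misc.imsave.
    imsave(path, arr)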

To make sure the problem is not from saving, I used np.bincount to check the different labels in the output of the Caffe model for the first image of the Frankfurt city in the validation set, and here is the frequency of the generated labels: [(0, 2096488), (1, 247), (2, 14), (8, 3), (10, 1), (13, 399)].
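A small snippet in the spirit of that check (the 'score' blob name and variable names are assumptions; the repo's segrun already performs the argmax step):

import numpy as np

# Count how often each class index appears in the FCN's argmax output.
label_map = net.blobs['score'].data[0].argmax(axis=0)  # blob name is an assumption
counts = np.bincount(label_map.astype(np.int64).ravel())
print([(cls, int(n)) for cls, n in enumerate(counts) if n > 0])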

So my problem is mainly with the output of the semantic classifier. I will further investigate it, as I see in other threads that some people have managed to solve the issue (@FishYuLi I would be happy if you have any thoughts on this).

tinghuiz commented 4 years ago

Hi @MoeinSorkhei, just a guess: is it possible that your PIL resize function scales the range of pixel values differently than scipy.misc.imresize? E.g., resize in PIL might convert uint8 [0, 255] to float [0, 1]?
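One quick way to test that guess (a standalone check; the file name is a placeholder):

import numpy as np
from PIL import Image

im = Image.open('some_leftImg8bit.png').convert('RGB')  # placeholder path
arr = np.array(im.resize((256, 256), Image.BILINEAR))
# uint8 in [0, 255] is what the mean subtraction in preprocess() expects;
# float values in [0, 1] would silently break it.
print(arr.dtype, arr.min(), arr.max())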

MoeinSorkhei commented 4 years ago

Hi @tinghuiz , Thanks for the suggestion.

I investigated it, and indeed the range of the output of PIL.Image.resize() is 0-255, so I believe it does not convert to float. I have been using the Caffe installed from the Anaconda repository, but I will now try to follow the installation steps on the official website, although I think this should not make a difference.

MoeinSorkhei commented 4 years ago

Hi,

I am giving an update in case this might be helpful to someone: I was finally able to get numbers similar to the paper (for original images) for the first few images in the validation set.

What I did was install Caffe (with GPU support) from this repository and use exactly the scipy.misc.imsave and scipy.misc.imresize functions for saving and resizing the images, respectively (as in the code). I used scipy=1.0.0, in which these functions are available.

Although the corresponding PIL functions (for resizing and saving images) seem to be functionally similar to those of scipy, I was only able to reproduce similar numbers by using the scipy functions.

ErikVester commented 4 years ago

> I am giving an update in case this might be helpful to someone: I was finally able to get numbers similar to the paper (for original images) for the first few images in the validation set. […]

Hi,

Did you run into any problems with memory while running the Caffe model? I am running it on a GPU with 12GB and instantly get an out-of-memory error, even when I try to run it on a small data set. It is the following error: Check failed: error == cudaSuccess (2 vs. 0) out of memory.

Any help would be highly appreciated!

Kind regards,

Erik

MoeinSorkhei commented 4 years ago

> Did you run into any problems with memory while running the Caffe model? I am running it on a GPU with 12GB and instantly get an out-of-memory error, even when I try to run it on a small data set. It is the following error: Check failed: error == cudaSuccess (2 vs. 0) out of memory.

Hi,

You should not get this error if you evaluate one image at a time. Are you using the provided code for evaluation? In it, the images are evaluated one by one in a for loop, and the GPU that I used (with 11GB of memory) was able to perform the forward pass for evaluating the images.
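For context, the pattern is roughly the following (a simplified sketch of the per-image loop, not the actual evaluate.py code; result_dir, net, preprocess, and segrun are assumed from earlier in the thread):

import os
from glob import glob

import scipy.misc  # older scipy (<= 1.0.x) for imread/imresize

for path in sorted(glob(os.path.join(result_dir, '*.png'))):
    im = scipy.misc.imread(path)
    im = scipy.misc.imresize(im, (1024, 2048))   # back up to label resolution
    pred = segrun(net, preprocess(im))           # one forward pass per image
    # ...accumulate per-class statistics here before moving to the next image...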

ErikVester commented 4 years ago

> You should not get this error if you evaluate one image at a time. Are you using the provided code for evaluation? In it, the images are evaluated one by one in a for loop, and the GPU that I used (with 11GB of memory) was able to perform the forward pass.

Hi,

Thanks for the quick reply! I am running it on Colab, which should give 12GB (or even more, I think). I do run it with the provided evaluate.py file (under scripts/eval_cityscapes), which has the loop in it. I also followed your tips for the resizing, thanks for that! It is weird that it worked for you with 11GB; it must be something else then.

I am currently running it on CPU, which takes quite long, but it does seem to stay within the limit of 25GB of memory (it uses 23GB now).

Just to be sure, you also only resized (to 256x256) the images in leftImg8bit and the ones ending in _color.png in gtFine?

Kind regards, Erik

MoeinSorkhei commented 4 years ago

Hi,

Actually, the amount of CPU memory that I allocate for running this job is at most 15GB, so I think you are doing something unnecessary here. No, the images ending in _color.png should not be resized at all (as mentioned in the instructions of the repository). Only the leftImg8bit images are resized to 256x256 (before running the script), and in the script they are automatically resized back to the size of the _color.png images (which is 1024x2048).

Best, Moein

ErikVester commented 4 years ago

Hi,

Thanks again! I have got that working now. Last question: did you also resize the results from testing, or do you keep those at the original size as well?

Kind regards, Erik

MoeinSorkhei commented 4 years ago

> Thanks again! I have got that working now. Last question: did you also resize the results from testing, or do you keep those at the original size as well?

Hi,

What exactly do you mean by the results from testing?

ErikVester commented 4 years ago

> What exactly do you mean by the results from testing?

The output of our trained model.

MoeinSorkhei commented 4 years ago

> The output of our trained model.

If you mean the images that are to be evaluated by the FCN model, the answer is yes. Every generated image that is to be evaluated by the FCN model should be of size 256x256.

Let me know if I am still misunderstanding your question.

junyanz commented 4 years ago

We updated the evaluation description. It might help.

ErikVester commented 4 years ago

Hi,

All clear now, thank you for the help! And thanks for updating the description, it definitely helps. I thought I would still share my final results. I get the following accuracies for the model that uses cGAN + L1: [image]

So they are somewhat higher than the values in the paper. I will leave it at these results for now, but could these be reasonable, or is something definitely still wrong here and should they have been closer to the values in the paper?

Kind regards,

Erik

junyanz commented 4 years ago

It looks reasonable. Our paper's numbers are based on models trained with the Torch repo. We expect a slight difference between PyTorch models and Torch models, sometimes better, sometimes worse.

ErikVester commented 4 years ago

Ok, great! Thanks again for the help.

MoeinSorkhei commented 4 years ago

Hi,

I am using another generative model to generate images of different sizes. When I generate 128x256 images (height 128, width 256), the FCN score is reasonable. However, when I evaluate generated images of size 256x512, I get scores that are higher than the ground truth. I thought evaluating 256x512 images with your FCN model would be OK because I resize all the generated images to 256x256 before feeding them to the FCN model. It seems I can only evaluate images that are actually 256x256 at generation time, and resizing images to 256x256 after generation (before feeding them into the FCN model) produces wrong results. Do you have any thoughts on this?

These are the numbers I get:

Image size    Mean pixel acc.    Mean class acc.    Mean class IoU
128x256       0.735              0.238              0.198
256x512       0.845              0.292              0.247

And this is the ground truth at 256x256 (similar to the paper):

Mean pixel acc.    Mean class acc.    Mean class IoU
0.8                0.26               0.21

I appreciate your thoughts.