Are you able to reproduce the ground-truth numbers by running the provided script?
Not for the FCN scores. The pre-trained caffe model doesn't seem to give correct outputs.
What's the number you are getting?
The provided script does not resize the images down to 256x256 (due to a commented-out line). When I run the script on the ground-truth images in "gtFine/val/frankfurt" and look at the images output by the pretrained model, I get:
input:(1024x2048)
segmentation:(1024x2048)
ground-truth: (1024x2048)
Rescaling the images to 256x256 before feeding them to the pretrained model does not seem to help:
input: (256x256)
segmentation: (256x256)
rescaled segmentation: (256x256)
ground-truth: (1024x2048)
Did you get better looking segmentation masks?
Evaluating on the first 20 images in "gtFine/val/frankfurt" using 256x256 scaling results in these scores:
Mean pixel accuracy: 0.424817
Mean class accuracy: 0.054131
Mean class IoU: 0.024102
************ Per class numbers below ************
road : acc = 0.999520, iou = 0.429499
sidewalk : acc = 0.000478, iou = 0.000376
building : acc = 0.025424, iou = 0.025024
wall : acc = 0.000000, iou = 0.000000
fence : acc = 0.000000, iou = 0.000000
pole : acc = 0.000097, iou = 0.000095
traffic light : acc = 0.000000, iou = 0.000000
traffic sign : acc = 0.000238, iou = 0.000225
vegetation : acc = 0.000021, iou = 0.000021
terrain : acc = 0.000000, iou = 0.000000
sky : acc = 0.002707, iou = 0.002705
person : acc = 0.000000, iou = 0.000000
rider : acc = 0.000000, iou = 0.000000
car : acc = 0.000000, iou = 0.000000
truck : acc = 0.000000, iou = 0.000000
bus : acc = 0.000000, iou = 0.000000
train : acc = 0.000000, iou = 0.000000
motorcycle : acc = 0.000000, iou = 0.000000
bicycle : acc = 0.000000, iou = 0.000000
So, pretty bad, but expected given that the segmentation mask classifies almost everything as "road".
Just to make sure, to get the ground-truth number, did you first construct a folder of original Cityscapes images resized to 256x256 and then run the provided script without modification?
python ./scripts/eval_cityscapes/evaluate.py --cityscapes_dir /path/to/original/cityscapes/dataset/ --result_dir /path/to/resized/images/ --output_dir /path/to/output/directory/
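For example, the resized folder could be prepared with something like this (a minimal sketch, not part of the repo; the paths are placeholders and the choice of PIL for resizing is an assumption):

```python
import os
from PIL import Image

# Placeholder paths -- adjust to your local Cityscapes layout.
src_dir = '/path/to/original/cityscapes/leftImg8bit/val/frankfurt/'
dst_dir = '/path/to/resized/images/'
if not os.path.exists(dst_dir):
    os.makedirs(dst_dir)

for name in sorted(os.listdir(src_dir)):
    if not name.endswith('.png'):
        continue
    im = Image.open(os.path.join(src_dir, name)).convert('RGB')
    # Downsample the 1024x2048 frame to 256x256; the evaluation script later
    # upsamples the prediction back to the label-map resolution for scoring.
    im.resize((256, 256), Image.BICUBIC).save(os.path.join(dst_dir, name))
```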
The results above were obtained using a modified version of the script. I have now resized the images to 256x256, run the provided script without modifications, and get similar results:
Mean pixel accuracy: 0.429819
Mean class accuracy: 0.054688
Mean class IoU: 0.024783
************ Per class numbers below ************
road : acc = 0.999237, iou = 0.431893
sidewalk : acc = 0.001446, iou = 0.001225
building : acc = 0.031437, iou = 0.030817
wall : acc = 0.000000, iou = 0.000000
fence : acc = 0.000000, iou = 0.000000
pole : acc = 0.000000, iou = 0.000000
traffic light : acc = 0.000000, iou = 0.000000
traffic sign : acc = 0.000000, iou = 0.000000
vegetation : acc = 0.000000, iou = 0.000000
terrain : acc = 0.000000, iou = 0.000000
sky : acc = 0.006945, iou = 0.006943
person : acc = 0.000000, iou = 0.000000
rider : acc = 0.000000, iou = 0.000000
car : acc = 0.000000, iou = 0.000000
truck : acc = 0.000000, iou = 0.000000
bus : acc = 0.000000, iou = 0.000000
train : acc = 0.000000, iou = 0.000000
motorcycle : acc = 0.000000, iou = 0.000000
bicycle : acc = 0.000000, iou = 0.000000
0_input.jpg (256x256):
0_pred.jpg (256x256):
0_gt.jpg (256x256):
Are these also numbers from the first 20 images? Is it possible for you to run on the entire test set, or does it take too long?
What does seem to work is rescaling the images to 256x256 and then resizing them back to the original resolution (1024x2048) before feeding them to the network (as suggested by @FishYuLi).
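Concretely, that down-then-up step is roughly the following (a sketch using scipy.misc.imresize, which the original script relies on; the file name is a placeholder):

```python
import scipy.misc  # needs an older scipy (e.g. 1.0.0) where imread/imresize still exist

im = scipy.misc.imread('frankfurt_000000_000294_leftImg8bit.png')  # placeholder file name

# Downscale to the 256x256 resolution the models were trained on...
small = scipy.misc.imresize(im, (256, 256))
# ...then upsample back to 1024x2048 before the forward pass, so the FCN
# prediction lines up with the untouched full-resolution label maps.
large = scipy.misc.imresize(small, (1024, 2048))
```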
I get the following segmentations:
And these scores on the frankfurt images:
Mean pixel accuracy: 0.807152
Mean class accuracy: 0.252765
Mean class IoU: 0.204740
************ Per class numbers below ************
road : acc = 0.921280, iou = 0.883280
sidewalk : acc = 0.397364, iou = 0.273601
building : acc = 0.925965, iou = 0.615736
wall : acc = 0.000053, iou = 0.000051
fence : acc = 0.000208, iou = 0.000207
pole : acc = 0.003642, iou = 0.003605
traffic light : acc = 0.000012, iou = 0.000012
traffic sign : acc = 0.001757, iou = 0.001735
vegetation : acc = 0.886809, iou = 0.787818
terrain : acc = 0.199277, iou = 0.190027
sky : acc = 0.842872, iou = 0.743765
person : acc = 0.001945, iou = 0.001859
rider : acc = 0.000000, iou = 0.000000
car : acc = 0.621356, iou = 0.388359
truck : acc = 0.000000, iou = 0.000000
bus : acc = 0.000000, iou = 0.000000
train : acc = 0.000000, iou = 0.000000
motorcycle : acc = 0.000000, iou = 0.000000
bicycle : acc = 0.000000, iou = 0.000000
Glad that it worked out. But if you have a folder of 256x256 images, this line should do the scaling to the original resolution for you. Did you need to do an extra scaling step before running the code?
Yes, thanks.
@tinghuiz that’s right (if you resize the images to 256x256 and keep the labels/ground-truth segmentations in their original higher resolution).
Hi @tychovdo,
I have read the discussion here and the discussion here regarding generating the FCN score. Having followed what you did, I am still unable to get meaningful predictions from the FCN model. I am simply taking the original validation images from the Cityscapes dataset (1024x2048), resizing them to 256x256, and then resizing them back to 1024x2048 before giving them to the model. I am using the resize function from skimage.transform because the scipy.misc.imresize function is deprecated. I am getting the following prediction as an example (the third line being the prediction). Do you have any thoughts on this? (I am not using bilinear interpolation, as the labels are integers and not RGB values.) I appreciate your time.
We don't resize the ground-truth label maps. Please see this note for more details.
The pre-trained model is not supposed to work on Cityscapes at the original resolution (1024x2048), as it was trained on 256x256 images that are upsampled to 1024x2048. The purpose of the resizing was to 1) keep the label maps in the original high resolution untouched and 2) avoid the need to change the standard FCN training code for Cityscapes. To get the ground-truth numbers in the paper, you need to resize the original Cityscapes images to 256x256 before running the evaluation code.
Thanks for your response.
Yes, exactly, I carefully read your updated notes on evaluating on Cityscapes. I am resizing the real images to 256x256 (with the resize function of the PIL package) before running the script and keeping the labels/segmentations untouched. The only changes I made to your script are:

- Using the resize function of the PIL package rather than scipy.misc.imresize, since it is deprecated.
- Using imsave of the skimage.io library, since scipy.misc.imsave is deprecated.

To make sure the problem is not in saving, I used np.bincount to check the labels in the output of the Caffe model for the first image of Frankfurt city in the validation set, and here is the frequency of the generated labels: [(0, 2096488), (1, 247), (2, 14), (8, 3), (10, 1), (13, 399)].
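The check itself was just along these lines (a sketch; the .npy dump is hypothetical and stands in for the HxW array of class indices coming out of the model):

```python
import numpy as np

# Hypothetical dump of the model's argmax output (HxW array of class indices).
pred = np.load('frankfurt_000000_pred.npy')

# Count how often each of the 19 Cityscapes training classes is predicted.
counts = np.bincount(pred.flatten(), minlength=19)
print([(label, int(n)) for label, n in enumerate(counts) if n > 0])
```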
So my problem is mainly with the output of the semantic classifier. I will further investigate it, as I see in other threads that some people have managed to solve the issue (@FishYuLi I would be happy if you have any thoughts on this).
Hi @MoeinSorkhei, just a guess: is it possible that your resize function of PIL scales the range of pixel values differently than scipy.misc.imresize? E.g., resize in PIL might convert uint8 [0,255] to float [0,1]?
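A quick way to rule that out would be something like the following (a sketch; the image path is a placeholder):

```python
import numpy as np
from PIL import Image

im = Image.open('frankfurt_000000_000294_leftImg8bit.png')  # placeholder path
resized = np.asarray(im.resize((256, 256), Image.BILINEAR))

# If PIL keeps the uint8 [0, 255] range, dtype stays uint8 and max is close to 255;
# a float array with max near 1.0 would explain the degenerate FCN output.
print(resized.dtype, resized.min(), resized.max())
```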
Hi @tinghuiz, thanks for the suggestion.
I investigated it, and indeed the range of the output of PIL.Image.resize() is between 0 and 255, so I believe it does not convert to float.
I have been using the Caffe installed from the Anaconda repository, but I will now try to follow the installation steps on the official website, although I think this should not make a difference.
Hi,
I am giving an update in case this might be helpful to someone: I was finally able to get numbers similar to the paper (for original images) for the first few images in the validation set.
What I did was to install Caffe (with GPU support) from this repository, and to use exactly the scipy.misc.imsave and scipy.misc.imresize functions for saving and resizing the images, respectively (as in the code). I used scipy=1.0.0, in which these functions are available.
Although the corresponding PIL functions (for resizing and saving images) seem to be functionally similar to those of scipy, I was able to reproduce similar numbers by using only the scipy functions.
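Concretely, the calls are roughly these (a sketch with placeholder file names, assuming scipy==1.0.0):

```python
import scipy.misc  # scipy==1.0.0, where imread/imresize/imsave are still available

im = scipy.misc.imread('frankfurt_000000_000294_leftImg8bit.png')    # placeholder input
small = scipy.misc.imresize(im, (256, 256))                          # resize with scipy, as in the repo's code
scipy.misc.imsave('frankfurt_000000_000294_resized.png', small)      # save with scipy instead of PIL/skimage
```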
Hi,
Did you run into any problems with memory whilst running the caffe model? I am running it on a GPU with 12GB and instantly get an out of memory error, even when I try to run it on a small data set. It is the following error: Check failed: error == cudaSuccess (2 vs. 0) out of memory.
Any help would be highly appreciated!
Kind regards,
Erik
Hi,
You should not get this error if you evaluate 1 image at a time. Are you using the provided code for evaluation? In that script, the images are evaluated one by one in a for loop, and the GPU that I used (with 11GB of memory) was able to perform the forward pass for evaluating the images.
Hi,
Thanks for the quick reply! I am running it on Colab, which should give 12GB (or even more, I think). I do run it with the provided evaluate.py file (under scripts/eval_cityscapes), which has the loop in it. I also followed your tips for the resizing, thanks for that! Weird that it worked for you with 11GB; it must be something else then.
I am currently running it on CPU, which takes quite long but does seem to stay within the 25GB memory limit (it uses 23GB now).
Just to be sure, did you also only resize (to 256x256) the images in leftImg8bit and the ones ending in _color.png in gtFine?
Kind regards, Erik
Hi,
Actually the amount of CPU memory that I allocate for running this job is at most 15GB, so I think you are doing something unnecessary here. No, the images ending in _color.png should not be resized at all (as mentioned in the instructions of the repository). Only the leftImg8bit images are resized to 256x256 (before running the script), and in the script, they are automatically resized back to the size of the _color.png images (which is 1024x2048).
Best, Moein
Hi,
Thanks again! I have got that working now. Last question: did you also resize the results from testing, or do you keep those at the original size as well?
Kind regards, Erik
Hi,
What exactly do you mean by the results from testing?
The output of our trained model.
If you mean the images that are to be evaluated by the FCN model, the answer is yes. Every generated image that is to be evaluated by the FCN model should be of size 256x256.
Let me know if I am still misunderstanding your question.
We updated the evaluation description. It might help.
Hi,
All clear now, thank you for the help! And thanks for updating the description, it definitely helps. Thought I would still share my final results. I get the following accuracies for the model that uses cGAN + L1:
So somewhat higher than the values in the paper. I will leave it at these results for now, but could these be reasonable results, or is something definitely still wrong here and should they have been closer to the values in the paper?
Kind regards,
Erik
It looks reasonable. Our paper's numbers are based on models trained with the Torch repo. We expect a slight difference between PyTorch models and Torch models, sometimes better, sometimes worse.
Ok, great! Thanks again for the help.
Hi,
I am using another generative model to generate images of different sizes. When I generate 128x256 (height 128, width 256), the FCN score is reasonable. However, when I evaluate generated images of size 256x512, I get scores that are higher than the ground truth. I thought evaluating 256x512 images with your FCN model would be OK because I resize all the generated images to 256x256 before feeding them to the FCN model. It seems I can only evaluate images that are actually of size 256x256 at generation time, and resizing images to 256x256 after generation (before feeding them into the FCN model) gives wrong results. Do you have any thoughts on this?
These are the numbers I get:

Image size | Mean pixel acc. | Mean class acc. | Mean class IoU
---|---|---|---
128x256 | 0.735 | 0.238 | 0.198
256x512 | 0.845 | 0.292 | 0.247

And this is the ground truth at 256x256 (similar to the paper):

Mean pixel acc. | Mean class acc. | Mean class IoU
---|---|---
0.8 | 0.26 | 0.21
I appreciate your thought.
Hi,
I'm having difficulties reproducing the results from the CycleGAN paper for the Cityscapes evaluation. For the city->label classification scores I get very similar results, but for the label->photo FCN score experiment I get really bad results. I used the code from the ./scripts/eval_cityscapes folder and trimmed it down a bit to find the error (see code below): I load a single image from the Cityscapes dataset, resize and preprocess it using the code from the repo, and then perform a forward pass through the pretrained Caffe model.
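The trimmed-down snippet is not reproduced here, but it boils down to roughly the following (a sketch, not the exact code; the paths, mean values, and output blob name are assumptions):

```python
import numpy as np
import scipy.misc
import caffe

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt', 'fcn-8s-cityscapes.caffemodel', caffe.TEST)  # placeholder paths

im = scipy.misc.imread('frankfurt_000000_000294_leftImg8bit.png')               # placeholder input
im = scipy.misc.imresize(scipy.misc.imresize(im, (256, 256)), (1024, 2048))     # down- then up-scale

# Caffe-style preprocessing: RGB -> BGR, subtract the training mean, HWC -> CHW.
mean_bgr = np.array([104.0, 117.0, 123.0], dtype=np.float32)  # placeholder; use the mean from the repo's util code
blob = im[:, :, ::-1].astype(np.float32) - mean_bgr
blob = blob.transpose((2, 0, 1))[np.newaxis, ...]

net.blobs['data'].reshape(*blob.shape)
net.blobs['data'].data[...] = blob
net.forward()
pred = net.blobs['score'].data[0].argmax(axis=0)  # 'score' is an assumed output blob name
print(np.bincount(pred.flatten()))                # inspect which classes dominate
```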
Unfortunately, the caffe model outputs mostly 0s. Do you have any suggestions?
^Left to right: "orig", "resized" and "segmented"
Thanks in advance.