titu1994 / neural-image-assessment

Implementation of NIMA: Neural Image Assessment in Keras
MIT License

Wrong loss while computing mean. #3

qzchenwl closed this issue 6 years ago

qzchenwl commented 6 years ago

Suppose we have 3 samples, with y_truth and y_pred as follows:

import numpy as np
import tensorflow as tf
from keras import backend as K

y_truth = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0]])

y_pred = np.array([
    [0, 0, 0, 0, 0, 0, 0.9, 0, 0.1, 0],
    [0.9, 0, 0, 0, 0, 0, 0, 0, 0.1, 0],
    [0.8, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1]])

Each sample has its own EMD. Let's calculate them.

sess = tf.Session()

def emd_sample_wise(y_truth, y_pred):
    return K.sqrt(K.mean(K.square(K.abs(K.cumsum(y_truth, axis=-1) - K.cumsum(y_pred, axis=-1))), axis=-1))

sess.run(emd_sample_wise(y_truth, y_pred))
## array([ 0.28460499,  0.75299402,  0.67082039])

Looks right. First sample has the smallest loss.

Now, we need the mean of them, which is (0.28460499 + 0.75299402 + 0.67082039)/3 = 0.56947313

def emd(y_truth, y_pred):
    return K.mean(emd_sample_wise(y_truth, y_pred))

sess.run(emd(y_truth, y_pred))
## 0.56947313551525303

Comparing with the original version:

def earth_mover_loss(y_true, y_pred):
    return K.sqrt(K.mean(K.square(K.abs(K.cumsum(y_true, axis=-1) - K.cumsum(y_pred, axis=-1)))))

sess.run(earth_mover_loss(y_truth, y_pred))
## 0.60497933849016694

That's incorrect: it takes the square root after averaging over both the bins and the batch, instead of averaging the per-sample EMDs.
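The fix is to combine the two functions above into a single loss, i.e. compute the EMD per sample first and only then average over the batch:

def earth_mover_loss(y_true, y_pred):
    # CDF difference per bin, EMD per sample (mean over bins, then sqrt),
    # then average over the batch.
    cdf_diff = K.cumsum(y_true, axis=-1) - K.cumsum(y_pred, axis=-1)
    emd_per_sample = K.sqrt(K.mean(K.square(K.abs(cdf_diff)), axis=-1))
    return K.mean(emd_per_sample)

sess.run(earth_mover_loss(y_truth, y_pred))
## 0.56947313551525303  (same as the sample-wise mean above)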

titu1994 commented 6 years ago

Damn. By the paper, this fix is correct, since the loss there is defined for a single sample rather than a batch of samples.

I'm just about done training the current 15th epoch with the loss from yesterday. I am already seeing improvements in the current model from the updated loss measure (from #2).

However, at this point, I'm not about to retrain the entire model with this new loss again. I'll probably fine tune the current model for 10 more epochs on this loss, hopefully that gets some improvements.

titu1994 commented 6 years ago

For now, I've pushed an update with the new weights trained for 25 epochs on the loss from #2.

I'll fine tune that model for a further 10 epochs on the correction from here. The EMD loss measure has been updated in the current version.

titu1994 commented 6 years ago

Weights have been updated, and so has the image. Scores seem a little more varied now, though I don't agree with the scores for the bottom-right-most image (that's subjective, though, I suppose).

Final loss for MobileNet was close to 0.0804 EMD on the train set and 0.0805 EMD on the val set, though this is not directly comparable with the paper, since I used a far smaller validation set than they did.

qzchenwl commented 6 years ago

@titu1994 What validation set does the paper use?

This result is better than NIMA(MobileNet) in the paper, which is 0.081.

BTW, how do you generate images like images/NIMA.jpg and images/NIMA2.jpg?

qzchenwl commented 6 years ago

I think the image should be resized to (224, 224) before being fed to model.predict. evaluate_mobilenet.py#L22
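For example, something along these lines (just a sketch; the preprocessing and the mean/std computation follow the NIMA paper, and model is assumed to be the loaded NIMA MobileNet):

import numpy as np
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.mobilenet import preprocess_input

img = load_img('images/art1.jpg', target_size=(224, 224))   # resize here
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))

probs = model.predict(x)[0]                 # 10-bin score distribution
bins = np.arange(1, 11)
mean = np.sum(probs * bins)                 # NIMA mean score
std = np.sqrt(np.sum(probs * (bins - mean) ** 2))
print('NIMA Score : %0.3f +- (%0.3f)' % (mean, std))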

titu1994 commented 6 years ago

That's not necessary. It's only during training time that you need to have a fixed image size of that shape.

During inference, it's fine to have larger images, since global average pooling reduces the feature maps to a fixed-length vector before the dense layer anyway.
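Roughly, the model can be built with an undefined spatial size, something like this (a sketch, not necessarily the exact code in this repo; the dropout rate and weights path are assumptions):

from keras.applications.mobilenet import MobileNet
from keras.layers import Dense, Dropout
from keras.models import Model

# No fixed height/width: global average pooling collapses whatever spatial size
# comes in to a single 1024-d feature vector.
base_model = MobileNet(input_shape=(None, None, 3), include_top=False,
                       pooling='avg', weights=None)
x = Dropout(0.75)(base_model.output)
x = Dense(10, activation='softmax')(x)
model = Model(base_model.input, x)
model.load_weights('weights/mobilenet_weights.h5')   # assumed path
# Images of any size (>= 32x32) can now be passed at inference time.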

tfriedel commented 6 years ago

I did a small experiment using some extreme images (very good and very bad scores), and the scores I got when resizing seemed better: the good ones had scores higher than 6 and the bad ones around 3-4.x. When I did no resizing, the scores for the good ones were around 5.3 and the bad ones around 4.1. Note that I used an Inception ResNet V2 network, which got me a loss of 0.070. I think the images should be resized: during training the finer details are not available, so if they are available during evaluation, they are of no use. Also, inference on a big image will of course be slower.

I also noticed you are resizing the images to a square. This changes the aspect ratio. It's possible that this is a problem, because the proportions will change, and if the network picks up on things like the rule of thirds, for example, it will make an error. However, the other option, resizing and adding black bars, might have its own problems (black bars confusing the network?). My intuition is that it might not have a big effect, but I would go with not changing the aspect ratio.
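To make the two options concrete, here's a quick sketch with PIL (the function names are mine):

from PIL import Image

def squash_resize(img, size=224):
    # Option 1: ignore the aspect ratio and stretch/squash straight to size x size.
    return img.resize((size, size), Image.BILINEAR)

def letterbox_resize(img, size=224):
    # Option 2: keep the aspect ratio, scale the longer side to `size`,
    # and pad the rest with black bars to get a square.
    w, h = img.size
    scale = float(size) / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new('RGB', (size, size))  # black canvas
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas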

titu1994 commented 6 years ago

Interesting results. I'll try MobileNet with 224x224 scaled images later.

An important problem is that if you scale images while preserving the aspect ratio, it may cause shape issues with NASNet. Also, the original networks were trained on square image crops. There should be no major change in scores from squashing the image to a square.

titu1994 commented 6 years ago

Also, would you care to share the trained Inception ResNet V2 weights ? Inference is still possible even on very small GPUs.

titu1994 commented 6 years ago

By weights, I mean only the final layers' weights, not the entire model's weights, which would require several hundred MB.

tfriedel commented 6 years ago

I did try training only the final layers, but got a loss of only about 0.085 that way. Later I did a couple of passes over the whole network with pretty good results (though very slow... I had to use batch_size=16, and every epoch took 3 hours). Here are the weights for the whole model: https://drive.google.com/open?id=15UCuYrT65Hdiu57iEo17Z4bZYR8O_B9l

There's also different resizing code, which preserves the aspect ratio but adds black bars on the shorter side to make the final image square.

titu1994 commented 6 years ago

Thanks a lot for the weights ! I'm assuming this is the Inception ResNet V2 model from the Keras master repo ?

Also, as I thought, fine-tuning the entire network seems to be important for the larger models such as VGG, Inception, etc. Even NASNet stalls at exactly 0.085.

Funny how MobileNet can reach 0.0804 without full fine-tuning, but NASNet Mobile and Inception ResNet V2 can't do the same.

titu1994 commented 6 years ago

Also, why is there a need to use black-bar padding ? A simple 224x224 resize is sufficient, no?

titu1994 commented 6 years ago

@tfriedel Hmm, this is in fact quite weird. Look at the scores for the 6 images from the README, reading left to right as 1-6. These are the scores according to the Inception ResNet V2 model :

With resizing to 224x224 -

Evaluating : images/art1.jpg NIMA Score : 5.453 +- (1.519)

Evaluating : images/art2.jpg NIMA Score : 4.677 +- (1.550)

Evaluating : images/art3.jpg NIMA Score : 4.974 +- (1.530)

Evaluating : images/art4.jpg NIMA Score : 5.641 +- (1.554)

Evaluating : images/art5.jpg NIMA Score : 6.182 +- (1.460)

Evaluating : images/art6.jpg NIMA Score : 6.326 +- (1.443)

I certainly don't think The Starry Night by Van Gogh (2) is that inferior in perception to the fantasy painting from a game (6), the blue strokes in oil paint (4), or another fantasy anime painting (5)..

If we don't resize the images, and run them at full resolution, these are the scores :

Evaluating : images/art1.jpg NIMA Score : 4.035 +- (1.697)

Evaluating : images/art2.jpg NIMA Score : 5.024 +- (1.874)

Evaluating : images/art3.jpg NIMA Score : 4.825 +- (1.726)

Evaluating : images/art4.jpg NIMA Score : 5.740 +- (1.678)

Evaluating : images/art5.jpg NIMA Score : 5.526 +- (1.800)

Evaluating : images/art6.jpg NIMA Score : 4.088 +- (1.736)

I consider these a more accurate representation of the scores. I do, once again, disagree with The Starry Night (2) scoring lower than the blue strokes in oil paint (4) and the fantasy anime art (5).

How was this network trained ? Was it trained using the resize script which pads with black after resizing, to preserve the aspect ratio? That may be the reason for such weird scores.

tfriedel commented 6 years ago

Yes, it's the Inception ResNet V2 from the Keras master repo.

Yeah, interesting that you got these good scores with MobileNet by only training the FC layers. I'm currently fine-tuning NASNet Mobile; it seems to be almost as good as Inception ResNet V2 already. Although it's not a fair comparison, since I didn't train Inception ResNet V2 very long (only about 2.5 epochs on the full net).

I did the padding because I didn't want to distort the images. Like I said, the proportions get destroyed, and I thought that might hurt the scores a little bit. Anyway, the proper way to find out if that's true is to run an experiment. Or look if somebody already did one. Actually, somebody did; their method is different, but maybe their results translate to this point: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Mai_Composition-Preserving_Deep_Photo_CVPR_2016_paper.pdf They got these results:

VGG-Crop 71.2% 0.83 0.66
VGG-Scale 73.8% 0.83 0.74
VGG-Pad 72.9% 0.83 0.73

So scaling is a little bit better than padding & scaling.

The images were scaled to 256x256 with the shorter side padded and then randomly cropped to 224x224.
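In other words, roughly this kind of train-time crop (a simplified sketch, not my exact code):

import numpy as np

def random_crop(img_array, crop_size=224):
    # img_array: H x W x C numpy array, e.g. a 256x256 padded image.
    h, w = img_array.shape[:2]
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return img_array[top:top + crop_size, left:left + crop_size]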

I do a sanity check by comparing the scores I get with the actual scores of images from the dataset, since everything else is more subjective. I'm not even sure if there are paintings or video game renderings in the dataset. It seems more like a photo database, and I assume it will not easily translate to other domains. This has been a conclusion in some of the other aesthetics papers.

So this is the output of my sanity test:

pred: 6.757026312407106 actual: 8.200000052340329
pred: 6.5784116628346965 actual: 8.144654086790979
pred: 6.3219472915516235 actual: 7.768656808882952
pred: 6.56873318873113 actual: 7.854771941900253
pred: 6.416181717271684 actual: 7.721951327286661
pred: 6.262039204360917 actual: 7.341853000223637
pred: 6.481540510663763 actual: 8.19387736171484
pred: 6.561742648889776 actual: 7.796078726649284
pred: 6.185421533999033 actual: 7.743421067483723

pred: 4.52153702173382 actual: 2.764258526265621
pred: 4.678500899113715 actual: 2.5428571277298033
pred: 4.1977894899901 actual: 2.9330144026316702
pred: 4.614314584992826 actual: 2.9318181835114956
pred: 3.6885588343720883 actual: 3.3403362492099404
pred: 5.7214088421314955 actual: 2.619565237313509
pred: 3.4074650693219155 actual: 2.5792079218663275
pred: 3.7801399384625256 actual: 2.2962963059544563
pred: 3.8599004802526906 actual: 1.9887640252709389

Sorry the actual image numbers are not printed, if you need them, I can modify the code and run it again.

titu1994 commented 6 years ago

Hmm, interesting. I understand that the network has not learned enough to reduce the "distance" between each individual image's ground truth and its predicted score (which would mean it overfit).

More importantly, the model wasn't trained on many paintings, but on a large set of abstract, high-quality art. I was hoping that would transfer to generic art, but it may not be the case.

Anyway, two important findings are that MobileNet doesn't improve much (only a 0.0007 improvement), even when I fine-tune the entire network, and that larger, more powerful models tend to do better, up to a certain point. Inception ResNet V2 is more powerful than Inception v2 and VGG. However, training on a single GPU, at 3 hours per epoch, is just ridiculous, no matter how fast inference may be.

Can I ask: did you start off with training the entire network from scratch? Or did you take the approach of first extracting the features, training just the classifier, and then training the entire network with the pre-trained classifier? I think the stalling at 0.085 when pre-training the classifier happens because it overfits to the network's features.

Seemingly, if you were able to get ~0.07 with just 2.5 epochs of full training, then perhaps with more careful pre-training (with regularization, as long as it doesn't overfit) the results might converge faster?

titu1994 commented 6 years ago

Also, this may very well be the reason for the weird scores: "The images were scaled to 256x256 with the shorter side padded and then randomly cropped to 224x224."

I don't know if this is a good strategy at all really. By the paper you show, seems simply scaling the image down performs better than what you suggest with the pad and crop.

tfriedel commented 6 years ago

I do "warm up" the network by first generating network features, saving them to bcolz files, and then training a small network consisting only of the top layers. In the case of the Inception ResNet V2 network, I then fine-tuned more of the top layers, because this is faster than training the whole network. However, I got almost no improvement from doing this. So then I fine-tuned the whole network, and this resulted in big improvements.
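Roughly, the warmup stage looks like this (a simplified sketch: I actually cache to bcolz and use generators, and train_images, train_labels and earth_mover_loss are assumed to exist; the dropout rate is just an example):

import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.layers import Dense, Dropout, Input
from keras.models import Model

# Step 1: run the frozen ImageNet base once over the training images and cache the features.
base = InceptionResNetV2(include_top=False, pooling='avg',
                         input_shape=(224, 224, 3), weights='imagenet')
features = base.predict(train_images, batch_size=32, verbose=1)
np.save('train_features.npy', features)

# Step 2: train only a small top model on the cached features.
inp = Input(shape=(features.shape[-1],))
x = Dropout(0.75)(inp)
out = Dense(10, activation='softmax')(x)
top = Model(inp, out)
top.compile(optimizer='adam', loss=earth_mover_loss)
top.fit(features, train_labels, batch_size=128, epochs=20)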

"I don't know if this is a good strategy at all really. By the paper you show, seems simply scaling the image down performs better than what you suggest with the pad and crop." Yeah I know, I did read that while the training was already running. So yeah, maybe I can do an experiment later by finetuning the network trained on padded&scaled images to only scaled images.

So far NASNet Mobile seems pretty good: loss: 0.058, val_loss: 0.067. It might be a bit overfitted, since val_loss hasn't been improving much in the last epochs.

I think I'm going to implement an accuracy function like SRCC, since just comparing the predicted scores with the actual scores is not a good way of measuring performance. As long as the predicted scores sort the images in the correct order, who cares about their absolute values, right?
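In code, the idea is something like this (a rough sketch assuming scipy and probability-normalized 10-bin distributions; the actual implementation may differ):

import numpy as np
from scipy.stats import spearmanr

def mean_score(dist):
    # dist: (N, 10) array of normalized score histograms over bins 1..10.
    return np.sum(dist * np.arange(1, 11), axis=-1)

def srcc(y_true, y_pred):
    # Spearman's rank correlation between ground-truth and predicted mean scores:
    # only the ordering of the images matters, not the absolute values.
    rho, _ = spearmanr(mean_score(y_true), mean_score(y_pred))
    return rho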

titu1994 commented 6 years ago

Warming up the network in such a manner is quite efficient, and I do it as well for NASNet, so that's not a big problem. Of course, I couldn't then fine-tune the net, since it is so large that it simply doesn't fit in GPU memory.

If you are able to update the Inception network with scaling only, I think it would do better. But that depends on whether you have the time and resources, and are inclined to train it again at 3 hours per epoch.

NASNet right now is a mobile network, and it surpasses MobileNets significantly. I think you can stop the training process if it has not improved in the last 5 epochs. That's generally my criterion. Also, if you don't mind, could you share the NASNet weights ?

I'm wondering, if NASNet Mobile can beat Inception, just how much more powerful the full NASNet would be. In that regard, I'm starting up a VM on GCP to train the full NASNet. Let's hope the results are good.

I've actually been wondering how the paper actually got "accuracy" as a metric for this problem. If you manage to implement such a scoring metric, could you add a PR here ?

tfriedel commented 6 years ago

I think the Inception network might have been trained into a dead end. The authors got to 0.050 on Inception v2, so I think it should be possible with Inception ResNet V2 too. But I used a high learning rate since I wanted fast results; going with a proper learning rate schedule, again from the original ImageNet weights, could get me there. I will do the scaling vs padding experiment with NASNet Mobile, since at this point it seems like the better network and, tbh, I don't want to wait for the other one to train to a good accuracy. I might give it another shot during the next nights. I wouldn't say it's fair to say NASNet "beats" Inception, since I don't think I trained the Inception net to its full potential. But yeah, the full NASNet is maybe the best network here too.

Here's the trained nasnet mobile so far: https://drive.google.com/open?id=15UCuYrT65Hdiu57iEo17Z4bZYR8O_B9l

Yeah, I will make a PR for the scoring metric if I'm successful. As far as I understand, the SRCC metric basically rewards the items being ordered correctly. They also had an accuracy metric which divides images into two classes, "above average" and "below average". Tbh, scores around 80% for this problem don't look that impressive to me. On the other hand, most images will probably classify as "average", so deciding whether an "average" image is "a little bit below average" or "a little bit above average" is again a hard problem in my opinion.

That's why I prefer a ranking-based score, because that's what you probably want: sort your images by score and pick the best ones, or maybe drive some other image-manipulation algorithm that improves the image by tweaking parameters so the score goes up. I can totally see this becoming a feature in camera apps that guides the user towards a better picture, for example by telling them how to move the camera to get a better score.
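For reference, the two-class accuracy mentioned above is roughly this (a sketch; the cutoff of 5 on the 1-10 mean score is the one commonly used for AVA, and the mean scores are assumed to be precomputed):

import numpy as np

def two_class_accuracy(true_mean_scores, pred_mean_scores, cutoff=5.0):
    # "Above average" vs "below average" classification accuracy.
    return np.mean((true_mean_scores > cutoff) == (pred_mean_scores > cutoff))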

titu1994 commented 6 years ago

Thanks for the weights ! If I am reading this correctly, these NASNet Mobile weights were also trained with the crop + pad method you mentioned ?

Also, direct comparison with the paper won't be possible actually. They used 55,000 random images as the validation set and trained on only 200,000 images. On the other hand, we train on 250,000 images and validate on just 5000.

The accuracy metric is just weird to me. I believe the ranking metric you suggest is more useful for end-user purposes.

tfriedel commented 6 years ago

"These NASNet Mobile weights were also trained with the crop + pad method you mentioned ?" Right.

OK, we have a bit of an advantage, since we have a couple more images to train on, so the comparison is not completely fair. However, our problem is also easier because of the extra training images. And if we don't even get the scores they get on their validation set on our own training set, we are not reaching the full potential yet. Also, I think having 1/5 more images or less is not going to make a big difference. Btw, I'm not using exactly the same split; I have a validation set of 10,000. But I don't think it's super important.

titu1994 commented 6 years ago

That's true, we should be outperforming their scores. Looking at their learning rate schedule of 0.95 decay every 10 epochs, I think they must have used a massive number of GPUs to train the models for at least 100 epochs, if not more. That's not feasible for us, so we will have to get what we can with as little compute as possible.

That was my initial reasoning behind not following their 20% validation set rule in the first place. I figured that since I won't have nearly enough compute, I'd set things up with more training data so training goes faster.

tfriedel commented 6 years ago

OK, you can get the source for the srcc from my fork. I tried to integrate it into your code. I made this snippet, which should be run after training:

    import numpy as np
    from tqdm import tqdm
    # `model`, `val_generator`, `batchsize` and `srcc` are assumed to be
    # defined elsewhere (srcc comes from the fork mentioned above).

    y_test = []
    y_pred = []
    gen = val_generator(batchsize=batchsize)
    for i in tqdm(range(5000 // batchsize)):
        batch = next(gen)
        y_test.append(batch[1])
        y_pred.append(model.predict_on_batch(batch[0]))
    y_test = np.concatenate(y_test)
    y_pred = np.concatenate(y_pred)
    rho = srcc(y_test, y_pred)
    print("srcc = {}".format(rho))

But I'm getting unfamiliar errors.

Exception ignored in: <generator object val_generator at 0x0000021881CBB570>
RuntimeError: generator ignored GeneratorExit
  0%|                                                                                                                                                                                                                    | 0/25 [00:00<?, ?it/s]
---------------------------------------------------------------------------
FailedPreconditionError                   Traceback (most recent call last)
s:\toolkits\anaconda3-4.4.0\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

s:\toolkits\anaconda3-4.4.0\lib\site-packages\tensorflow\python\client\session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

Since I don't like to use the tensorflow data loader, I can't help you here.

Anyway, the score I got for NASNet was about 0.65, which seems a lot better than in the paper (0.612). Weird.

titu1994 commented 6 years ago

I'll take a look at the function. The generator seems to be yielding None values for some reason.

As for NASNet Large, it seems even the K80 on Google Cloud has trouble with batch sizes above 8. That model is just gargantuan. I'm gonna stick to training the Inception ResNet V2 for a day; let's see what it gets. Though even that is incredibly slow.

titu1994 commented 6 years ago

Could you also share the final NASNet weights ? For EMD, lower is better, so perhaps it is simply the different validation set size that is causing this discrepancy.

Consider 5000 images with many hard images vs 50000 images with a large majority of easy images and some 5000 hard images. Simple averaging would suggest that the easy images would dominate the loss and reduce it a lot on the average.

tfriedel commented 6 years ago

I do get a very similar srcc value if I calculate it on the training set.

The nasnet weights you got are the 'final' weights so far. I stopped since it didn't improve.

I wonder if using a cyclic learning rate could help with the slow training of the big networks. It's supposed to be like a magic weapon where you only need to train for 1-2 epochs to get amazing results. https://github.com/bckenstler/CLR
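If we tried it, usage would look roughly like this (a sketch assuming the CyclicLR callback from clr_callback.py in that repo; model, train_generator and steps_per_epoch are placeholders):

from clr_callback import CyclicLR  # clr_callback.py from https://github.com/bckenstler/CLR

# Cycle the learning rate between base_lr and max_lr; one full cycle is 2 * step_size batches.
clr = CyclicLR(base_lr=1e-5, max_lr=2e-4, step_size=2000., mode='triangular2')
model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=10,
                    callbacks=[clr])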

titu1994 commented 6 years ago

Cyclic learning rate works over several epochs, which would mean several hours of training

tfriedel commented 6 years ago

Check this paper: https://arxiv.org/abs/1708.07120. They claim that if the conditions are right, you only need 1/10th of the number of iterations to get the same results.

titu1994 commented 6 years ago

I've seen that paper. They could not replicate the phenomenon of super-convergence on large datasets like ImageNet or even CIFAR-100. It's meaningless when the quantity of data is so large, and on top of that our task is a regression task rather than a classification task.

tfriedel commented 6 years ago

My experiment is finished, these are the results:

NASNet Mobile trained on images which were resized (preserving aspect ratio) and padded to 256x256, then randomly cropped to 224x224.

The Spearman's rank correlation coefficient (rho) was calculated on a validation set of 10,000 images.

Padded images, center cropped to 224x224 Rho: 0.6573

Padded images, scaled down to 224x224 Rho: 0.6514

This network was then finetuned (over 5 epochs, with decreasing learning rate) on the training set where the images were resized (no padding) to 256x256, then randomly cropped to 224x224.

Resized images, center cropped to 224x224: Rho: 0.6442

Resized images, scaled down to 224x224 Rho: 0.6481

So my conclusion is, padding or resizing doesn't make a big difference, but padding is slightly better. So I'm going with that.

qzchenwl commented 6 years ago

@tfriedel There's a handmade data generator for Keras if you don't want to use the TensorFlow data loader: https://github.com/qzchenwl/neural-image-assessment/blob/master/utils.py
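For reference, a minimal sketch of what such a generator can look like (not the actual code from that file; image_paths and score_dists are assumed to be loaded from the AVA labels):

import numpy as np
from keras.preprocessing.image import load_img, img_to_array

def val_generator(image_paths, score_dists, batchsize=32, target_size=(224, 224)):
    # image_paths: list of file paths; score_dists: (N, 10) normalized score histograms.
    while True:
        for start in range(0, len(image_paths), batchsize):
            paths = image_paths[start:start + batchsize]
            x = np.stack([img_to_array(load_img(p, target_size=target_size))
                          for p in paths])
            x = x / 127.5 - 1.0  # MobileNet-style preprocessing to [-1, 1]
            y = score_dists[start:start + batchsize]
            yield x, y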