princeton-vl / pose-ae-train

Training code for "Associative Embedding: End-to-End Learning for Joint Detection and Grouping"
BSD 3-Clause "New" or "Revised" License

strange behaviours on validation data #11

Closed scihacker closed 6 years ago

scihacker commented 6 years ago

I noticed the loss keeps decreasing while the network is passing over the validation data. So I checked the code and found:

# train.py
for phase in ['train', 'val']:
    num_step = config['train']['{}_iters'.format(phase)]
    generator = data_func(phase)
    print('start', phase, config['opt'].exp)

    show_range = range(num_step)
    show_range = tqdm.tqdm(show_range, total = num_step, ascii=True)
    batch_id = num_step * config['train']['epoch']
    for i in show_range:
        datas = next(generator)
        # phase is 'train' or 'val'
        outs = train_func(batch_id + i, config, phase, **datas)

The trainer is built in the task path: task/pose.py, where

def make_train(batch_id, config, phase, **inputs):
    for i in inputs:
        inputs[i] = make_input(inputs[i])

    net = config['inference']['net']
    config['batch_id'] = batch_id

    if phase != 'inference':
        # [training code elided] -- note: `phase` is not checked again here,
        # so 'train' and 'val' take the same path and both update the weights
        ...
    else:
        out = {}
        net = net.eval()
        result = net(**inputs)
        if type(result)!=list and type(result)!=tuple:
            result = [result]
        out['preds'] = [make_output(i) for i in result]
        return out

So, as far as I can tell, the data in the validation phase is also trained on, just like the training data. Could you explain this to me, please? Thanks.
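For reference, a fix would gate the backward pass and optimizer step on the phase. This is only a minimal sketch of that idea (the function name `run_phase`, the loss, and the optimizer setup are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

def run_phase(net, batches, phase, optimizer):
    """Run one pass over `batches`; update weights only when phase == 'train'."""
    is_train = (phase == 'train')
    net.train(is_train)  # toggle train/eval mode (affects BN, dropout, etc.)
    total = 0.0
    # Disable autograd entirely during validation: no graph is built,
    # so no gradients can leak into a later optimizer step.
    with torch.set_grad_enabled(is_train):
        for x, y in batches:
            loss = ((net(x) - y) ** 2).mean()  # placeholder loss
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item()
    return total / len(batches)
```

With this structure, a 'val' pass still reports the loss but leaves the parameters untouched.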

ahangchen commented 6 years ago

I agree, and I found there is already a PR to fix it. But I'm also wondering whether it is fair to train on the val data at all.

scihacker commented 6 years ago

The PR fixed the bug. Closing the issue.

anewell commented 6 years ago

Wanted to thank you for pointing this out! We tried to get the PR up as quickly as possible - though I forgot to merge it.

Just to clarify, this was something that accidentally slipped through when we were refactoring for the public code release, and we didn't train on validation data on the experiments in our paper.

Yishun99 commented 6 years ago

Has anyone got the new mAP on the validation data?

ahangchen commented 6 years ago

@DouYishun I found that on the latest COCO2017 leaderboard, the umich-vl method scores only 0.46 AP.

http://cocodataset.org/#keypoints-leaderboard

hellojialee commented 6 years ago

@ahangchen Hi! Why did this happen? What would the AP be if the refinement step were removed? Thank you.