linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License
230 stars 34 forks source link

I find a bug in svmr evaluation #29

Closed hgzjy25 closed 3 years ago

hgzjy25 commented 3 years ago
def post_processing_svmr_nms(
        svmr_res, nms_thd=0.6, max_before_nms=1000, max_after_nms=100):
    """
    svmr_res: list(dict), each dict is
        {"desc": str,
         "desc_id": int,
         "predictions": list(sublist)  # each sublist is
            [video_idx (int), st (float), ed(float), score (float)],
            video_idx is the same.
         }
    """
    processed_svmr_res = []
    for e in svmr_res:
        # the predictions are sorted inside the nms func.
        e["predictions"] = temporal_non_maximum_suppression(
            e["predictions"][:max_before_nms],
            nms_threshold=nms_thd)[:max_after_nms]
        processed_svmr_res.append(e)
    return processed_svmr_res
def temporal_non_maximum_suppression(predictions, nms_threshold,
                                     max_after_nms=100):
    """
    Args:
        predictions:
            list(sublist), each sublist is
            [st (float), ed(float), score (float)],
            note larger scores are better and are preserved.
            For metrics that are better when smaller,
            please convert to its negative,
            e.g., convert distance to negative distance.
        nms_threshold: float in [0, 1]
        max_after_nms:
    Returns:
        predictions_after_nms:
        list(sublist),
        each sublist is [st (float), ed(float), score (float)]
    References:
        https://github.com/wzmsltw/BSN-boundary-sensitive-network/blob/7b101fc5978802aa3c95ba5779eb54151c6173c6/Post_processing.py#L42
    """
    if len(predictions) == 1:  # only has one prediction, no need for nms
        return predictions

    predictions = sorted(predictions, key=lambda x: x[2],
                         reverse=True)  # descending order

    tstart = [e[0] for e in predictions]
    tend = [e[1] for e in predictions]
    tscore = [e[2] for e in predictions]
    rstart = []
    rend = []
    rscore = []
    while len(tstart) > 1 and len(rscore) < max_after_nms:  # max 100 after nms
        idx = 1
        while idx < len(tstart):  # compare with every prediction in the list.
            if compute_temporal_iou(
                    [tstart[0], tend[0]],
                    [tstart[idx], tend[idx]]) > nms_threshold:
                # rm highly overlapped lower score entries.
                tstart.pop(idx)
                tend.pop(idx)
                tscore.pop(idx)
            else:
                # move to next
                idx += 1
        rstart.append(tstart.pop(0))
        rend.append(tend.pop(0))
        rscore.append(tscore.pop(0))

    if (len(rscore) < max_after_nms
            and len(tstart) >= 1):  # add the last, possibly empty.
        rstart.append(tstart.pop(0))
        rend.append(tend.pop(0))
        rscore.append(tscore.pop(0))

    predictions_after_nms = [
        [st, ed, s] for s, st, ed in zip(rscore, rstart, rend)]
    return predictions_after_nms

e['predictions'] in post_processing_svmr_nms has four elements [video_idx (int), st (float), ed(float), score (float)] each sublist, and the function temporal_non_maximum_suppression requires argument predictions with three elements [st (float), ed(float), score (float)] each sublist. So tstart, tend, tscore will be video_idx, st, ed, respectively. I fix it by myself and test the code. But it seems that this problem have no effect on vcmr results?

hgzjy25 commented 3 years ago

So the Moment Retrieval results presented in table 5 is wrong, R@1 on TVR dataset should be up to 15%.

linjieli222 commented 3 years ago

Hi @hgzjy25,

Thanks! It seems that you are right. We directly adopted this part of the code from the official release of TVRetrieval without further validation. Sorry about that. We have double confirmed with the original authors of TVR and they have it fixed here: https://github.com/jayleicn/TVRetrieval/commit/aec36bc3aa63d22ec5137b1f4a320257caf4d0dd, which will be reflected in HERO shortly.

VCMR should not be affected, as the results are preprocessed first by https://github.com/linjieli222/HERO/blob/f938515424b5f3249fc1d2e7f0373f64112a6529/utils/tvr_eval_utils.py#L132

linjieli222 commented 3 years ago

fixed with https://github.com/linjieli222/HERO/commit/cc99e0417c779e6d50eaacce41d4292dc9cd9368

hgzjy25 commented 3 years ago

Thank you for your quick reply and action. But there still remains a mistake. _video_id = e["predictions"][0] should be _video_id = e["predictions"][0][0], otherwise _video_id will be the first sublist in e["predictions"] instead of the real video idx.