jayleicn / TVRetrieval

[ECCV 2020] PyTorch code for XML on TVRetrieval dataset - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
https://tvr.cs.unc.edu
MIT License

Implementation details of the DiDeMo experiments with the XML baseline (your method) #5

Closed. okisy closed this issue 3 years ago.

okisy commented 3 years ago

Hi there! Thanks for sharing your great work. It seems you conducted experiments on the DiDeMo dataset, where no subtitle information is available, to check the performance of your method. I have a couple of questions about it.

  1. Clip length of the input features (in this case, ResNet). In the main experiments of your paper, the TVR features are divided and fed into the model with a clip length of 1.5 sec. Is that also the case for the DiDeMo dataset, or did you treat the features differently from TVR?

  2. How the timestamp information is handled at both training and inference time (for training, also regarding TEF). In the DiDeMo dataset, the moment timestamps are given as segment indices (0-5). Did you translate them into seconds, i.e., (0 sec - 30 sec), or did you use the indices as the timestamp information as is?

If there is any information I missed about the DiDeMo dataset, please let me know as well. Thank you in advance!

jayleicn commented 3 years ago

Hi @okisy,

Thanks for your interest in our work! Yes, DiDeMo was one of our experiments. For your questions:

  1. We use ResNet-152 features max-pooled over 2.5-second clips, which is half of the DiDeMo segment length of 5 seconds. At inference time, we constrain the model to output st/ed times that are multiples of 5 rather than 2.5.
  2. Yes. Because these indices are associated with exact 5-second-long segments, we can directly map them to seconds, e.g., index [1, 1] --> seconds [5, 10].
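
In code, that mapping is just the following (a quick sketch; the helper name is mine, not something from this repo):

def didemo_indices_to_seconds(st_idx, ed_idx, segment_len=5):
    """Map DiDeMo segment indices (0-5) to [start, end] in seconds.
    Both indices are inclusive, so the end index covers its whole 5-second segment.
    """
    return st_idx * segment_len, (ed_idx + 1) * segment_len

print(didemo_indices_to_seconds(1, 1))  # (5, 10), i.e., index [1, 1] --> seconds [5, 10]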

Hope it helps! Jie

okisy commented 3 years ago

Hi @jayleicn, thanks for your quick response. To understand your settings more clearly, I would like to ask a few more questions.

Details about how the index information is transformed. The DiDeMo dataset includes videos shorter than 30 sec; for instance, say we have a 28-sec video. If the index is given as [5, 5], how did you associate it with second-level information? Did you simply regard it as [25 sec, 30 sec], or, using the clip-length information, as [25 sec, 27.5 sec]? (I misunderstood your answer here, so please ignore this question.)

  1. Whether the code for this transformation is in your repository. This question is simple: is there code for the transformation above, so I can check what is done specifically? More generally, are there training, inference, or evaluation scripts for XML on DiDeMo in your repository?

  2. The effect of clip length on the results. I found that the clip length you mentioned matches the config here ( https://github.com/jayleicn/TVRetrieval/blob/34777b4bf9814feb04ded89668e2b0b4e432cc1b/baselines/clip_alignment_with_language/local_utils/proposal.py#L116 ). I guess this config is for reproducing the paper "Temporal Localization of Moments in Video Collections with Natural Language". Did you try other clip lengths for the datasets, including TVR? If so, can you tell me how large the effect is? If not, is it fair to assume you simply tested XML following that paper's settings?

Thanks! Sho

jayleicn commented 3 years ago

Hi Sho,

  1. The DiDeMo code is not included in this repo, but the codebase should be directly usable if you prepare DiDeMo features and data files in the same format as TVR (a rough sketch of the clip-level feature pooling follows after this list). The evaluation code actually supports DiDeMo evaluation: https://github.com/jayleicn/TVRetrieval/blob/master/standalone_eval/eval.py#L154. You can also follow the same process to make this codebase work for ActivityNet and CharadesSTA if you are interested; we tried them briefly when we started this project.

  2. Yes, you are right! That config was used to reproduce CAL. We did not try other clip lengths; it might be interesting to investigate how they affect the final performance, so definitely let me know if you find something here :). And yes, we simply tested XML following that paper's settings and did not really tune many of the hyper-parameters, so I guess there is still a lot of room for improvement with the current settings.
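
As for the feature preparation mentioned in point 1, something along these lines should be roughly what is needed (a sketch only, not code from this repo; the frame-feature extraction rate is up to you):

import numpy as np

def pool_clip_features(frame_feats, frames_per_clip):
    """Max-pool frame-level ResNet-152 features into clip-level features.

    frame_feats: (n_frames, 2048) array of per-frame features.
    frames_per_clip: number of frames covering one 2.5-second clip
        (depends on your frame/feature extraction rate, an assumption here).
    Returns an (n_clips, 2048) array of clip-level features.
    """
    n_clips = int(np.ceil(len(frame_feats) / frames_per_clip))
    return np.stack(
        [frame_feats[i * frames_per_clip:(i + 1) * frames_per_clip].max(axis=0)
         for i in range(n_clips)],
        axis=0,
    )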

Best, Jie

okisy commented 3 years ago

Thank you for your kindness. I'll take a look.

okisy commented 3 years ago

Hi Jie,

At inference time, we constrain the model to output st/ed times that are multiples of 5 rather than 2.5.

Is this operation implemented in your code in this repository? I tried to run your code on the DiDeMo dataset, but the SVMR and VCMR evaluation clearly goes wrong (the scores are extremely low), while VR seems to work well. I think this is because the DiDeMo-specific processing is missing as the code stands.

Yes. Because these indices are associated with exact 5-second-long segments, we can directly map them to seconds, e.g., index [1, 1] --> seconds [5, 10].

Also, I am wondering where this mapping is implemented.

If I need to implement it myself, can you tell me which part I should add it to? Best, Sho

jayleicn commented 3 years ago

Is this operation implemented in your code in this repository?

Nope, but it should be easy to implement by directly masking out the entries whose indices are not multiples of 2. (Since we use 2.5-second clips, a 5-second segment boundary corresponds to an even clip index.)

For example, you can create a mask to zero out these entries using the following function:

import numpy as np

def get_didemo_mask(array_shape):
    """Build a mask that keeps only even start/end indices.
    We use 2.5s clips for DiDeMo, while its moments are annotated at multiples
    of 5 seconds, e.g., [0, 5, 10, 15, ...], so the generated st/ed indices
    must also fall on even clip positions.
    Args:
        array_shape: shape of the st_ed probability array; the last two
            dimensions should be the same, i.e., (..., L, L).
    """
    n_single_dims = len(array_shape) - 2
    mask_shape = array_shape[-2:]
    mask_array = np.ones(mask_shape, dtype=np.float32)  # (L, L)
    mask_array[1::2, :] = 0  # zero out rows at odd positions
    mask_array[:, 1::2] = 0  # zero out columns at odd positions
    return mask_array.reshape((1, ) * n_single_dims + mask_shape)

and multiply the generated mask with st_ed_prob_product here: https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/inference.py#L221
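
For example (the exact shape and type of st_ed_prob_product at that point is an assumption; adapt as needed):

mask = get_didemo_mask(st_ed_prob_product.shape)  # broadcastable (..., L, L) mask
# if st_ed_prob_product is a torch tensor rather than a numpy array:
# mask = torch.from_numpy(mask).to(st_ed_prob_product.device)
st_ed_prob_product = st_ed_prob_product * mask  # zero out odd st/ed positions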

Also, I am wondering where this mapping is implemented.

Nope, it is not implemented in this repo; it is a pre-processing step that should be done before the data is loaded by the dataloader.
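
As a sketch of that pre-processing (the DiDeMo field names below are from memory of the public annotation files, and the output keys are only illustrative rather than the exact TVR data format, so double-check both against your files):

import json

SEGMENT_LEN = 5  # each DiDeMo segment is 5 seconds long

def convert_didemo_annotations(didemo_json_path):
    """Convert DiDeMo segment-index annotations to second-based timestamps."""
    with open(didemo_json_path, "r") as f:
        annotations = json.load(f)
    converted = []
    for ann in annotations:
        # DiDeMo stores several annotator timestamps per description as
        # [start_idx, end_idx] segment-index pairs (field names assumed here).
        ts_in_seconds = [[st * SEGMENT_LEN, (ed + 1) * SEGMENT_LEN]
                         for st, ed in ann["times"]]
        converted.append({
            "desc": ann["description"],   # illustrative keys, not necessarily
            "vid_name": ann["video"],     # the exact TVR data format
            "ts": ts_in_seconds,
        })
    return converted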

okisy commented 3 years ago

I really appreciate your quick response. Thanks to your advice, I managed to run your code and got more reasonable scores. However, the scores are still a bit lower than those reported in the paper, e.g., around 30.0 for VCMR-0.5 R@100 and 17.0 for VCMR-0.7 R@100. Did you use hyperparameters different from those for TVR? If you have any idea about this performance gap, please let me know.

jayleicn commented 3 years ago

Hi Sho,

We also added the following flags: --max_ctx_l 12 --min_pred_l 2 --max_pred_l 4. All the rest of the configurations should be the same as TVR. Let me know whether this works for you.
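
For reference, these values line up with the 2.5-second clip length used for DiDeMo; the per-flag comments below are just that arithmetic, not documentation of the flags themselves:

clip_len = 2.5                 # seconds per clip for the DiDeMo features
max_ctx_l = 12                 # 12 clips * 2.5 s = 30 s, the max DiDeMo video length
min_pred_l, max_pred_l = 2, 4  # predicted moments of 2-4 clips, i.e., 5-10 s
assert max_ctx_l * clip_len == 30.0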

Best, Jie