google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond
Apache License 2.0

There is no asr data of activitynet #761

Closed kimseongah closed 1 year ago

kimseongah commented 1 year ago

Hello, @antoyang

I'm trying to build an ActivityNet dataset for the Vid2Seq model. While looking at your FrozenBiLM repository, I noticed that ActivityNet's ASR data is available on Google Drive as a subtitles.pkl file.

When I opened the pickle file, it contained a dictionary like the one below, not the timestamped structure built from vtt files that the documentation describes:

To download automatic speech subtitles, we use youtube-dl, except for LSMDC, How2QA and TVQA for which the authors provide them. We then convert the vtt files for each video from a dataset to a pickle file subtitles.pkl containing a dictionary mapping each video_id to a dictionary containing a start, end and text key, corresponding to the speech in the corresponding video_id.

There are no timestamps, only ActivityNet video IDs and sentences, and the number of videos is very small (774). For example:

{'obVMUmZQW_M': "hi today I'm going to show you how to take up some curls from your hair so what I've already done is I've used of hot rollers as in like the Connor iron shine and I've rolled it in my hair I've actually individually boost my hair so that it can actually absorb the curls that I've put in so what I used to do is take up these little things that helps for all here and I do is I follow the curl just like so okay so I'm going to take out the rest of them as you can see it actually happens so following the curl now what I love about these hot rollers is that doesn't damage your hair as much as the actual hot iron it actually also gives a lot of volume for those of us who do not have a lot of voiding my hair's like flat Oh you mean is actually quite quick and easy because you can just turn it on heat it up put the role is in hair brush your teeth and it should be done by the time time so this is what you get just really big curls and now you can do is you can run your fingers through it if you like the more messy look this is a more defined kind of look right now so I can do is just run your fingers through it just break up some of the curls to make it a bit more natural and flowy like so and there you go there you have my little relaxed look with the curls thanks for watching"}

Is there a way to open this pickle file differently?
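A quick way to tell which layout a given subtitles.pkl uses is to inspect the types of its values. The sketch below is only illustrative: the helper name `describe_subtitles` and the sample entries are hypothetical, and the timestamped layout (a dictionary with `start`, `end` and `text` keys) is assumed from the description quoted above.

```python
import io
import pickle


def describe_subtitles(subs):
    """Summarize the layout of a loaded subtitles dictionary (hypothetical helper)."""
    # Entries stored as one plain transcript string, without timestamps.
    n_plain = sum(isinstance(v, str) for v in subs.values())
    # Entries stored as a dictionary holding at least start, end and text keys.
    n_timed = sum(
        isinstance(v, dict) and {"start", "end", "text"} <= set(v)
        for v in subs.values()
    )
    return {"videos": len(subs), "plain_text": n_plain, "timestamped": n_timed}


# Made-up sample with one entry per layout (not real ActivityNet data).
sample = {
    "vid_plain": "hi today I'm going to show you ...",
    "vid_timed": {
        "start": [0.0, 2.5],
        "end": [2.5, 5.0],
        "text": ["hi today", "I'm going to show you"],
    },
}

# Round-trip through pickle in memory, standing in for open("subtitles.pkl", "rb").
buffer = io.BytesIO(pickle.dumps(sample))
loaded = pickle.load(buffer)
summary = describe_subtitles(loaded)
print(summary)
```

If every entry is counted as plain text, the file simply does not contain timestamps, and loading it differently will not recover them.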

I need ASR data for ActivityNet to fine-tune the Vid2Seq model. If possible, could you release the dataset? In addition, we are eagerly awaiting the release of the pre-trained model.

Thank you.

antoyang commented 1 year ago

Hi, indeed the ASR data released in the Just Ask and FrozenBiLM projects is provided without timestamps. It is not meant to be used directly for the Vid2Seq project, as the videos of ActivityNet-QA do not cover all videos from ActivityNet Captions anyway.

However, the ASR processing scripts from the Just Ask project may be used in the pipeline, which requires:
(i) downloading the vtt file for all videos, e.g. using youtube-dl or equivalents;
(ii) processing the ASR (by removing repetition and merging by sentences), e.g. with the Just Ask scripts;
(iii) putting this processed data into the TFRecords used by this codebase.

Note that if you want to quickly have a working model for ActivityNet Captions, you may not need ASR: see Table 2 in the Vid2Seq paper, where the results of the visual-only model are just as good on this dataset. Also note that it is normal for ActivityNet Captions videos to have little or no ASR data, as discussed in Appendix B.1 of the Vid2Seq paper.

Unfortunately, I cannot release the ASR data, but I am doing everything I can to release the model checkpoints as soon as possible.
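To make step (ii) more concrete, here is a minimal sketch of turning a vtt file into timestamped text. It is not the Just Ask script: `parse_vtt` and `to_seconds` are hypothetical names, the output layout (parallel `start`, `end` and `text` lists) is an assumption, repetition is removed only by dropping a line identical to the previously kept one, and merging by sentences and writing TFRecords are not covered.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:02.500 --> 00:00:05.000".
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)
# Matches inline markup such as <c> or <00:00:01.500>.
TAG_RE = re.compile(r"<[^>]+>")


def to_seconds(h, m, s, ms):
    """Convert hour, minute, second and millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_vtt(vtt_text):
    """Parse WebVTT text into parallel start/end/text lists (assumed layout)."""
    out = {"start": [], "end": [], "text": []}
    current = None  # (start, end) of the cue currently being read
    for line in vtt_text.splitlines():
        match = CUE_RE.search(line)
        if match:
            groups = match.groups()
            current = (to_seconds(*groups[:4]), to_seconds(*groups[4:]))
            continue
        if line == "":
            current = None  # an empty line ends the current cue
            continue
        text = TAG_RE.sub("", line).strip()
        if current is None or not text:
            continue  # header lines, cue identifiers, or markup-only lines
        if out["text"] and out["text"][-1] == text:
            continue  # rolling captions often repeat the previous line
        out["start"].append(current[0])
        out["end"].append(current[1])
        out["text"].append(text)
    return out


# Made-up example imitating rolling auto-generated captions.
sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.500
hi today I'm going to show you

00:00:02.500 --> 00:00:05.000
hi today I'm going to show you
how to take out some curls

00:00:05.000 --> 00:00:07.250
how to take out some curls
<c>from</c> your hair
"""

parsed = parse_vtt(sample)
print(parsed)
```

Each kept line inherits the timing of the cue in which it first appears, which is a simplification; a real pipeline would likely merge these fragments into sentences before writing them out.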

kimseongah commented 1 year ago

Thank you for replying :)👍