google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond
Apache License 2.0

[Vid2Seq] Reproduction of the paper results #803

Open dreamgonfly opened 1 year ago

dreamgonfly commented 1 year ago

I followed the instructions in the README to evaluate the released checkpoints, but I could not reproduce the results reported in the paper.

The paper says a fully fine-tuned Vid2Seq achieves 7.9 SODA_c, 47.1 CIDEr, and 9.3 METEOR on YouCook2, and 5.8 SODA_c, 30.1 CIDEr, and 8.5 METEOR on ActivityNet (Table 5). However, the numbers I got by re-running the code were much lower than those in the paper (around 20 CIDEr on YouCook2).

Could you share how I can reproduce the results?

Below are the key steps I followed to run the evaluation code.

First, I preprocessed data as follows:

Second, I evaluated released checkpoints as follows:

antoyang commented 1 year ago

What model/tool do you use to extract speech transcripts?

dreamgonfly commented 1 year ago

@antoyang To extract speech transcripts, I used the Whisper model (base). The ASR results seem okay, but I still cannot reproduce the results from the paper.
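
For reference, here is a minimal sketch of how Whisper output could be turned into the `asr_string`, `asr_start` and `asr_end` columns. It assumes the segments come from the `segments` list of an openai-whisper `transcribe()` result (dicts with `start` and `end` in seconds, plus `text`), and that the CSV expects integer microseconds, which is what the sample rows suggest. The helper name `segments_to_asr_columns` is hypothetical and not part of scenic.

```python
def segments_to_asr_columns(segments):
    """Convert Whisper-style segments (times in seconds) to three parallel lists.

    Assumption: the target CSV stores timestamps as integer microseconds.
    """
    asr_string, asr_start, asr_end = [], [], []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments so the three lists stay aligned
        asr_string.append(text)
        # round() rather than int() avoids off-by-one values from float truncation
        asr_start.append(int(round(seg["start"] * 1_000_000)))
        asr_end.append(int(round(seg["end"] * 1_000_000)))
    return asr_string, asr_start, asr_end


# Hand-written example segments in the shape Whisper returns.
example_segments = [
    {"start": 0.0, "end": 9.8, "text": " Welcome back once again."},
    {"start": 9.8, "end": 14.84, "text": " Click that button and subscribe."},
]
strings, starts, ends = segments_to_asr_columns(example_segments)
print(strings, starts, ends)
```

If the pipeline actually expects a different unit or sentence-level rather than segment-level splitting, the conversion would need to change accordingly.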

For YouCook2, the highest scores I could get with the released fine-tuned checkpoint were CIDEr 10.9 (47.1 in the paper), METEOR 4.3 (9.3 in the paper), and SODA_c 2.6 (7.9 in the paper).

I0618 16:51:54.288763 139967128065856 trainer.py:915] Finished gathering eval metrics for 413 samples
I0618 16:51:54.290117 139862179014400 logging_writer.py:48] [0] validation/CIDER=0.109533, validation/F1_Score=0.159727, validation/METEOR=0.0432585, validation/Precision@0.3=0.368769, validation/Precision@0.5=0.182183, validation/Precision@0.7=0.063885, validation/Precision@0.9=0.00668596, validation/Precision_Mean=0.155381, validation/Recall@0.3=0.43129, validation/Recall@0.5=0.213194, validation/Recall@0.7=0.0730633, validation/Recall@0.9=0.0075032, validation/Recall_Mean=0.181263, validation/SODA_c=0.0268965, validation/n_preds=8.00726
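
To compare the logged values with Table 5, both need to be on the same scale. The sketch below assumes the logged metrics are fractions and the paper reports them multiplied by 100, which matches the conversion used above (0.109533 reported as 10.9). `parse_metrics` is an illustrative helper, not a scenic function, and the log line is an excerpt of the one above.

```python
import re

# Excerpt of the log line above; paper numbers are those quoted from Table 5.
LOG_LINE = (
    "[0] validation/CIDER=0.109533, validation/METEOR=0.0432585, "
    "validation/Precision@0.3=0.368769, validation/SODA_c=0.0268965"
)
PAPER_YOUCOOK2 = {"CIDER": 47.1, "METEOR": 9.3, "SODA_c": 7.9}


def parse_metrics(line):
    """Extract `validation/<name>=<value>` pairs from one log line."""
    pairs = re.findall(r"validation/([\w@.]+)=([0-9.eE+-]+)", line)
    return {name: float(value) for name, value in pairs}


metrics = parse_metrics(LOG_LINE)
for name, paper_value in PAPER_YOUCOOK2.items():
    reproduced = metrics[name] * 100  # assumption: logged values are fractions
    print(f"{name}: reproduced {reproduced:.1f} vs paper {paper_value:.1f}")
```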

Below are a few sample rows from the input CSV.

"video_id","duration","caption","start","end","asr_string","asr_start","asr_end"
"fn9anlEL4FI","490300000","['add garram masala seeds and a bay leaf to the oil', 'add the lamb to the pot', 'add garlic ginger paste and chopped onions to the pot', 'add chili tumeric coriander cumin and salt', 'add water to the pot', 'add potatos to the pot', 'add the tomatos to the pot', 'add chili to the pot']","[30000000, 69000000, 136000000, 170000000, 230000000, 309000000, 383000000, 438000000]","[39000000, 86000000, 149000000, 183000000, 238000000, 333000000, 390000000, 443000000]","[""Welcome back once again to how to cook great food.com. If you haven't already, click that"", 'button and subscribe to our channel. Only make it today, you can be making a lamb and potato', ""curry or masala. As you can see I've got my pan here and in there I've got some oil"", ""that's heating up nicely. I'm using a sunflower oil, go ahead and use any oil you like."", ""We're going to drop in some whole seeds or garam masala. So here they go. We want them"", ""to roast on pop and crackle. There's a bay leaf here. I've got in there some fennel seeds,"", ""cumin seeds, green cardamom, and black mustard seeds. That's what I'm using today for this."", ""They're going to release a wonderful flavour into that oil. Now we're going to go in with our lamb."", ""We're going to fry this for about five or six minutes just with the whole garam masala."", 'Here we go. This lamb has got burning. You can use chilli if you want.', ""So let's just cook this. Let's say it's got about five or six minutes."", ""Stir it over. I'm going to kind of above medium heat. We'll just see it a little bit."", ""Then we're going to add lots of other lovely spices."", 'You can see that our meat is browning really nicely. I mean it is no any accrued.', ""That's what we've got to do now is to get this meat nice and tender."", ""What I do here is I'm going to add some garlic ginger paste. That's a 50-50 mix of garlic and ginger."", ""It's about three of these little teaspoons in there. 
I'm going to add some chopped onions."", ""I'm using a red onion but go ahead and use white."", ""Then we're going to add some powders. As always if you've watched the channel I call these the big four there equal parts of chilli, coriander, cumin and turmeric."", ""If you'd like of course you can use your favourite curry powder. We're going to add some salt at this stage."", ""Let's flip this over."", ""We're going to cook this for about now three or four minutes. Turn it constantly."", 'Again on a kind of above medium heat.', ""We've got some lovely flavours happening now."", ""Now we're going to add some water."", 'That was cold water by the way.', ""We're just covering it a little bit."", ""We're going to bring this water to the boil and then we're going to simmer this with a lid on."", 'For about 15 minutes this is the part that I hope generally works.', ""We'll tend to as I meet them make it nice and soft."", ""So let's take a look now."", 'Look at that steam out of there.', 'This is cooking down beautifully. 
As you can see look at that.', 'The needs come straight up of that bone.', ""It's certainly on its way now."", 'A pretty essential part of doing this dish is to get your meat nice and tender.', ""You're getting that with an awful tough meat."", 'Now made of what cut you use.', ""Maybe you put really expensive lamb but it would still end up being tough if you don't go for this process."", ""I'm now going to add some potatoes."", ""We've tough peeled and chopped."", 'These are fairly small.', 'You cut them however you like and the cooking process will obviously take a longer time if you put them in as much bigger.', ""So again let's give this a mix."", 'Stir them in.', ""We've still got a decent amount of moisture in there from that water."", ""If you haven't at this point maybe you've got to really dry."", 'Add a bit more water now.', ""It's going to go back on."", ""I'm going to cook this for about 78 minutes on a fairly low heat."", 'Not a simmer, above a simmer.', ""Okay let's jump in now and take a look."", ""Let's look in more like it."", 'The potatoes are cooking very nicely.', 'I kind of like my potatoes quite soft.', ""I'm now at this stage going to add some chopped tomatoes."", ""I'm just going to spread them on the top."", ""I'll put the lid back on."", ""On a fairly low heat we're going to cook them just for about five minutes."", 'What they should do is break down with the steam.', ""Don't stir them at the moment."", 'The steam will break them down.', ""We're going to mix it around once and come back."", 'We may add a little tad more water perhaps.', ""And then we're pretty much done."", 'We should be now at the final stage.', 'Yeah these are soft and really nice the as you can see.', ""And they've given off a little bit of moisture as well."", 'Just now turning it over.', ""At this stage I'm going to add some fresh chilli."", ""It's totally optional as to how much you're putting."", ""I'm putting about four or five there."", 'You now need to check this 
for salt.', ""It's all good for me."", 'You can if you want finish that off with some fresh coriander or cilantro.', ""Let's just cook that for about two more minutes and it's done."", ""It's wonderful."", ""I'm really happy with it."", ""I'll see you again soon."", 'Take care.', 'Thank you.']","[0, 9800000, 14840000, 22140000, 26800000, 33800000, 41800000, 57300000, 61800000, 68800000, 86800000, 101800000, 109800000, 120800000, 127800000, 132800000, 139800000, 146800000, 161800000, 173800000, 185800000, 196800000, 205800000, 219800000, 226800000, 235800000, 241800000, 248800000, 254800000, 259800000, 266800000, 269800000, 273800000, 278800000, 282800000, 287800000, 292800000, 296800000, 298800000, 305800000, 309800000, 314800000, 318800000, 333800000, 337800000, 346800000, 350800000, 353800000, 358800000, 361800000, 366800000, 368800000, 372800000, 375800000, 378800000, 382800000, 388800000, 391800000, 393800000, 396800000, 399800000, 401800000, 403800000, 405800000, 408800000, 411800000, 415800000, 418800000, 423800000, 435800000, 438800000, 441800000, 446800000, 453800000, 456800000, 461800000, 464800000, 465800000, 467800000, 468800000, 469800000]","[9800000, 14840000, 22140000, 26800000, 33800000, 41800000, 48800000, 61800000, 68800000, 76800000, 101800000, 108800000, 115800000, 127800000, 131800000, 139800000, 146800000, 152800000, 173800000, 185800000, 192800000, 204800000, 211800000, 223800000, 231800000, 241800000, 246800000, 254800000, 259800000, 266800000, 269800000, 273800000, 278800000, 282800000, 286800000, 292800000, 296800000, 298800000, 305800000, 309800000, 313800000, 318800000, 333800000, 337800000, 340800000, 350800000, 353800000, 358800000, 361800000, 366800000, 368800000, 372800000, 375800000, 378800000, 382800000, 387800000, 391800000, 393800000, 396800000, 399800000, 401800000, 403800000, 405800000, 408800000, 411800000, 414800000, 418800000, 423800000, 426800000, 438800000, 441800000, 446800000, 450800000, 456800000, 461800000, 464800000, 
465800000, 467800000, 468800000, 469800000, 471800000]"
"-dh_uGahzYo","561490000","['mix hanger chili powder ginger powder fennel powder and water', 'add cumin seeds green cardamom cinnamon sticks to a blender', 'heat some ghee in a pan', 'add the black cardamom to the pan', 'add the mutton to the pan', 'add the mixture', 'season with salt and cover the pot', 'add the blended spice to the pot', 'cover the pot']","[105000000, 125000000, 138000000, 146000000, 183000000, 224000000, 247000000, 334000000, 381000000]","[120000000, 132000000, 145000000, 148000000, 196000000, 230000000, 259000000, 345000000, 383000000]","['Hello, Namaste, Salamwalekum sastriya kal.', 'Welcome back to another session with your watch of at warawa.com.', 'Today I am going to show you another favorite of mine.', 'I am very surprised while I was checking the list of the dishes I did.', 'I did not make Mutton Rogen Josh.', 'Dear friends, this is one of the tastiest and super awesome dish from Kashmir.', 'You know, this is the dish what I learned from the master chefs only in five style hotels.', 'But I have seen in lot of restaurant they serve Mutton Rogen Josh.', 'They just serve the Mutton Curry and call it Rogen Josh and he does not have the punch', 'what Rogen Josh must have.', 'You know, a lot of people add onion, tomato and all this in making me Rogen Josh.', 'But what I am going to do today, I am not going to add onion or tomatoes nor even yogurt.', 'You know, if you want you can add a little bit of yogurt but I am not even going to add yogurt.', 'So for this the spice is what we are going to add is Javitri that is Mace, Cinnamon, Green Cardamom,', 'Cumin seeds, Black Cardamom and Saferan.', 'You know, I am going to make a powder of these four and add while as I black illaji or the black cardamom', 'and cook the meat with it.', 'Now here I have got one end of table spoon of chilli powder.', 'Not any chilli powder, Kashmiri chilli powder.', 'That is what will give nice red colour, ginger powder but one table spoon of final seed powder.', 'And 
we are going to add this in this quantity and that will give a very nice tasteful gravy.', 'Now to make it very simple I am going to mix all of these masalas together so you will understand.', 'So here I have got hing powder.', 'You know, hing is a must for Mutton Rogen Josh and in this add Kashmiri chilli powder.', 'Ginger powder and final seed powder.', 'And in this add water and mix this into a watery paste.', 'And now we are going to add in a blender.', 'I am going to add the cumin seeds, green cardamom, cinnamon sticks and Javitri that is Mace.', 'I am going to powder and add it.', 'So make it in a nice coarse powder.', 'You know, you are not going to cook it in the oil.', 'This Mutton Rogen Josh needs to be cooked in nice desi ghee.', 'When this desi ghee heats up we are going to add badi illachi that is black cardamom.', 'That will give a nice flavour to this dish.', 'And here I have got meat.', 'This is a nice lamb meat and all these meats have bone.', 'Nally that is the shanks of meat.', 'And take all the pieces which are like shanks.', 'And when these get cooked like this with the bone in, the gravy becomes nice and very', 'flavourful.', 'Now here the ghee is heated up and my black cardamom is nicely roasted in this add pieces', 'of this meat.', 'And we are going to cook this meat in this ghee.', 'And you have to cook in the meat becomes slightly brown.', 'That is when you get a very good flavour to the gravy.', 'You know, it is better always to fry the meat like this and then cook it on a slow', 'flame.', 'You know, now look at this meat.', 'This is nicely slightly brown and you know, this method of cooking is used not only in', 'India but throughout the world.', 'When you roast the meat like this, it is called Milad effect.', 'What it does is it caramelizes the outer coating of meat and gives a very nice flavour to', 'this dish.', 'Now this is all ready.', 'Now in this, you are going to add the mixture, the paste of the chili powder and soft', 'powder, 
fennel powder into this.', 'And you can also add little of saffron.', 'You know, this will also give a very nice flavour to this dish and pour in a lot of water', 'to cover the meat.', 'You know, because I wanted to show you, I am cooking in such a big pan.', 'Otherwise I would have taken a little smaller pan like this but you know, to make sure that', 'you see what is happening in the pan I took a vessel like this.', 'Now put the lid on and cook it on a slow flame for at least one hour to one and a half', 'hour.', 'Another easy method of, if you do not have patience to spend one and a half hour of slow cooking,', 'easy method is just pour this into a pressure cooker, cook it and again transfer it back', 'in this pan because you want this masala also to be cooked.', 'In a slow method of cooking like this, what it does is it evaporates the water because', 'we added little extra water in this, that water will be overrated and when the sauce is done,', 'it has to be liquidy but all the masala needs to be cooked.', 'And here is the masala of cumin, cardamom, cinnamon sticks.', 'And when you are cooking in an open method like this, when this is cooked for like half', 'of the time that is almost 45 minutes, then we are going to add this.', 'But if you do in the pressure cooker, you will have to add it after the meat is cooked.', 'After cooking for almost 45 minutes, now look at this gravy, this is nice, the oil is', 'also slightly floating on top and you can see this meat.', 'The lamb bones were not visible when we started but look at this.', 'Now after 45 minutes, they coming off the bone, that is when you know that the meat is', 'getting nicely cooked.', 'Now here is the masala powder of a maze that is Javitri, cinnamon, cardamom, cumin and', 'all this and then we are going to add to this.', 'This is what will give a nice flavor to your Rogen Josh.', 'Just add all of this, mix it and we are going to cook this for another 30 minutes at least.', 'Till the time the meat 
is become nice and tender, the meat should be so much cooked that', 'it should be coming off the bone and also when it is properly done, this meat will literally', 'melt in your mouth, that is when you got a perfect and a super awesome tasty Rogen Josh.', ""So dear friends, you don't need to add curd, no tomatoes, no onions."", 'Just with this masala, you will not believe how much awesome flavor this is already giving.', 'So let me put the lid on and if you need to add little water, you can keep adding little', 'water till you get the desired consistency.', 'After cooking it for almost another 30 minutes, the flavor of Rogen Josh has spread all', 'over and you can see how the Rogen means, this oil that is floating, red in color, look', 'at this.', 'That is what makes this awesome dish super to look at and tasty also and wow, you know', 'if you make it right, this will taste super fantastic.', 'It is so good.', 'Trust me, make it the way I have shown you and it will be super fantastic.', 'Dear friends, this is something magical, this is something super awesome.', 'But you use nice lamp shanks to make it and take it easy on the ginger powder and you', 'will get nice perfect Rogen Josh.', 'You know while we were in the college, they always used to add Ratanjog.', 'But when I was in the industry, they told no, no, no Ratanjog in this, the color should', 'come from the Kashmiri chillies, the flavor, the aroma should come from saffron and some', 'people also add coxcom is a kind of a flower which gives also a nice coloring.', 'Some people add that but for me, this is super perfect.', 'I am going to switch off the flame and I am going to enjoy it hot along with my non-vav.', 'Now look at this Rogen Josh.', 'Wow, what flavors along with this?', 'So much perfectly cooked and you know, especially when I cook meat like this, I want the meat', 'to be fully in my s and tanda but still retains some of the pink color.', 'As a reason why I did not add turmeric but some people add 
if you want, you can add', 'turmeric also but dear friends, wow.', 'You know, eat with basmati rice or nice mughalai, non like the one what I am eating.', 'This is super and the nice sauce is also nice sticky and super tasty.', ""Dear friends, I hope you enjoyed today's session of learning how to make this awesome"", 'Mutton Rogen Josh from Kashmir but do not forget, Vahrehvah is all about inspiring', 'others to cook.', 'So please post your recipes and cooking tips at vahrehvah.com.', 'So others can benefit from your great cooking.', 'Thank you.']","[0, 10840000, 14560000, 17440000, 20840000, 24000000, 29720000, 34519999, 38360000, 43120000, 45200000, 50200000, 56560000, 60400000, 67640000, 71040000, 78120000, 79720000, 84360000, 87360000, 93880000, 98800000, 104600000, 107000000, 114000000, 119160000, 124600000, 127560000, 134240000, 136200000, 139720000, 141920000, 145880000, 150760000, 153600000, 156800000, 161760000, 164640000, 167200000, 172920000, 173920000, 183200000, 184200000, 188320000, 192840000, 196760000, 201300000, 202300000, 204720000, 210840000, 212640000, 216520000, 221440000, 222440000, 224240000, 231680000, 234520000, 238280000, 246280000, 247280000, 252000000, 256839999, 259880000, 265399999, 266399999, 271240000, 276480000, 281080000, 285080000, 290200000, 293520000, 298280000, 302760000, 306320000, 312320000, 318640000, 322880000, 326640000, 331039999, 332560000, 340320000, 343520000, 347560000, 354120000, 359479999, 364799999, 371760000, 376640000, 383000000, 387600000, 390760000, 396800000, 405120000, 406280000, 416480000, 425760000, 426760000, 434760000, 444120000, 450560000, 454000000, 458600000, 464400000, 469760000, 475960000, 480320000, 487320000, 492760000, 503240000, 509240000, 514840000, 519480000, 522880000, 531160000, 538320000, 542840000, 547520000, 548520000, 551760000, 554000000]","[10840000, 14560000, 17440000, 20840000, 24000000, 29720000, 34519999, 38360000, 43120000, 45200000, 50200000, 56560000, 60400000, 67640000, 
71040000, 78120000, 79720000, 84240000, 87360000, 93880000, 98800000, 104600000, 107000000, 114000000, 119160000, 124600000, 127560000, 134240000, 136200000, 139000000, 141920000, 145880000, 150760000, 153600000, 156800000, 161760000, 164640000, 167200000, 172920000, 173920000, 183200000, 184200000, 188320000, 192839999, 196760000, 201300000, 202300000, 204720000, 210840000, 212640000, 216520000, 221440000, 222440000, 224240000, 231680000, 234520000, 238280000, 246280000, 247280000, 252000000, 256839999, 259880000, 265399999, 266399999, 271240000, 276480000, 281080000, 285080000, 290200000, 293520000, 298280000, 302760000, 306320000, 312320000, 318640000, 322880000, 326640000, 331039999, 332560000, 340320000, 343520000, 347560000, 354120000, 359400000, 364799999, 371760000, 376640000, 383000000, 387599999, 390760000, 396800000, 405120000, 406280000, 416480000, 425760000, 426760000, 434760000, 444120000, 450560000, 454000000, 458600000, 464400000, 469680000, 475960000, 480320000, 487320000, 492760000, 503240000, 509240000, 514840000, 519480000, 522880000, 531160000, 538320000, 542840000, 547520000, 548520000, 551760000, 554000000, 554360000]"
"BktdaTg6_E4","371900000","['mix vegetable oil salt and curry masala', 'marinate the lamb in a ziplock bag', 'season the lamb meat with salt', 'bake the lamb meat in an oven', 'blend garlic ginger cherry and onion and water', 'heat some clarified butter in a pan', 'add chopped onion and salt and saute', 'mix some cumin cinnamon black pepper and paprika', 'add the mixed spices the mixture and the lamb in']","[30000000, 62000000, 88000000, 91000000, 99000000, 123000000, 134000000, 156000000, 183000000]","[57000000, 75000000, 90000000, 98000000, 118000000, 133000000, 155000000, 172000000, 252000000]","['Hello, this is Chef John from Foodwishes.com with Lamb Shank Vindaloo.', ""That's right, I get a lot of complaints."", 'How come you never do Indian food?', ""It's because I'm scared."", ""I don't have a lot of experience with it."", 'I love to eat it.', 'But I thought I would give this one of my favorites to try.', 'This very spicy lamb type curry dish.', 'So I hope I got it close.', 'You Indian cuisine experts will be the judge.', 'So here we go.', ""So step one here, I'm going to put four lamb shanks in a plastic bag."", 'You need to get marinated overnight before we start the dish.', ""So I'm going to place those in."", ""And then into a bowl, I'm going to pour some cider vinegar, some vegetable oil, some salt,"", 'and then something called tamarind.', ""I'm using a tamarind concentrate."", ""And we'll talk a little bit about that on the blog."", ""But it's a very tart, sour kind of citrus-like ingredient."", 'All right, I started mixing that up and then I realized I never put the garmasala in,', 'which is a blend of Indian spices.', ""We've used that before."", 'We like it.', ""All right, so I'm going to mix that in and that's basically the marinade."", ""So we're going to pour that over the lamb shanks."", ""We're going to seal up that bag really well."", 'All right, just to confuse you, I put mine in a second bag as I thought I had a leak.', ""We're going to 
squeeze out as much air as possible so the meat is immersed in the marinade."", ""And then we're going to put that in the fridge overnight."", 'Not a bad idea to turn it over once in a while.', ""All right, the next day I'm going to pull it out of the bag."", ""I'm going to place it on an oiled foil lined sheet pan."", ""Don't throw away the marinade, by the way."", ""That's going in the stew later."", 'So just reserve the marinade.', ""I'm going to salt those generously on both sides."", ""And we're going to brown those in a very hot oven for 50, for 15 or 20 minutes until"", ""they're nice and brown."", ""We're going to pull those out and reserve them till needed."", ""Next up in a blender, we're going to add a lot of garlic, a lot of ginger, some cherry"", 'tomatoes, a nice big onion, and a little bit of water.', ""We're going to pulse that on and off until we have a nice smooth puree."", 'And it kind of looks like a delicious strawberry smoothie.', ""And yet it's so the opposite of that."", 'So just set that aside.', ""And it's back over to this stove where we're going to start the actual vindaloo."", ""So we're going to put a heavy Dutch oven on medium high heat."", ""And I'm going to put in some clarified butter."", 'Now this is supposed to be something called ghee, which is basically a clarified butter.', 'But clarified butter will work.', ""All right, so I'm going to put my butter in."", ""I'm going to throw a roughly chopped onion in there with a big pinch of salt."", ""And we're going to brown this."", ""And I'm not talking golden brown."", ""I'm talking almost golden black."", ""That's going to add sweetness and a depth of color to the sauce."", 'So just keep cooking them.', ""And right there you're thinking, that's probably good."", ""It's not."", 'Let them go further.', 'OK?', ""While those are browning, I'm going to get my spice blend together, which is cumin, cinnamon,"", 'black pepper, cayenne, and a lot of it, dry mustard, and paprika.', 'OK?', 
'And all that will be on the blog, of course.', ""All right, we're going to go back over the stove, check the onions, and now we're talking."", ""That's what we want."", 'Nicely browned, very dark edges.', 'Perfect.', ""And at that point, we're going to back the heat down to medium and dump in the spices."", ""And we're going to kind of toast the spices in that hot butter."", 'And that really wakes up the flavor, and it really, really adds an extra dimension, which', ""I guess would be the fifth dimension, but who's keeping track?"", 'Not only needs to cook for about two minutes, but it really does make a difference.', ""All right, after that, we're going to go ahead and dump in the marinade that was left"", 'over from the bag of lamb.', 'All right, remember that was the cider and the tamarind and the oil.', ""All right, so I'm going to dump that in."", ""And then we're going to dump in the mixture from the blender, the onion, the tomato, the"", 'ginger, the garlic.', ""We're going to give that a stir."", ""We're going to raise the heat up to high."", 'We want to bring this up to a simmer.', ""And before we put the lamb back in, we're going to go ahead and add a little bit of brown"", 'sugar.', 'Just to balance out that acidity and heat, all right, so stir that in.', 'And then we can place our lamb back in.', ""And if you're using a similar sized pot, you should have enough liquid to just almost"", 'come up to the top.', ""It doesn't have to be totally covered."", ""This is going to stew for three hours, and we're going to turn these several times."", ""So as long as you have that much liquid, you're okay."", ""If you need that, another splash of water, don't be afraid."", ""Don't forget you can always reduce sauces later."", 'So once the lamb goes in, I want you to turn the heat down to low.', 'I want you to cover it tightly, and I want you to simmer that very slowly on very low', 'heat for about three hours, all right, not a bad idea to turn it over once in a 
while,', ""and all you're trying to do, and why there's no way to screw up the cooking part of this."", ""You're just going to simmer it until the meat's tender."", 'See how that fork goes right into that meat?', ""That's done, all right, so like I said, it's going to take about three hours, but don't"", 'quote me on it.', 'Could take two and a half, could take four, plan accordingly.', ""All right, at that point, I'm going to go ahead and remove the lamb from the pot."", 'You can just cover it with foil while we finish the sauce, and finishing the sauce means', 'two things, the old skim and season.', ""So we're going to turn the heat up a little bit."", 'We want to bring this back to a simmer, and we need to skim off all that fat.', ""There's a ton of it."", 'Just take your ladle and skim all the fat off before you serve it, all right, and besides', 'deep fatifying the top, the other thing you should do is taste for seasoning.', ""Although I highly doubt you're going to have to do much adjusting."", 'But you know what, check just in case, maybe add a little salt.', ""And that's it, go ahead and throw your lamb shank on a plate."", ""I'm serving mine next to some lentils and rice."", ""I'm going to spoon over that incredible sauce."", ""I'm going to garnish with some whole cilantro leaves."", 'And there you go, authenticity, notwithstanding, this was a super delicious, incredibly tasty,', 'very spicy, exciting dinner.', 'Really forked tender, should just fall right off the bone, and just a very complex flavor.', 'Spicy, sweet, sour, aromatic, and that beautiful, subtly gamey lamb, just the absolute perfect', 'meat for this.', 'So I really, really hope you give this a try.', 'Head over to FoodWishes.com for all the ingredient amounts and more info, as usual.', 'And as always, enjoy.']","[0, 6000000, 7360000, 9280000, 10280000, 11400000, 12400000, 14400000, 17600000, 19120000, 21440000, 22440000, 25920000, 28840000, 30520000, 36760000, 38320000, 40240000, 43080000, 
45879999, 50640000, 53519999, 54519999, 55519999, 57800000, 60320000, 62519999, 66519999, 70399999, 72880000, 75039999, 78000000, 81600000, 83120000, 84360000, 86000000, 88720000, 94200000, 95600000, 98400000, 102960000, 107800000, 111840000, 115440000, 117440000, 118880000, 122360000, 126640000, 128960000, 133120000, 134400000, 136079999, 140000000, 141440000, 143440000, 146160000, 149400000, 150680000, 152560000, 153640000, 154640000, 155640000, 162680000, 168960000, 169960000, 171960000, 176080000, 177600000, 179920000, 180920000, 184800000, 188080000, 193240000, 196760000, 200440000, 204079999, 205600000, 209239999, 211679999, 214840000, 215840000, 217920000, 219480000, 221360000, 224399999, 225480000, 229280000, 232320000, 237000000, 238400000, 240000000, 243720000, 246480000, 249480000, 252200000, 255519999, 261320000, 266760000, 271760000, 274560000, 276440000, 280159999, 281160000, 284440000, 288520000, 292160000, 294720000, 296000000, 300120000, 301240000, 305960000, 310080000, 312359999, 315840000, 318560000, 321159999, 324159999, 326960000, 333560000, 336480000, 341760000, 349160000, 350160000, 352360000, 356840000]","[6000000, 7360000, 9280000, 10280000, 11400000, 12400000, 14400000, 17600000, 19120000, 21440000, 22440000, 25920000, 28840000, 30520000, 36760000, 38320000, 40240000, 43080000, 45879999, 50640000, 53519999, 54519999, 55519999, 57800000, 60320000, 62519999, 66519999, 70399999, 72880000, 75039999, 78000000, 81600000, 83120000, 84360000, 86000000, 88720000, 94200000, 95600000, 98400000, 102960000, 107800000, 111840000, 115440000, 117440000, 118880000, 122360000, 126640000, 128960000, 133120000, 134400000, 136079999, 140000000, 141440000, 143440000, 146160000, 149400000, 150680000, 152560000, 153640000, 154640000, 155640000, 162680000, 168960000, 169960000, 171960000, 176080000, 177600000, 179920000, 180920000, 184800000, 188080000, 193240000, 196760000, 200440000, 204079999, 205600000, 209239999, 211679999, 214840000, 215840000, 217920000, 
219480000, 221360000, 224399999, 225399999, 229280000, 232320000, 237000000, 238400000, 240000000, 243720000, 246480000, 249480000, 252200000, 255519999, 261320000, 266760000, 271760000, 274560000, 276440000, 280159999, 281159999, 284440000, 288520000, 292160000, 294720000, 296000000, 300120000, 301240000, 305960000, 310080000, 312359999, 315840000, 318560000, 321159999, 324159999, 326960000, 333560000, 336479999, 341760000, 349160000, 350160000, 352360000, 356840000, 358720000]"
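
Before feeding rows like these to the preprocessing script, it can be worth checking that they are internally consistent. The sketch below is not part of scenic. It assumes the column layout shown in the header above, that the list columns are Python list literals, and that all timestamps share the unit of `duration`. The `check_row` helper and the `demo` row are hypothetical.

```python
import ast
import csv
import io

LIST_COLUMNS = ("caption", "start", "end", "asr_string", "asr_start", "asr_end")

# Hypothetical two-event row in the same layout as the samples above.
SAMPLE_CSV = '''"video_id","duration","caption","start","end","asr_string","asr_start","asr_end"
"demo","20000000","['add oil', 'add lamb']","[1000000, 9000000]","[4000000, 12000000]","['Hello.', 'We add oil.']","[0, 2000000]","[2000000, 5000000]"
'''


def check_row(row):
    """Return a list of human-readable problems found in one CSV row."""
    problems = []
    parsed = {col: ast.literal_eval(row[col]) for col in LIST_COLUMNS}
    duration = int(row["duration"])
    if not len(parsed["caption"]) == len(parsed["start"]) == len(parsed["end"]):
        problems.append("caption/start/end lengths differ")
    if not len(parsed["asr_string"]) == len(parsed["asr_start"]) == len(parsed["asr_end"]):
        problems.append("asr_string/asr_start/asr_end lengths differ")
    for label, s_col, e_col in (("caption", "start", "end"), ("asr", "asr_start", "asr_end")):
        for s, e in zip(parsed[s_col], parsed[e_col]):
            if not 0 <= s <= e <= duration:
                problems.append(
                    f"{label} interval ({s}, {e}) is reversed or outside [0, {duration}]"
                )
    return problems


for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(row["video_id"], check_row(row) or "ok")
```

An empty problem list only shows the row is self-consistent; it does not confirm that the unit or the ASR segmentation matches what the released checkpoints were trained on.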

CLIP ViT-L/14 features (224px input, 1 FPS) are extracted using the FrozenBiLM repo and provided as separate files when running scenic.projects.vid2seq.generate_from_file.

Below are the resulting predictions from the above sample inputs.

{"key": "fn9anlEL4FI", "pred": ["Add some garlic ginger paste some chopped onions and some salt to it.", "Turn it constantly and cook it for 3-4 minutes with a lid covered.", "Add some water to it and bring it to a boil and simmer.", "Add some water to it and simmer and let it cook for 15 minutes.", "Add some chopped potatoes and cook it for 15 minutes.", "Add a little bit more water and let it cook for 78 minutes on a low heat.", "Jump in and take a look at it.", "Put the lid on and cook for 78 minutes on a low heat.", "Add some garam masala to it and stir it over.", "Add some more salt if you like.", "Turn it constantly and cook it for about 5-6 minutes.", "Now, let's take a look at it again.", "Add some whole seeds to it.", "Turn it constantly and cook it for 5-6 minutes. In fact, let's take a look at it again. In fact, let's take a look at"], "gts": ["add garram masala seeds and a bay leaf to the oil", "add the lamb to the pot", "add garlic ginger paste and chopped onions to the pot", "add chili tumeric coriander cumin and salt", "add water to the pot", "add potatos to the pot", "add the tomatos to the pot", "add chili to the pot"], "pred_timestamps": [[138670, 148575], [148575, 178290], [183243, 203053], [208006, 262483], [287246, 341724], [341724, 361534], [371439, 386296], [391249, 401154], [401154, 411059], [411059, 416012], [420964, 425917], [430869, 445727], [445727, 450679], [450679, 460584]], "gts_timestamps": [[30000, 39000], [69000, 86000], [136000, 149000], [170000, 183000], [230000, 238000], [309000, 333000], [383000, 390000], [438000, 443000]]}
{"key": "-dh_uGahzYo", "pred": ["This is roasted in desi ghee.", "Take all the pieces of the meat and cook them in the ghee.", "When the ghee heats up.", "Add badi illachi.", "Add black cardamom powder fennel powder and saffron to the meat.", "When the ghee heats up.", "Add the paste of chili powder fennel powder and saffron to the meat.", "When the ghee heats up add badi illachi to the meat and cook it on a slow flame.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello.", "Hello."], "gts": ["mix hanger chili powder ginger powder fennel powder and water", "add cumin seeds green cardamom cinnamon sticks to a blender", "heat some ghee in a pan", "add the black cardamom to the pan", "add the mutton to the pan", "add the mixture", "season with salt and cover the pot", "add the blended spice to the pot", "cover the pot"], "pred_timestamps": [[158805, 164476], [164476, 170148], [175820, 187163], [187163, 192834], [192834, 215521], [221193, 232536], [232536, 249551], [249551, 272237], [272237, 277909], [277909, 289252], [289252, 294924], [294924, 300595], [300595, 306267], [306267, 311938], [311938, 317610], [317610, 323282], [323282, 328953], [328953, 334625], [334625, 340296], [340296, 345968], [345968, 351640], [351640, 357311], [357311, 362983]], "gts_timestamps": [[105000, 120000], [125000, 132000], [138000, 145000], [146000, 148000], [183000, 196000], [224000, 230000], [247000, 259000], [334000, 345000], [381000, 383000]]}
{"key": "BktdaTg6_E4", "pred": ["Add some garlic ginger cherry tomatoes onion and water to a blender.", "Pulse it on and off until a smooth puree is formed.", "Put the lamb shanks in the oven.", "Put some clarified butter in a bowl and add roughly chopped garlic ginger cherry tomatoes onion and a little bit of water.", "Pulse it on and off until a smooth puree is formed.", "Now, let's get into the spice blend.", "Put four lamb shanks in a plastic bag and put some cider vinegar vegetable oil salt and tamarind concentrate in a bowl.", "Add some cumin cinnamon black pepper cayenne dry mustard and paprika. So let's get into the spice blend.", "First of all, let's add cumin cinnamon black pepper cayenne dry mustard and paprika.", "So let's get into the spice blend.", "First of all, let's add cumin cinnamon black pepper cayenne dry mustard and paprika. So let's get into the spice blend. So let's add cumin cinnamon black pepper cayenne"], "gts": ["mix vegetable oil salt and curry masala", "marinate the lamb in a ziplock bag", "season the lamb meat with salt", "bake the lamb meat in an oven", "blend garlic ginger cherry and onion and water", "heat some clarified butter in a pan", "add chopped onion and salt and saute", "mix some cumin cinnamon black pepper and paprika", "add the mixed spices the mixture and the lamb in"], "pred_timestamps": [[105183, 108940], [108940, 116453], [123966, 127723], [127723, 135236], [138992, 142749], [142749, 154019], [157775, 180315], [180315, 184071], [195341, 199097], [199097, 206611], [206611, 214124]], "gts_timestamps": [[30000, 57000], [62000, 75000], [88000, 90000], [91000, 98000], [99000, 118000], [123000, 133000], [134000, 155000], [156000, 172000], [183000, 252000]]}

In the second prediction, "Hello." is repeated over and over. Maybe this weird behavior is what degraded the performance, but I'm not sure how to resolve it.
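
To check how widespread this is across the eval set, one option is to measure, per video, what share of the predictions is taken up by the single most frequent caption. A minimal sketch, assuming one JSON object per line with the `key` and `pred` fields shown above; the function name `repetition_stats` is made up for illustration and is not part of scenic:

```python
import json
from collections import Counter


def repetition_stats(jsonl_line):
    """Return (key, most frequent caption, its share of all predictions)."""
    record = json.loads(jsonl_line)
    preds = record["pred"]
    if not preds:
        return record["key"], None, 0.0
    caption, count = Counter(preds).most_common(1)[0]
    return record["key"], caption, count / len(preds)


# Toy record in the same shape as the predictions above (values shortened).
line = json.dumps({"key": "demo", "pred": ["Add salt.", "Hello.", "Hello.", "Hello."]})
print(repetition_stats(line))
```

A high share on many videos would suggest the repetition is systematic rather than limited to a few samples.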

Here are the predicted token ids for the same samples before decoding.

[32128 32133   101    31    60   352    12  2328    16   128   829  7299    42  5260   265  9358   521     5 32134 32140   101    31    60   352    12 22445     8 17871     5 32141 32147 13522    34   147     5 32150 32154  2334   128  9119 15698 11388     5 32155 32158  2334 18510 13211     5 32158 32161  2334   128 19245  4926 31073    11  3136     5 32163 32169 21599    34   147    11  3989    21    81   220  4278   676     5 32173  2334   128   387     5 32176 32180 10267   128  2107   387    11 19633    28     3     9 12533    30     5 32183 32186  2321     3     9   320    44    34     5 32188 32191  2334   128 18510 11076     5 32192 32195 13522   135    16     5 32199 32201  5306     8 12533   223    30    11  3989    21     3  3940   676     5 32202 32203  3521    46  1580    91     5     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0]
[32128 32138   101    31    60   352    12   617     3   354  2960  1788 11486    15 15698  4926   804  6677  4926    11   387    12     8     3    51 12499    11  3989     8  3604    28    34     5 32139 32143  2334     3   107    53  4926     3  5543    15   152 19245  4926 15698  4926    11   804  6677  4926    12     3     9 18942    11  2153    34   139     3     9   387    63 11388     5 32144 32150  2334     3   107    53  4926  1216    77  7299  1442   895   265    32    51 18684  4372    11     3   354  2960  1788 11486    15    12     8 18942    11  2153    34   139     3     9 27978  4926     5 32151 32158  2334  1001   895   265    32    51    12     8     3   122    88    15    11  3989     8  3604  5665    16     8     3   122    88    15     5 32167  2334     8 11388    13 19245  4926     3    89  5990    40  4926     3     7  4127    52   106  4926    11     3     7  4127    52   106    12     8     3   122    88    15     5 32169  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32170  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32171  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32172  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32173  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32173  2334     3     7  4127    52   106    12     8     3   122    88    15     5 32173  2334     3     7  4127    52   106    12     8     3   122    88    15]
[32128 32134  5306     8 17871  6660  5979     7    16     3     9  2343  2182     5 32134 32143  1474 23119 15292 12065  1043  3136    11     3 22713 13119 11345   147     8 17871    11  7042    34   168     5 32144 32148 17039     8 17871    91    13     8  2182    11  4216    34    16     8  4836     5 32153 32159  2334  9119 15698 15665 11395 12909    11   387    12     3     9 18942    11  4764     5 32161 32169  5306     8  4194 12909    11  3136    16    11  3989   552    34  5050  7069  4216     5 32170 32172  2334  1216    77 18684  1001  5270   212    63  5990  2192 23756    11     3 16281  9629     9     5 32173  2334     3 16281  9629     9    12     8  3837     5 32173  2334     3 16281  9629     9    12     8  3837     5 32173  2334     3 16281  9629     9    12     8  3837     5 32173  2334     3 16281  9629     9    12     8  3837     5 32173  2334     3 16281  9629     9    12     8  3837     5 32173 32173  2334     3 16281  9629     9    12     8  3837     5     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0]

In the config, I set num_training_epochs=0 to run in evaluation-only mode. I renamed the fine-tuned checkpoint from 'youcook-2' to 'checkpoint_200000' so that the config would pick it up for evaluation. This was a little hacky, but the checkpoint loaded properly this way.

I used the tokenizer downloaded from gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model. One caveat is that the model sometimes outputs token ids in the range 32000 to 32127, which the tokenizer cannot handle properly. I manually excluded tokens in that range when decoding.
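
The filtering step I applied before decoding looks roughly like the sketch below. The padding id 0 and end-of-sequence id 1 are assumptions based on the usual T5 convention (they match the trailing `1 0 0 0 ...` in the dumps above), and `clean_ids` is just a helper name for this post. Ids at or above 32128 are passed through untouched; how they are mapped to timestamps is outside this sketch.

```python
PAD_ID = 0  # assumption: T5-style padding id
EOS_ID = 1  # assumption: T5-style end-of-sequence id
UNHANDLED_LO, UNHANDLED_HI = 32000, 32127  # range the tokenizer could not decode


def clean_ids(token_ids):
    """Drop padding and unhandled ids, and stop at the first end-of-sequence id."""
    kept = []
    for tok in token_ids:
        tok = int(tok)  # also accepts NumPy integer scalars
        if tok == EOS_ID:
            break
        if tok == PAD_ID or UNHANDLED_LO <= tok <= UNHANDLED_HI:
            continue
        kept.append(tok)
    return kept


print(clean_ids([32128, 32133, 101, 32050, 5, 1, 0, 0]))
```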

Apart from the changes above, eval_batch_size, and the data paths, I left the provided config untouched.

Could you suggest where I might have gone wrong? What should I do to reproduce the results properly?

antoyang commented 1 year ago

Whisper is indeed a good ASR model. Did you also apply a sentence segmentation tool to the ASR output? I am not sure how robust the trained checkpoints (which were trained using Google ASR) are to a change in ASR data, but I would not expect different ASR data to cause such a large discrepancy. The repetition issue can be reduced by increasing the length penalty parameter, but I also don't think tuning it would account for such a large gap.
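
To illustrate what sentence-level regrouping of timestamped ASR segments could look like, here is a hypothetical sketch. It is not the preprocessing used for the released checkpoints; the function name, the punctuation rule, and the toy timestamps are all assumptions. It joins consecutive segments until one ends with sentence-final punctuation, keeping the start of the first segment and the end of the last.

```python
def merge_into_sentences(texts, starts, ends, terminators=".?!"):
    """Merge consecutive ASR segments until one ends with sentence-final punctuation."""
    out_texts, out_starts, out_ends = [], [], []
    buffer, buffer_start, last_end = [], None, None
    for text, start, end in zip(texts, starts, ends):
        if buffer_start is None:
            buffer_start = start
        buffer.append(text.strip())
        last_end = end
        if text.rstrip().endswith(tuple(terminators)):
            out_texts.append(" ".join(buffer))
            out_starts.append(buffer_start)
            out_ends.append(end)
            buffer, buffer_start = [], None
    if buffer:  # flush a trailing fragment that never reached punctuation
        out_texts.append(" ".join(buffer))
        out_starts.append(buffer_start)
        out_ends.append(last_end)
    return out_texts, out_starts, out_ends


# Toy segments (timestamps in microseconds, values made up for the example).
texts = ["it does not have the punch", "what it must have.", "Now this is ready"]
print(merge_into_sentences(texts, [0, 4000000, 9000000], [4000000, 6000000, 11000000]))
```

Note that this only merges at segment boundaries; a segment that already contains several sentences is left as it is.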

dreamgonfly commented 1 year ago

@antoyang Could you elaborate on the sentence segmentation tool? I used the Whisper ASR output as is, as in the sample input below. Each sentence has associated start and end timestamps.

"video_id","duration","caption","start","end","asr_string","asr_start","asr_end"
"fn9anlEL4FI","490300000","['add garram masala seeds and a bay leaf to the oil', 'add the lamb to the pot', 'add garlic ginger paste and chopped onions to the pot', 'add chili tumeric coriander cumin and salt', 'add water to the pot', 'add potatos to the pot', 'add the tomatos to the pot', 'add chili to the pot']","[30000000, 69000000, 136000000, 170000000, 230000000, 309000000, 383000000, 438000000]","[39000000, 86000000, 149000000, 183000000, 238000000, 333000000, 390000000, 443000000]","[""Welcome back once again to how to cook great food.com. If you haven't already, click that"", 'button and subscribe to our channel. Only make it today, you can be making a lamb and potato', ""curry or masala. As you can see I've got my pan here and in there I've got some oil"", ""that's heating up nicely. I'm using a sunflower oil, go ahead and use any oil you like."", ""We're going to drop in some whole seeds or garam masala. So here they go. We want them"", ""to roast on pop and crackle. There's a bay leaf here. I've got in there some fennel seeds,"", ""cumin seeds, green cardamom, and black mustard seeds. That's what I'm using today for this."", ""They're going to release a wonderful flavour into that oil. Now we're going to go in with our lamb."", ""We're going to fry this for about five or six minutes just with the whole garam masala."", 'Here we go. This lamb has got burning. You can use chilli if you want.', ""So let's just cook this. Let's say it's got about five or six minutes."", ""Stir it over. I'm going to kind of above medium heat. We'll just see it a little bit."", ""Then we're going to add lots of other lovely spices."", 'You can see that our meat is browning really nicely. I mean it is no any accrued.', ""That's what we've got to do now is to get this meat nice and tender."", ""What I do here is I'm going to add some garlic ginger paste. That's a 50-50 mix of garlic and ginger."", ""It's about three of these little teaspoons in there. 
I'm going to add some chopped onions."", ""I'm using a red onion but go ahead and use white."", ""Then we're going to add some powders. As always if you've watched the channel I call these the big four there equal parts of chilli, coriander, cumin and turmeric."", ""If you'd like of course you can use your favourite curry powder. We're going to add some salt at this stage."", ""Let's flip this over."", ""We're going to cook this for about now three or four minutes. Turn it constantly."", 'Again on a kind of above medium heat.', ""We've got some lovely flavours happening now."", ""Now we're going to add some water."", 'That was cold water by the way.', ""We're just covering it a little bit."", ""We're going to bring this water to the boil and then we're going to simmer this with a lid on."", 'For about 15 minutes this is the part that I hope generally works.', ""We'll tend to as I meet them make it nice and soft."", ""So let's take a look now."", 'Look at that steam out of there.', 'This is cooking down beautifully. 
As you can see look at that.', 'The needs come straight up of that bone.', ""It's certainly on its way now."", 'A pretty essential part of doing this dish is to get your meat nice and tender.', ""You're getting that with an awful tough meat."", 'Now made of what cut you use.', ""Maybe you put really expensive lamb but it would still end up being tough if you don't go for this process."", ""I'm now going to add some potatoes."", ""We've tough peeled and chopped."", 'These are fairly small.', 'You cut them however you like and the cooking process will obviously take a longer time if you put them in as much bigger.', ""So again let's give this a mix."", 'Stir them in.', ""We've still got a decent amount of moisture in there from that water."", ""If you haven't at this point maybe you've got to really dry."", 'Add a bit more water now.', ""It's going to go back on."", ""I'm going to cook this for about 78 minutes on a fairly low heat."", 'Not a simmer, above a simmer.', ""Okay let's jump in now and take a look."", ""Let's look in more like it."", 'The potatoes are cooking very nicely.', 'I kind of like my potatoes quite soft.', ""I'm now at this stage going to add some chopped tomatoes."", ""I'm just going to spread them on the top."", ""I'll put the lid back on."", ""On a fairly low heat we're going to cook them just for about five minutes."", 'What they should do is break down with the steam.', ""Don't stir them at the moment."", 'The steam will break them down.', ""We're going to mix it around once and come back."", 'We may add a little tad more water perhaps.', ""And then we're pretty much done."", 'We should be now at the final stage.', 'Yeah these are soft and really nice the as you can see.', ""And they've given off a little bit of moisture as well."", 'Just now turning it over.', ""At this stage I'm going to add some fresh chilli."", ""It's totally optional as to how much you're putting."", ""I'm putting about four or five there."", 'You now need to check this 
for salt.', ""It's all good for me."", 'You can if you want finish that off with some fresh coriander or cilantro.', ""Let's just cook that for about two more minutes and it's done."", ""It's wonderful."", ""I'm really happy with it."", ""I'll see you again soon."", 'Take care.', 'Thank you.']","[0, 9800000, 14840000, 22140000, 26800000, 33800000, 41800000, 57300000, 61800000, 68800000, 86800000, 101800000, 109800000, 120800000, 127800000, 132800000, 139800000, 146800000, 161800000, 173800000, 185800000, 196800000, 205800000, 219800000, 226800000, 235800000, 241800000, 248800000, 254800000, 259800000, 266800000, 269800000, 273800000, 278800000, 282800000, 287800000, 292800000, 296800000, 298800000, 305800000, 309800000, 314800000, 318800000, 333800000, 337800000, 346800000, 350800000, 353800000, 358800000, 361800000, 366800000, 368800000, 372800000, 375800000, 378800000, 382800000, 388800000, 391800000, 393800000, 396800000, 399800000, 401800000, 403800000, 405800000, 408800000, 411800000, 415800000, 418800000, 423800000, 435800000, 438800000, 441800000, 446800000, 453800000, 456800000, 461800000, 464800000, 465800000, 467800000, 468800000, 469800000]","[9800000, 14840000, 22140000, 26800000, 33800000, 41800000, 48800000, 61800000, 68800000, 76800000, 101800000, 108800000, 115800000, 127800000, 131800000, 139800000, 146800000, 152800000, 173800000, 185800000, 192800000, 204800000, 211800000, 223800000, 231800000, 241800000, 246800000, 254800000, 259800000, 266800000, 269800000, 273800000, 278800000, 282800000, 286800000, 292800000, 296800000, 298800000, 305800000, 309800000, 313800000, 318800000, 333800000, 337800000, 340800000, 350800000, 353800000, 358800000, 361800000, 366800000, 368800000, 372800000, 375800000, 378800000, 382800000, 387800000, 391800000, 393800000, 396800000, 399800000, 401800000, 403800000, 405800000, 408800000, 411800000, 414800000, 418800000, 423800000, 426800000, 438800000, 441800000, 446800000, 450800000, 456800000, 461800000, 464800000, 
465800000, 467800000, 468800000, 469800000, 471800000]"
"-dh_uGahzYo","561490000","['mix hanger chili powder ginger powder fennel powder and water', 'add cumin seeds green cardamom cinnamon sticks to a blender', 'heat some ghee in a pan', 'add the black cardamom to the pan', 'add the mutton to the pan', 'add the mixture', 'season with salt and cover the pot', 'add the blended spice to the pot', 'cover the pot']","[105000000, 125000000, 138000000, 146000000, 183000000, 224000000, 247000000, 334000000, 381000000]","[120000000, 132000000, 145000000, 148000000, 196000000, 230000000, 259000000, 345000000, 383000000]","['Hello, Namaste, Salamwalekum sastriya kal.', 'Welcome back to another session with your watch of at warawa.com.', 'Today I am going to show you another favorite of mine.', 'I am very surprised while I was checking the list of the dishes I did.', 'I did not make Mutton Rogen Josh.', 'Dear friends, this is one of the tastiest and super awesome dish from Kashmir.', 'You know, this is the dish what I learned from the master chefs only in five style hotels.', 'But I have seen in lot of restaurant they serve Mutton Rogen Josh.', 'They just serve the Mutton Curry and call it Rogen Josh and he does not have the punch', 'what Rogen Josh must have.', 'You know, a lot of people add onion, tomato and all this in making me Rogen Josh.', 'But what I am going to do today, I am not going to add onion or tomatoes nor even yogurt.', 'You know, if you want you can add a little bit of yogurt but I am not even going to add yogurt.', 'So for this the spice is what we are going to add is Javitri that is Mace, Cinnamon, Green Cardamom,', 'Cumin seeds, Black Cardamom and Saferan.', 'You know, I am going to make a powder of these four and add while as I black illaji or the black cardamom', 'and cook the meat with it.', 'Now here I have got one end of table spoon of chilli powder.', 'Not any chilli powder, Kashmiri chilli powder.', 'That is what will give nice red colour, ginger powder but one table spoon of final seed powder.', 'And 
we are going to add this in this quantity and that will give a very nice tasteful gravy.', 'Now to make it very simple I am going to mix all of these masalas together so you will understand.', 'So here I have got hing powder.', 'You know, hing is a must for Mutton Rogen Josh and in this add Kashmiri chilli powder.', 'Ginger powder and final seed powder.', 'And in this add water and mix this into a watery paste.', 'And now we are going to add in a blender.', 'I am going to add the cumin seeds, green cardamom, cinnamon sticks and Javitri that is Mace.', 'I am going to powder and add it.', 'So make it in a nice coarse powder.', 'You know, you are not going to cook it in the oil.', 'This Mutton Rogen Josh needs to be cooked in nice desi ghee.', 'When this desi ghee heats up we are going to add badi illachi that is black cardamom.', 'That will give a nice flavour to this dish.', 'And here I have got meat.', 'This is a nice lamb meat and all these meats have bone.', 'Nally that is the shanks of meat.', 'And take all the pieces which are like shanks.', 'And when these get cooked like this with the bone in, the gravy becomes nice and very', 'flavourful.', 'Now here the ghee is heated up and my black cardamom is nicely roasted in this add pieces', 'of this meat.', 'And we are going to cook this meat in this ghee.', 'And you have to cook in the meat becomes slightly brown.', 'That is when you get a very good flavour to the gravy.', 'You know, it is better always to fry the meat like this and then cook it on a slow', 'flame.', 'You know, now look at this meat.', 'This is nicely slightly brown and you know, this method of cooking is used not only in', 'India but throughout the world.', 'When you roast the meat like this, it is called Milad effect.', 'What it does is it caramelizes the outer coating of meat and gives a very nice flavour to', 'this dish.', 'Now this is all ready.', 'Now in this, you are going to add the mixture, the paste of the chili powder and soft', 'powder, 
fennel powder into this.', 'And you can also add little of saffron.', 'You know, this will also give a very nice flavour to this dish and pour in a lot of water', 'to cover the meat.', 'You know, because I wanted to show you, I am cooking in such a big pan.', 'Otherwise I would have taken a little smaller pan like this but you know, to make sure that', 'you see what is happening in the pan I took a vessel like this.', 'Now put the lid on and cook it on a slow flame for at least one hour to one and a half', 'hour.', 'Another easy method of, if you do not have patience to spend one and a half hour of slow cooking,', 'easy method is just pour this into a pressure cooker, cook it and again transfer it back', 'in this pan because you want this masala also to be cooked.', 'In a slow method of cooking like this, what it does is it evaporates the water because', 'we added little extra water in this, that water will be overrated and when the sauce is done,', 'it has to be liquidy but all the masala needs to be cooked.', 'And here is the masala of cumin, cardamom, cinnamon sticks.', 'And when you are cooking in an open method like this, when this is cooked for like half', 'of the time that is almost 45 minutes, then we are going to add this.', 'But if you do in the pressure cooker, you will have to add it after the meat is cooked.', 'After cooking for almost 45 minutes, now look at this gravy, this is nice, the oil is', 'also slightly floating on top and you can see this meat.', 'The lamb bones were not visible when we started but look at this.', 'Now after 45 minutes, they coming off the bone, that is when you know that the meat is', 'getting nicely cooked.', 'Now here is the masala powder of a maze that is Javitri, cinnamon, cardamom, cumin and', 'all this and then we are going to add to this.', 'This is what will give a nice flavor to your Rogen Josh.', 'Just add all of this, mix it and we are going to cook this for another 30 minutes at least.', 'Till the time the meat 
is become nice and tender, the meat should be so much cooked that', 'it should be coming off the bone and also when it is properly done, this meat will literally', 'melt in your mouth, that is when you got a perfect and a super awesome tasty Rogen Josh.', ""So dear friends, you don't need to add curd, no tomatoes, no onions."", 'Just with this masala, you will not believe how much awesome flavor this is already giving.', 'So let me put the lid on and if you need to add little water, you can keep adding little', 'water till you get the desired consistency.', 'After cooking it for almost another 30 minutes, the flavor of Rogen Josh has spread all', 'over and you can see how the Rogen means, this oil that is floating, red in color, look', 'at this.', 'That is what makes this awesome dish super to look at and tasty also and wow, you know', 'if you make it right, this will taste super fantastic.', 'It is so good.', 'Trust me, make it the way I have shown you and it will be super fantastic.', 'Dear friends, this is something magical, this is something super awesome.', 'But you use nice lamp shanks to make it and take it easy on the ginger powder and you', 'will get nice perfect Rogen Josh.', 'You know while we were in the college, they always used to add Ratanjog.', 'But when I was in the industry, they told no, no, no Ratanjog in this, the color should', 'come from the Kashmiri chillies, the flavor, the aroma should come from saffron and some', 'people also add coxcom is a kind of a flower which gives also a nice coloring.', 'Some people add that but for me, this is super perfect.', 'I am going to switch off the flame and I am going to enjoy it hot along with my non-vav.', 'Now look at this Rogen Josh.', 'Wow, what flavors along with this?', 'So much perfectly cooked and you know, especially when I cook meat like this, I want the meat', 'to be fully in my s and tanda but still retains some of the pink color.', 'As a reason why I did not add turmeric but some people add 
if you want, you can add', 'turmeric also but dear friends, wow.', 'You know, eat with basmati rice or nice mughalai, non like the one what I am eating.', 'This is super and the nice sauce is also nice sticky and super tasty.', ""Dear friends, I hope you enjoyed today's session of learning how to make this awesome"", 'Mutton Rogen Josh from Kashmir but do not forget, Vahrehvah is all about inspiring', 'others to cook.', 'So please post your recipes and cooking tips at vahrehvah.com.', 'So others can benefit from your great cooking.', 'Thank you.']","[0, 10840000, 14560000, 17440000, 20840000, 24000000, 29720000, 34519999, 38360000, 43120000, 45200000, 50200000, 56560000, 60400000, 67640000, 71040000, 78120000, 79720000, 84360000, 87360000, 93880000, 98800000, 104600000, 107000000, 114000000, 119160000, 124600000, 127560000, 134240000, 136200000, 139720000, 141920000, 145880000, 150760000, 153600000, 156800000, 161760000, 164640000, 167200000, 172920000, 173920000, 183200000, 184200000, 188320000, 192840000, 196760000, 201300000, 202300000, 204720000, 210840000, 212640000, 216520000, 221440000, 222440000, 224240000, 231680000, 234520000, 238280000, 246280000, 247280000, 252000000, 256839999, 259880000, 265399999, 266399999, 271240000, 276480000, 281080000, 285080000, 290200000, 293520000, 298280000, 302760000, 306320000, 312320000, 318640000, 322880000, 326640000, 331039999, 332560000, 340320000, 343520000, 347560000, 354120000, 359479999, 364799999, 371760000, 376640000, 383000000, 387600000, 390760000, 396800000, 405120000, 406280000, 416480000, 425760000, 426760000, 434760000, 444120000, 450560000, 454000000, 458600000, 464400000, 469760000, 475960000, 480320000, 487320000, 492760000, 503240000, 509240000, 514840000, 519480000, 522880000, 531160000, 538320000, 542840000, 547520000, 548520000, 551760000, 554000000]","[10840000, 14560000, 17440000, 20840000, 24000000, 29720000, 34519999, 38360000, 43120000, 45200000, 50200000, 56560000, 60400000, 67640000, 
71040000, 78120000, 79720000, 84240000, 87360000, 93880000, 98800000, 104600000, 107000000, 114000000, 119160000, 124600000, 127560000, 134240000, 136200000, 139000000, 141920000, 145880000, 150760000, 153600000, 156800000, 161760000, 164640000, 167200000, 172920000, 173920000, 183200000, 184200000, 188320000, 192839999, 196760000, 201300000, 202300000, 204720000, 210840000, 212640000, 216520000, 221440000, 222440000, 224240000, 231680000, 234520000, 238280000, 246280000, 247280000, 252000000, 256839999, 259880000, 265399999, 266399999, 271240000, 276480000, 281080000, 285080000, 290200000, 293520000, 298280000, 302760000, 306320000, 312320000, 318640000, 322880000, 326640000, 331039999, 332560000, 340320000, 343520000, 347560000, 354120000, 359400000, 364799999, 371760000, 376640000, 383000000, 387599999, 390760000, 396800000, 405120000, 406280000, 416480000, 425760000, 426760000, 434760000, 444120000, 450560000, 454000000, 458600000, 464400000, 469680000, 475960000, 480320000, 487320000, 492760000, 503240000, 509240000, 514840000, 519480000, 522880000, 531160000, 538320000, 542840000, 547520000, 548520000, 551760000, 554000000, 554360000]"
"BktdaTg6_E4","371900000","['mix vegetable oil salt and curry masala', 'marinate the lamb in a ziplock bag', 'season the lamb meat with salt', 'bake the lamb meat in an oven', 'blend garlic ginger cherry and onion and water', 'heat some clarified butter in a pan', 'add chopped onion and salt and saute', 'mix some cumin cinnamon black pepper and paprika', 'add the mixed spices the mixture and the lamb in']","[30000000, 62000000, 88000000, 91000000, 99000000, 123000000, 134000000, 156000000, 183000000]","[57000000, 75000000, 90000000, 98000000, 118000000, 133000000, 155000000, 172000000, 252000000]","['Hello, this is Chef John from Foodwishes.com with Lamb Shank Vindaloo.', ""That's right, I get a lot of complaints."", 'How come you never do Indian food?', ""It's because I'm scared."", ""I don't have a lot of experience with it."", 'I love to eat it.', 'But I thought I would give this one of my favorites to try.', 'This very spicy lamb type curry dish.', 'So I hope I got it close.', 'You Indian cuisine experts will be the judge.', 'So here we go.', ""So step one here, I'm going to put four lamb shanks in a plastic bag."", 'You need to get marinated overnight before we start the dish.', ""So I'm going to place those in."", ""And then into a bowl, I'm going to pour some cider vinegar, some vegetable oil, some salt,"", 'and then something called tamarind.', ""I'm using a tamarind concentrate."", ""And we'll talk a little bit about that on the blog."", ""But it's a very tart, sour kind of citrus-like ingredient."", 'All right, I started mixing that up and then I realized I never put the garmasala in,', 'which is a blend of Indian spices.', ""We've used that before."", 'We like it.', ""All right, so I'm going to mix that in and that's basically the marinade."", ""So we're going to pour that over the lamb shanks."", ""We're going to seal up that bag really well."", 'All right, just to confuse you, I put mine in a second bag as I thought I had a leak.', ""We're going to 
squeeze out as much air as possible so the meat is immersed in the marinade."", ""And then we're going to put that in the fridge overnight."", 'Not a bad idea to turn it over once in a while.', ""All right, the next day I'm going to pull it out of the bag."", ""I'm going to place it on an oiled foil lined sheet pan."", ""Don't throw away the marinade, by the way."", ""That's going in the stew later."", 'So just reserve the marinade.', ""I'm going to salt those generously on both sides."", ""And we're going to brown those in a very hot oven for 50, for 15 or 20 minutes until"", ""they're nice and brown."", ""We're going to pull those out and reserve them till needed."", ""Next up in a blender, we're going to add a lot of garlic, a lot of ginger, some cherry"", 'tomatoes, a nice big onion, and a little bit of water.', ""We're going to pulse that on and off until we have a nice smooth puree."", 'And it kind of looks like a delicious strawberry smoothie.', ""And yet it's so the opposite of that."", 'So just set that aside.', ""And it's back over to this stove where we're going to start the actual vindaloo."", ""So we're going to put a heavy Dutch oven on medium high heat."", ""And I'm going to put in some clarified butter."", 'Now this is supposed to be something called ghee, which is basically a clarified butter.', 'But clarified butter will work.', ""All right, so I'm going to put my butter in."", ""I'm going to throw a roughly chopped onion in there with a big pinch of salt."", ""And we're going to brown this."", ""And I'm not talking golden brown."", ""I'm talking almost golden black."", ""That's going to add sweetness and a depth of color to the sauce."", 'So just keep cooking them.', ""And right there you're thinking, that's probably good."", ""It's not."", 'Let them go further.', 'OK?', ""While those are browning, I'm going to get my spice blend together, which is cumin, cinnamon,"", 'black pepper, cayenne, and a lot of it, dry mustard, and paprika.', 'OK?', 
'And all that will be on the blog, of course.', ""All right, we're going to go back over the stove, check the onions, and now we're talking."", ""That's what we want."", 'Nicely browned, very dark edges.', 'Perfect.', ""And at that point, we're going to back the heat down to medium and dump in the spices."", ""And we're going to kind of toast the spices in that hot butter."", 'And that really wakes up the flavor, and it really, really adds an extra dimension, which', ""I guess would be the fifth dimension, but who's keeping track?"", 'Not only needs to cook for about two minutes, but it really does make a difference.', ""All right, after that, we're going to go ahead and dump in the marinade that was left"", 'over from the bag of lamb.', 'All right, remember that was the cider and the tamarind and the oil.', ""All right, so I'm going to dump that in."", ""And then we're going to dump in the mixture from the blender, the onion, the tomato, the"", 'ginger, the garlic.', ""We're going to give that a stir."", ""We're going to raise the heat up to high."", 'We want to bring this up to a simmer.', ""And before we put the lamb back in, we're going to go ahead and add a little bit of brown"", 'sugar.', 'Just to balance out that acidity and heat, all right, so stir that in.', 'And then we can place our lamb back in.', ""And if you're using a similar sized pot, you should have enough liquid to just almost"", 'come up to the top.', ""It doesn't have to be totally covered."", ""This is going to stew for three hours, and we're going to turn these several times."", ""So as long as you have that much liquid, you're okay."", ""If you need that, another splash of water, don't be afraid."", ""Don't forget you can always reduce sauces later."", 'So once the lamb goes in, I want you to turn the heat down to low.', 'I want you to cover it tightly, and I want you to simmer that very slowly on very low', 'heat for about three hours, all right, not a bad idea to turn it over once in a 
while,', ""and all you're trying to do, and why there's no way to screw up the cooking part of this."", ""You're just going to simmer it until the meat's tender."", 'See how that fork goes right into that meat?', ""That's done, all right, so like I said, it's going to take about three hours, but don't"", 'quote me on it.', 'Could take two and a half, could take four, plan accordingly.', ""All right, at that point, I'm going to go ahead and remove the lamb from the pot."", 'You can just cover it with foil while we finish the sauce, and finishing the sauce means', 'two things, the old skim and season.', ""So we're going to turn the heat up a little bit."", 'We want to bring this back to a simmer, and we need to skim off all that fat.', ""There's a ton of it."", 'Just take your ladle and skim all the fat off before you serve it, all right, and besides', 'deep fatifying the top, the other thing you should do is taste for seasoning.', ""Although I highly doubt you're going to have to do much adjusting."", 'But you know what, check just in case, maybe add a little salt.', ""And that's it, go ahead and throw your lamb shank on a plate."", ""I'm serving mine next to some lentils and rice."", ""I'm going to spoon over that incredible sauce."", ""I'm going to garnish with some whole cilantro leaves."", 'And there you go, authenticity, notwithstanding, this was a super delicious, incredibly tasty,', 'very spicy, exciting dinner.', 'Really forked tender, should just fall right off the bone, and just a very complex flavor.', 'Spicy, sweet, sour, aromatic, and that beautiful, subtly gamey lamb, just the absolute perfect', 'meat for this.', 'So I really, really hope you give this a try.', 'Head over to FoodWishes.com for all the ingredient amounts and more info, as usual.', 'And as always, enjoy.']","[0, 6000000, 7360000, 9280000, 10280000, 11400000, 12400000, 14400000, 17600000, 19120000, 21440000, 22440000, 25920000, 28840000, 30520000, 36760000, 38320000, 40240000, 43080000, 
45879999, 50640000, 53519999, 54519999, 55519999, 57800000, 60320000, 62519999, 66519999, 70399999, 72880000, 75039999, 78000000, 81600000, 83120000, 84360000, 86000000, 88720000, 94200000, 95600000, 98400000, 102960000, 107800000, 111840000, 115440000, 117440000, 118880000, 122360000, 126640000, 128960000, 133120000, 134400000, 136079999, 140000000, 141440000, 143440000, 146160000, 149400000, 150680000, 152560000, 153640000, 154640000, 155640000, 162680000, 168960000, 169960000, 171960000, 176080000, 177600000, 179920000, 180920000, 184800000, 188080000, 193240000, 196760000, 200440000, 204079999, 205600000, 209239999, 211679999, 214840000, 215840000, 217920000, 219480000, 221360000, 224399999, 225480000, 229280000, 232320000, 237000000, 238400000, 240000000, 243720000, 246480000, 249480000, 252200000, 255519999, 261320000, 266760000, 271760000, 274560000, 276440000, 280159999, 281160000, 284440000, 288520000, 292160000, 294720000, 296000000, 300120000, 301240000, 305960000, 310080000, 312359999, 315840000, 318560000, 321159999, 324159999, 326960000, 333560000, 336480000, 341760000, 349160000, 350160000, 352360000, 356840000]","[6000000, 7360000, 9280000, 10280000, 11400000, 12400000, 14400000, 17600000, 19120000, 21440000, 22440000, 25920000, 28840000, 30520000, 36760000, 38320000, 40240000, 43080000, 45879999, 50640000, 53519999, 54519999, 55519999, 57800000, 60320000, 62519999, 66519999, 70399999, 72880000, 75039999, 78000000, 81600000, 83120000, 84360000, 86000000, 88720000, 94200000, 95600000, 98400000, 102960000, 107800000, 111840000, 115440000, 117440000, 118880000, 122360000, 126640000, 128960000, 133120000, 134400000, 136079999, 140000000, 141440000, 143440000, 146160000, 149400000, 150680000, 152560000, 153640000, 154640000, 155640000, 162680000, 168960000, 169960000, 171960000, 176080000, 177600000, 179920000, 180920000, 184800000, 188080000, 193240000, 196760000, 200440000, 204079999, 205600000, 209239999, 211679999, 214840000, 215840000, 217920000, 
219480000, 221360000, 224399999, 225399999, 229280000, 232320000, 237000000, 238400000, 240000000, 243720000, 246480000, 249480000, 252200000, 255519999, 261320000, 266760000, 271760000, 274560000, 276440000, 280159999, 281159999, 284440000, 288520000, 292160000, 294720000, 296000000, 300120000, 301240000, 305960000, 310080000, 312359999, 315840000, 318560000, 321159999, 324159999, 326960000, 333560000, 336479999, 341760000, 349160000, 350160000, 352360000, 356840000, 358720000]"

One thing I noticed is that the ground-truth captions are all lowercase, while the ASR results are not. Also, the ground-truth captions do not include punctuation, while the ASR results do.

Not sure if it's because of these differences, but the prediction sentences have capitalized first letters.
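One way to rule out the case and punctuation mismatch is to normalize both sides before comparing them. The sketch below is only an illustration: `normalize_caption` is a hypothetical helper, not part of Scenic, and this thread does not establish whether the Scenic metric code already applies a similar normalization.

```python
import string


def normalize_caption(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace.

    Hypothetical helper for illustration only. Note that apostrophes are
    removed as well, so "We're" becomes "were".
    """
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


# A prediction and a ground-truth caption taken from the records in this thread.
pred = "Add some garlic ginger paste."
gt = "add garlic ginger paste and chopped onions to the pot"
print(normalize_caption(pred))
print(normalize_caption(gt))
```

Scoring the normalized and raw strings side by side would show how much of the gap, if any, comes from formatting alone.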

{"key": "fn9anlEL4FI", "pred": ["We're going to drop in some whole seeds or garam masala.", "We're going to fry the lamb.", "Stir it over.", "Add some garlic ginger paste.", "Add chopped onions. Add some coriander cumin and turmeric.", "Add some chili powder chili coriander cumin turmeric and salt and cook.", "Add some water to the pan.", "Bring the water to a boil and simmer.", "Take a look at the meat.", "Add some chopped potatoes.", "Stir them in add some water and let it cook.", "Turn the heat down and simmer for 78 minutes.", "Turn off the heat and let the meat cool down completely and then turn on the heat again."], "gts": ["add garram masala seeds and a bay leaf to the oil", "add the lamb to the pot", "add garlic ginger paste and chopped onions to the pot", "add chili tumeric coriander cumin and salt", "add water to the pot", "add potatos to the pot", "add the tomatos to the pot", "add chili to the pot"], "pred_timestamps": [[0, 24762626], [29715151, 59430303], [69335353, 99050505], [108955555, 128765656], [133718181, 143623232], [168385858, 208006060], [212958585, 222863636], [227816161, 257531313], [277341414, 287246464], [297151515, 312009090], [321914141, 356581818], [361534343, 371439393], [376391919, 416012121]], "gts_timestamps": [[30000000, 39000000], [69000000, 86000000], [136000000, 149000000], [170000000, 183000000], [230000000, 238000000], [309000000, 333000000], [383000000, 390000000], [438000000, 443000000]]}
{"key": "-dh_uGahzYo", "pred": ["We are going to add javitri mace cinnamon green cardamom and black cardamom to the meat and cook the meat with it.", "We are going to add hing powder chili powder ginger powder and final seed powder to a blender.", "Add water and mix it into a watery paste.", "Add cumin seeds green cardamom cinnamon stick and black cardamom to the meat and cook the meat in the ghee. When the ghee heats up add pieces of lamb to it and cook the meat on a slow flame. Now add the paste to the ghee. Add the paste to the meat. Add the meat to the ghee. Add the paste to the meat. Add the meat to the ghee. Add the paste to the meat. Add the meat to the ghee. Add the paste to the meat.", "Add the meat to the ghee and cook.", "Add the paste to the meat.", "Add salt to the meat.", "Add garlic powder salt black pepper and mix it all together.", "Add"], "gts": ["mix hanger chili powder ginger powder fennel powder and water", "add cumin seeds green cardamom cinnamon sticks to a blender", "heat some ghee in a pan", "add the black cardamom to the pan", "add the mutton to the pan", "add the mixture", "season with salt and cover the pot", "add the blended spice to the pot", "cover the pot"], "pred_timestamps": [[0, 28358080], [51044545, 73731010], [79402626, 113432323], [119103939, 153133636], [272237575, 289252424], [294924040, 300595656], [300595656, 306267272], [306267272, 334625353], [340296969, 345968585]], "gts_timestamps": [[105000000, 120000000], [125000000, 132000000], [138000000, 145000000], [146000000, 148000000], [183000000, 196000000], [224000000, 230000000], [247000000, 259000000], [334000000, 345000000], [381000000, 383000000]]}
{"key": "BktdaTg6_E4", "pred": ["Put the lamb shanks in a plastic bag and marinate with cider vinegar vegetable oil salt and tamarind concentrate.", "Pour the marinade over the lamb and seal it well and keep it in the fridge.", "Take the lamb shanks out of the marinade and brown them in the oven.", "Add garlic ginger tomatoes onion and water to a blender and pulse.", "Put some clarified butter a roughly chopped onion a pinch of salt and brown it.", "Keep cooking until they turn golden brown.", "Add cumin cinnamon black pepper cayenne pepper and cayenne pepper to the spice blend. Add that to the sauce. Add the spice blend to the sauce.", "Put the lamb shanks in a bowl and pour cider vinegar vegetable oil salt and tamarind concentrate over it.", "Put the sauce on and serve."], "gts": ["mix vegetable oil salt and curry masala", "marinate the lamb in a ziplock bag", "season the lamb meat with salt", "bake the lamb meat in an oven", "blend garlic ginger cherry and onion and water", "heat some clarified butter in a pan", "add chopped onion and salt and saute", "mix some cumin cinnamon black pepper and paprika", "add the mixed spices the mixture and the lamb in"], "pred_timestamps": [[0, 22539393], [22539393, 56348484], [82644444, 97670707], [101427272, 116453535], [123966666, 154019191], [154019191, 157775757], [169045454, 195341414], [217880808, 236663636], [240420202, 244176767]], "gts_timestamps": [[30000000, 57000000], [62000000, 75000000], [88000000, 90000000], [91000000, 98000000], [99000000, 118000000], [123000000, 133000000], [134000000, 155000000], [156000000, 172000000], [183000000, 252000000]]}
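To sanity-check the localization numbers independently of the captioning metrics, the predicted and ground-truth segments can be compared directly. The sketch below assumes the common dense-captioning definition (a prediction counts as correct if its temporal IoU with any ground-truth segment reaches the threshold, and symmetrically for recall). It is not taken from the Scenic evaluation code, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Sequence[int]  # [start, end] in microseconds, as in the records above


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Segment], gts: List[Segment], threshold: float
) -> Tuple[float, float]:
    """Assumed definition: a segment is matched if any segment on the
    other side overlaps it with IoU >= threshold."""
    if not preds or not gts:
        return 0.0, 0.0
    hit_preds = sum(any(temporal_iou(p, g) >= threshold for g in gts) for p in preds)
    hit_gts = sum(any(temporal_iou(p, g) >= threshold for p in preds) for g in gts)
    return hit_preds / len(preds), hit_gts / len(gts)


# First three predicted and ground-truth segments of video "BktdaTg6_E4".
preds = [[0, 22539393], [22539393, 56348484], [82644444, 97670707]]
gts = [[30000000, 57000000], [62000000, 75000000], [88000000, 90000000]]
for thr in (0.3, 0.5, 0.7, 0.9):
    print(thr, precision_recall(preds, gts, thr))
```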

To make reproduction easier, could you consider releasing the Google ASR results for ActivityNet Captions and YouCook2 (at least for the validation split)? My guess is that the trained checkpoint is not very robust to changes in the ASR data.

antoyang commented 1 year ago

The ASR input for Vid2Seq was formatted into sentences. The sentence segmentation tool I used is the one from the Google API. It is unfortunately not possible to release the Google ASR results. With Whisper, maybe the sentence segmentation tool from https://github.com/m-bain/whisperX would help? I think I also formatted the ground-truth captions with capitalization and added a period at the end (either during data processing or data loading).
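As a rough stand-in for a proper sentence segmenter, ASR segments that break mid-sentence can be merged until one ends with sentence-final punctuation, and ground-truth captions can be given a capital first letter and a final period. This is a minimal sketch under those assumptions. It does not reproduce the Google API segmentation or the original Vid2Seq preprocessing, both function names are hypothetical, and the timestamps in the example are illustrative placeholders rather than values from the data.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, int, int]  # (text, start_us, end_us)
SENTENCE_END = (".", "!", "?")


def merge_into_sentences(segments: List[Segment]) -> List[Segment]:
    """Concatenate consecutive segments until one ends a sentence."""
    merged: List[Segment] = []
    buffer: List[str] = []
    start: Optional[int] = None
    end = 0
    for text, seg_start, seg_end in segments:
        text = text.strip()
        if not text:
            continue
        if start is None:
            start = seg_start
        buffer.append(text)
        end = seg_end
        if text.endswith(SENTENCE_END):
            merged.append((" ".join(buffer), start, end))
            buffer, start = [], None
    if buffer and start is not None:  # trailing fragment without punctuation
        merged.append((" ".join(buffer), start, end))
    return merged


def format_caption(caption: str) -> str:
    """Capitalize the first letter and make sure the caption ends with a period."""
    caption = caption.strip()
    if not caption:
        return caption
    caption = caption[0].upper() + caption[1:]
    return caption if caption.endswith(SENTENCE_END) else caption + "."


# Transcript text from the Whisper output above; timestamps are placeholders.
segments = [
    ("And we're going to brown those in a very hot oven for 50, for 15 or 20 minutes until", 0, 4000000),
    ("they're nice and brown.", 4000000, 6000000),
]
print(merge_into_sentences(segments))
print(format_caption("season the lamb meat with salt"))
```

A punctuation-based merge is only as good as the ASR punctuation, so a dedicated tool such as WhisperX is likely the better option where available.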

thechargedneutron commented 1 year ago

@dreamgonfly Were you able to reproduce the results using some other ASR text and/or by improving your current implementation? I am also interested in using a fine-tuned dense captioning model for my application. Thanks!

ee2110 commented 1 year ago

Hi @dreamgonfly, would you mind sharing the code showing how you implemented it? I am trying to run Vid2Seq, but there are issues I am unable to solve, for example:

AttributeError: module 'flax.config' has no attribute 'update'

For some of these bugs, I have no idea which line to change.

Inch-Z commented 1 year ago

@antoyang Have you solved this problem? I've run into the same issue.

anilbatra2185 commented 1 year ago

Hi @dreamgonfly, I wonder if you tried training the model without transcripts and got results similar to Table 2, Row #1, as this does not need any pre-training or ASR.

@antoyang, I followed the same steps mentioned by @dreamgonfly; however, I am unable to train the model with visual input only. The training loss becomes NaN after a few iterations and the caption metrics are always zero. Can you suggest something to resolve the issue?

Thanks!
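To narrow down where the loss diverges, it can help to log the per-step loss and locate the first non-finite value, then inspect the batch and learning rate around that step. The sketch below is plain Python with a hypothetical helper name, and the loss values are made up for illustration.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up loss curve that blows up at step 3.
logged = [6.91, 6.40, 9.75, float("nan"), float("nan")]
print(first_non_finite_step(logged))
```

JAX also provides the `jax_debug_nans` configuration flag (`jax.config.update("jax_debug_nans", True)`), which raises an error at the operation that first produces a NaN. Whether that is practical with the Scenic trainer has not been checked here.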

anilbatra2185 commented 12 months ago

I am unable to reproduce the results of Row #1 in Table 2, i.e., using only visual input without any pre-training.

image

I am using a single A100 (80 GB) GPU to run the code with a batch size of 32. With the default config, I got a NaN loss during training. To avoid it, I modified the following params:

YouCook2 Dataset Details:

Here are the results: image

Predictions: pred_txt.txt

If anyone can help or suggest something to reproduce the results, that would be great!

dreamgonfly commented 11 months ago

Hi @thechargedneutron. I still cannot reproduce the results. With ASR results from Whisper I could improve the results a bit, but they were still far below the numbers reported in the paper.

@ee2110 I can provide you with the detailed instructions on how I run the vid2seq code. Please email me if you're interested!

@anilbatra2185 I have not tried training the model yet. I was going to evaluate the released checkpoint first, but I got stuck.

anilbatra2185 commented 11 months ago

Thanks @dreamgonfly for replying!

Since I am unable to reproduce the results even with only visual input, I think ASR (from Whisper) might not be the issue. There might be a missing config parameter, or a change in code behaviour with the latest versions of the libraries it uses.
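If library drift is a suspect, recording the installed versions makes it easier to compare environments across people in this thread. The sketch below uses only the standard library. The package list is an assumption about what a Scenic environment typically contains and should be adjusted as needed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; edit to match the actual environment.
for name, version in installed_versions(
    ["jax", "jaxlib", "flax", "optax", "tensorflow"]
).items():
    print(f"{name}: {version}")
```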

dreamgonfly commented 11 months ago

@antoyang I tried WhisperX and the results got slightly better, but they are still far below the reported performance (e.g., METEOR 4.3 on YouCook2 with Whisper -> 4.5 with WhisperX vs. 9.3 in the paper). This result is from the released checkpoint without training any parameters.

Based on these results, I agree with @anilbatra2185 that the main concern might not be with ASR, but with visual features or other parts of the code.

antoyang commented 11 months ago

Interesting. I don't have the bandwidth to look at it further now but will release a PyTorch implementation by the end of September.

PKUCSS commented 11 months ago

@antoyang Thanks for the great work. Could you please release the prediction results for our reference?

BaoliangChen-stu commented 10 months ago

> Interesting. I don't have the bandwidth to look at it further now but will release a PyTorch implementation by the end of September.

Great work, and I'm looking forward to seeing your PyTorch implementation. Could you provide an inference script for an arbitrary input video?

Hasnat79 commented 10 months ago

@dreamgonfly Could you help us by sharing your implementation instructions and code? I am trying to run inference on multiple event-based videos.

antoyang commented 10 months ago

@PKUCSS I do not have access to this, given that it was internship work. @BaoliangChen-stu A PyTorch implementation (with a few differences explained in the readme) is included here: https://github.com/antoyang/VidChapters. It also includes an example inference script.