Closed: HYOJINPARK closed this issue 4 months ago.
Hello, thank you for your attention.
For the first question, finetune.sh applies to both of the last two stages; you just need to modify the paths for the input/output models and the dataset you want to use.
For the second question, your format is correct. You just need to replace "image_folder" in the shell script with the path to your image folder. You can refer to the code here: https://github.com/lzw-lzw/GroundingGPT/blob/1f3f53e5e899b7ae24fe3c8b8bdc803ba66a3f0a/lego/train/train.py#L649-L658. I will update the dataset folder details as soon as possible.
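For illustration, the path resolution described above could be sketched as follows. The "image" key and the example relative path are assumptions for this sketch, not necessarily the exact fields used in the linked train.py code:

```python
import os

def resolve_image_path(image_folder, entry):
    """Join the dataset root with one record's relative image path.

    `entry` is assumed to be a single record from the training JSON,
    with its image stored under a hypothetical "image" key holding a
    relative path such as "COCO/train2017/000000000009.jpg".
    """
    return os.path.join(image_folder, entry["image"])
```

With this scheme, the JSON only stores paths relative to the dataset root, so moving the image folder just means changing the "image_folder" argument in the shell script.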
Yes, you can choose any modality of data you want for training.
Hi @lzw-lzw Thanks for your prompt reply
One point of confusion: when I open https://huggingface.co/datasets/zwli/GroundingGPT/tree/main/Stage1, there is only Wavecaps.json there. Likewise, Stage2 has only didemo, refcoco, vggs, and visual_genome, and Stage3 has clotho, activity caption, and flickr30k.
Did you use only Wavecaps for Stage1?
What data should I use at each stage?
Hi, the data used for the last two stages is just the files within the corresponding directories of the HuggingFace dataset. For the first stage, in addition to Wavecaps as the audio-modality data, it also includes the pretraining data from LLaVA and VALLEY.
Hi @lzw-lzw, thanks again for your kind reply. Did you download the Valley dataset using this link? https://huggingface.co/datasets/luoruipu1/Valley-Instruct-65k/blob/main/get_jukinmedia_videourl.py With code like the following?
response = requests.post('https://www.jukinmedia.com/api/public/video/downloadVideo/' + jmId, headers=headers)
Actually... it looks like the link is not working...
Actually, I used Valley data that someone else had already downloaded and stored on an internal NAS, so I am not familiar with the details of the download process. The Valley repository may provide a solution or guidance for this.
Got it. Thanks for all your help!!!
Hi, thanks for your great work.
I am trying to reproduce your code, so I would like to ask for more details. The released dataset has three stages, matching the paper's description in Section 3.2, but the scripts include only pretrain.sh and finetune.sh.
Also, could you describe the dataset folder layout in more detail? Is a layout like the following okay?

image_folder/
├── COCO/
├── OCR-VQA/
└── ... etc.
Also, could I train the model without the sound dataset?