enrico310786 opened 2 years ago
Hi, thanks for your questions:
Hi, thanks for the hint.
If I want to use an Italian BERT with the corresponding Italian tokenizer, in order to pre-train the model on the Pre-training datasets (with the sentences translated into Italian), could I use the pretrain script?
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain
After that, I could fine-tune the pretrained model on the COCO dataset, again with the sentences translated into Italian. Would that make sense?
Yes, you could do that.
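For example, here is a minimal sketch of the tokenizer swap, assuming you patch `init_tokenizer` in `models/blip.py`. The checkpoint `dbmdz/bert-base-italian-xxl-cased` is just one publicly available Italian BERT, not the only option; you would likely also need to update `vocab_size` in the med config so it matches the new vocabulary plus the added special tokens, and point the text encoder's `from_pretrained` call at the same checkpoint if you want Italian weights rather than just the tokenizer:

```python
# Illustrative patch, not an official option: replace the hard-coded
# English checkpoint with an Italian one so the vocabulary matches the
# translated captions.
from transformers import BertTokenizer

def init_tokenizer():
    tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
    # BLIP adds its own control tokens on top of the base vocabulary, so
    # vocab_size in the text-encoder config must grow by the same amount.
    tokenizer.add_special_tokens({'bos_token': '[DEC]'})
    tokenizer.add_special_tokens({'additional_special_tokens': ['[ENC]']})
    tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]
    return tokenizer
```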
Hi, congratulations on the results.
My questions are about the correct use of the output features for the retrieval task and about the fine-tuning phase.

1) In the Colab notebook, in the 'Feature extraction' section, the model has three possible outputs: multimodal_feature, image_feature and text_feature. Are these the outputs of, respectively, the image-grounded text encoder, the image encoder and the text encoder? So, if I want to check whether two image+text pairs are similar, I should measure the distance between their multimodal_features, right? If, on the other hand, I only want to measure the similarity between images or between texts, I should use just the image_feature or the text_feature, right? (A sketch of this usage follows at the end of this message.)

2) To perform the feature extraction, I see that the Colab notebook uses model_base.pth. May I instead use the models already fine-tuned for Image-Text Retrieval (COCO), i.e. BLIP w/ ViT-B or BLIP w/ ViT-L? Are they able to extract the multimodal_feature, image_feature and text_feature in the same way as model_base?

3) Instead of fine-tuning the Image-Text Retrieval model on the COCO dataset, may I use the same script with another dataset, for example one whose image captions are in Italian instead of English? If the language changes, is it necessary to change parameters in the BERT language model, or does the model adapt to Italian during fine-tuning?
Many thanks, Enrico
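For concreteness, here is a minimal sketch of the usage I have in mind for question 1, following the feature-extraction section of the Colab notebook. The checkpoint path, image files and captions are placeholders; `blip_feature_extractor` and the preprocessing are taken from the demo:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_feature_extractor

def load_image(path, image_size=224):
    # Same preprocessing as the demo notebook (CLIP normalization stats).
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),
    ])
    return transform(Image.open(path).convert('RGB')).unsqueeze(0)

model = blip_feature_extractor(pretrained='model_base.pth', image_size=224, vit='base')
model.eval()

img_a, img_b = load_image('a.jpg'), load_image('b.jpg')
with torch.no_grad():
    # mode='multimodal' -> image-grounded text encoder (joint image+text embedding)
    # mode='image'      -> ViT image encoder only
    # mode='text'       -> BERT text encoder only
    feat_a = model(img_a, 'first placeholder caption', mode='multimodal')[0, 0]
    feat_b = model(img_b, 'second placeholder caption', mode='multimodal')[0, 0]
    # Cosine similarity between the two image+text pairs; for image-only or
    # text-only similarity, switch mode to 'image' or 'text'.
    print(F.cosine_similarity(feat_a, feat_b, dim=0).item())
```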