nmalboubi closed this issue 2 years ago
Do you want to use a pre-trained model to generate captions for any given images (not train/val/test)?
First, you have to use an OCR system to recognize the text in the image and a detection model to detect visual objects. Then prepare the dataset in the format required by the code so that the code can run on it normally.
Both our method and M4C concentrate on the downstream reasoning task, so you need to prepare the multimodal features first.
@guanghuixu yes, exactly! pre-trained model to generate captions for any given images. I have the OCR system to pull the raw text from the image. As for the detection model, I'm using
https://github.com/facebookresearch/mmf/blob/master/tools/scripts/features/extract_resnet152_feat.py
to extract the image features. However, I'm not sure what to do after this step. Where would I feed these two pieces of information to run the AnchorCaptioner or M4C?
Just imitate the annotation format of the original dataset to generate a new validation set containing the images you want to process. So you need to look at the annotation files in the original dataset directory. For example, for visual features there are two annotation files per image, imageid.npy and imageid_info.npy. After generating the required annotation files, you only need to put them in the corresponding directory.
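A minimal sketch of writing that pair of files per image. The array shapes and the keys in the info dict (`bbox`, `num_boxes`, etc.) are assumptions about the schema, not the repo's exact spec; check an existing file from the released dataset to confirm the fields.

```python
import numpy as np

# Hypothetical example: save extracted region features in the paired
# format the loader expects -- <imageid>.npy holding the feature matrix
# and <imageid>_info.npy holding metadata such as bbox coordinates.
image_id = "sample_0001"
num_boxes = 10
feat_dim = 2048  # ResNet-152 region feature size (assumed)

features = np.random.rand(num_boxes, feat_dim).astype(np.float32)
info = {
    "image_id": image_id,
    "num_boxes": num_boxes,
    "bbox": np.random.rand(num_boxes, 4).astype(np.float32),
    "image_width": 640,
    "image_height": 480,
}

np.save(f"{image_id}.npy", features)
# A dict is stored by np.save as a 0-d object array (pickled).
np.save(f"{image_id}_info.npy", info)
```

Drop the two files into the same directory layout as the released val/test features and the loader should pick them up.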
Thank you for the swift reply. To make sure I'm looking at the correct files, can you please post the links to the "annotation files in the original dataset directory"? Are they the imdb .npy files found here:
https://github.com/guanghuixu/AnchorCaptioner/releases/tag/data?
https://github.com/guanghuixu/AnchorCaptioner/blob/main/configs/captioning/m4c_textcaps/m4c_captioner.yml#L6. Please refer to the annotation files of the train/val/test directory.
Thank you. Is there any documentation I can reference for the link you posted? Are the .npy files under imdb the image features, and are the train/val/test files above imdb the OCR extractions? Is there any documentation on getting the data into these formats?
Actually, these files follow the original TextVQA/TextCaps dataset; I haven't seen any other API to obtain them. But it should be very easy, since they just contain some necessary information, such as the feature embeddings and bbox coordinates ...
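If you are unsure what an existing info file holds, a quick way is to load one and print its keys. Note that the file stores a Python dict, so `allow_pickle=True` and `.item()` are needed; the file name and keys below are illustrative placeholders (the sketch creates its own temp file so it runs standalone).

```python
import os
import tempfile
import numpy as np

# Hedged sketch: round-trip an info dict through np.save / np.load to
# show how to inspect a <imageid>_info.npy file. The keys here are
# assumptions, not the repo's exact schema.
info = {"image_id": "demo", "bbox": np.zeros((5, 4), dtype=np.float32)}

path = os.path.join(tempfile.mkdtemp(), "demo_info.npy")
np.save(path, info)

# allow_pickle=True is required because the dict was pickled into a
# 0-d object array; .item() unwraps it back into a plain dict.
loaded = np.load(path, allow_pickle=True).item()
print(sorted(loaded.keys()))   # → ['bbox', 'image_id']
print(loaded["bbox"].shape)    # → (5, 4)
```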
Excuse me, have you found a way to use the model to get the caption for a sample image? Thanks!
Hello, I've seen your problem and have a similar one. I want to use another method to extract features, but the files generated at the end need to be in the .npy and info.npy format. Is there any source code in this repository I can reference for extracting the features?
I've loaded the data and pre-trained models by following the installation instructions. However, prediction seems to focus only on the val/test data. How would I use the model to get the caption for an arbitrary sample image? Can you share any resources on this?