nmalboubi closed this issue 2 years ago
Do you want to use a pre-trained model to generate captions for any given images (not train/val/test)?
First, you have to use an OCR system to recognize the text in the image and a detection model to detect visual objects. Then prepare the dataset in the format required by the code so that the code can run on it normally.
Both our method and M4C concentrate on the downstream reasoning task, so you need to prepare the multimodal features first.
@guanghuixu yes, exactly! pre-trained model to generate captions for any given images. I have the OCR system to pull the raw text from the image. As for the detection model, I'm using
https://github.com/facebookresearch/mmf/blob/master/tools/scripts/features/extract_resnet152_feat.py
to extract the image features. However, I'm not sure what to do after this step. Where would I feed these two pieces of information to run the AnchorCaptioner or M4C?
Just imitate the annotation format of the original dataset to generate a new validation set containing the images you want to process. So you need to look at the annotation files in the original dataset directory. For example, for visual features there are two annotation files per image, imageid.npy and imageid_info.npy. After generating the required annotation files, you only need to put them in the corresponding directory.
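A minimal sketch of writing that pair of files per image. The array shapes and the keys in the info dict (`bbox`, `num_boxes`, etc.) are assumptions about the schema, not the repo's exact spec; check an existing file from the released dataset to confirm the fields.

```python
import numpy as np

# Hypothetical example: save extracted region features in the paired
# format the loader expects -- <imageid>.npy holding the feature matrix
# and <imageid>_info.npy holding metadata such as bbox coordinates.
image_id = "sample_0001"
num_boxes = 10
feat_dim = 2048  # ResNet-152 region feature size (assumed)

features = np.random.rand(num_boxes, feat_dim).astype(np.float32)
info = {
    "image_id": image_id,
    "num_boxes": num_boxes,
    "bbox": np.random.rand(num_boxes, 4).astype(np.float32),
    "image_width": 640,
    "image_height": 480,
}

np.save(f"{image_id}.npy", features)
# A dict is stored by np.save as a 0-d object array (pickled).
np.save(f"{image_id}_info.npy", info)
```

Drop the two files into the same directory layout as the released val/test features and the loader should pick them up.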
Thank you for the swift reply. To make sure I'm looking at the correct files, can you please post the links to the "annotation files in the original dataset directory"? Are they the imdb .npy files found here:
https://github.com/guanghuixu/AnchorCaptioner/releases/tag/data?
https://github.com/guanghuixu/AnchorCaptioner/blob/main/configs/captioning/m4c_textcaps/m4c_captioner.yml#L6. Please refer to the annotation files of the train/val/test directory.
Thank you. Is there any documentation I can reference for the link you posted? Are the .npy files under imdb the image features, and are the train/val/test files above imdb the OCR extractions? Is there any documentation on getting the data into these formats?
Actually, these files follow the original TextVQA/TextCaps dataset; I haven't seen any other API to obtain them. But it should be very easy, since they just contain some necessary information, such as the feature embeddings and bbox coordinates ...
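If you are unsure what an existing info file holds, a quick way is to load one and print its keys. Note that the file stores a Python dict, so `allow_pickle=True` and `.item()` are needed; the file name and keys below are illustrative placeholders (the sketch creates its own temp file so it runs standalone).

```python
import os
import tempfile
import numpy as np

# Hedged sketch: round-trip an info dict through np.save / np.load to
# show how to inspect a <imageid>_info.npy file. The keys here are
# assumptions, not the repo's exact schema.
info = {"image_id": "demo", "bbox": np.zeros((5, 4), dtype=np.float32)}

path = os.path.join(tempfile.mkdtemp(), "demo_info.npy")
np.save(path, info)

# allow_pickle=True is required because the dict was pickled into a
# 0-d object array; .item() unwraps it back into a plain dict.
loaded = np.load(path, allow_pickle=True).item()
print(sorted(loaded.keys()))   # → ['bbox', 'image_id']
print(loaded["bbox"].shape)    # → (5, 4)
```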
Excuse me, have you found a way to use the model to get the caption for a sample image? Thanks!
Hello, I've seen your problem and have a similar one. I want to use another method to extract features, but the files generated at the end need to be in the .npy and info.npy format. Is there any source code in this repository I can reference for extracting the features?
I've loaded the data and pre-trained models by following the installation instructions. However, prediction seems to focus only on the val/test data. How would I use the model to get the caption for an arbitrary sample image? Can you share any resources on this?