taokz / BiomedGPT

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks

Some questions about recall and precision. #13

Closed sjjzhuri closed 6 months ago

sjjzhuri commented 6 months ago

Hello, I would like to understand how you calculate recall and precision when computing F1. Is it done based on sets of words or how is it obtained? Thank you.

taokz commented 6 months ago

Hi @sjjzhuri Do you mean the F1 score for the VQA task? If so, we treat VQA as classification, and there is a set of answer candidates, so precision, recall, and F1 are calculated as in a multi-class classification problem. In the implementation, I directly use sklearn.metrics.f1_score.
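
For reference, here is a minimal sketch of the metric computation described above, treating VQA as multi-class classification over answer strings. The answer labels and the macro-averaging choice are placeholder assumptions for illustration, not values taken from the paper or the repository.

```python
# Sketch: VQA evaluated as multi-class classification over answer candidates.
# The label lists below are made-up placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["yes", "no", "pneumonia", "yes"]  # ground-truth answers
y_pred = ["yes", "no", "normal", "no"]      # model-predicted answers

# Macro averaging is an assumption here; other averaging modes are possible.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```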

sjjzhuri commented 6 months ago

Hello, thank you very much for your response. Yes, I was asking about the visual question answering (VQA) task. I would like to know how you create the candidate file. Do you use all the answers from the test set as candidate answers? Are there any differences in the candidates when calculating metrics for closed-ended questions and open-ended questions? Additionally, could you share how you select the answer from the candidates? Do you map the sentences to a fixed-dimensional vector and then calculate similarity scores between the generated vector and the vectors for each candidate answer, or do you use some other method, perhaps based solely on word-level calculations? I have recently been confused by the various methods, and I sincerely hope you can help clarify these questions. Thank you very much.

taokz commented 6 months ago
  1. Do you use all the answers from the test set as candidate answers? The candidate answers come from the training set; no information from the test set is used.

  2. Are there any differences in the candidates when calculating metrics for closed-ended questions and open-ended questions? There is no difference in how closed-ended and open-ended questions are handled; both can be treated as a multi-class classification problem.

  3. Additionally, could you share how you select the answer from the candidates? Do you map the sentences to a fixed-dimensional vector and then calculate similarity scores between the generated vector and the vectors for each candidate answer, or do you use some other method, perhaps based solely on word-level calculations? The model generates the answer auto-regressively; to constrain generation to the candidate set, trie-based beam search is used (see Extended Figure 1d in the paper for an illustrative explanation, and the sketch after this list). The similarity calculation you mention is how CLIP-style models work, not how a GPT-style model works (see Figure 5 for reference). In addition, I want to mention that in the zero-shot setting without fine-tuning, I did not set any answer candidates; the model simply generates a free-form response.
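
For intuition only, below is a minimal, self-contained sketch of trie-constrained decoding: the tokenized candidate answers (drawn from the training set, as noted in point 1) are inserted into a trie, and at each step the decoder may only pick a token that extends some candidate. The word-level tokens, the greedy search, and the fake scoring function are simplifying assumptions; the repository itself applies trie-based beam search on top of its own generator, which this sketch does not reproduce.

```python
# Illustrative sketch (not the repo's code): constrain decoding to a candidate
# answer set with a token trie. Tokens are whole words here for simplicity;
# a real system would use the model's subword vocabulary and beam search.
from typing import Dict, List

END = "<eos>"  # marker stored in the trie where a candidate answer ends

def build_trie(candidates: List[List[str]]) -> Dict:
    trie: Dict = {}
    for tokens in candidates:
        node = trie
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie: Dict, prefix: List[str]) -> List[str]:
    """Tokens that can legally follow `prefix` given the candidate trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return list(node.keys())

# Candidate answers would be collected from the *training* set.
candidates = [["yes"], ["no"], ["left", "lung"], ["right", "lung"]]
trie = build_trie(candidates)

def fake_model_score(prefix: List[str], token: str) -> float:
    # Stand-in for the language model's next-token log-probability.
    return -float(len(token))

# Greedy constrained decoding: at each step pick the best-scoring token among
# those the trie allows, stopping when the end marker is reached.
prefix: List[str] = []
while True:
    options = allowed_next_tokens(trie, prefix)
    best = max(options, key=lambda t: fake_model_score(prefix, t))
    if best == END:
        break
    prefix.append(best)
print("decoded answer:", " ".join(prefix))
```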