jbdel / vilmedic

ViLMedic (Vision-and-Language medical research) is a modular framework for vision and language multimodal research in the medical field
MIT License

Does MVQA not use text information? #5

Closed GanjinZero closed 1 year ago

GanjinZero commented 2 years ago

It seems MVQA takes an image as input and performs label classification? Is the question information discarded?

jbdel commented 1 year ago

Hello,

Actually yes, the best performing models do not encode the questions. The MVQA task right now is a bit of a dummy task: most of the time the question is something like "what is the abnormality in the image?". There are some recent papers that try to create better datasets: https://openreview.net/pdf?id=uH_RlkvQMUs
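
To make the setup concrete, here is a minimal sketch of what an image-only MVQA model looks like: the question is accepted but never encoded, so the task reduces to plain label classification over the answer vocabulary. This is an illustrative example, not ViLMedic's actual implementation; the class name, the ResNet-50 backbone, and the answer vocabulary size are all assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageOnlyMVQA(nn.Module):
    """Hypothetical sketch: the answer is predicted from the image alone;
    the question is a parameter but is deliberately ignored."""

    def __init__(self, num_answers: int):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.encoder = backbone
        self.classifier = nn.Linear(2048, num_answers)

    def forward(self, image: torch.Tensor, question: str = None):
        # `question` is unused: MVQA collapses into multi-class
        # classification over a fixed set of answer labels.
        features = self.encoder(image)    # (B, 2048)
        return self.classifier(features)  # (B, num_answers)

# num_answers=100 is arbitrary, chosen for illustration only.
model = ImageOnlyMVQA(num_answers=100)
logits = model(torch.randn(2, 3, 224, 224),
               question="what is the abnormality in the image?")
```

Since the questions in current datasets are nearly constant, encoding them adds little signal, which is why this image-only baseline can match or beat models that use the text.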