lalithjets / Surgical_VQA

Surgical Visual Question Answering. A transformer-based surgical VQA model. Official implementation of "Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformers", MICCAI 2022.

position encoding #2

Closed by Flaick 1 year ago

Flaick commented 1 year ago

Hello, I am wondering whether you add any 2D positional encoding to the visual feature tokens. If not, is there a reason why? Thanks!

lalithjets commented 1 year ago

Hi Flaick,

We didn't explicitly use positional embeddings for the visual tokens; we used the default embeddings from the VisualBERT model. We are currently exploring the effect of positional embeddings.

For the default embeddings, you can look through the official Hugging Face VisualBERT code, lines 65-194.
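For reference, here is a minimal sketch of what a 2D positional encoding for visual tokens could look like. It is not code from this repository or from the Hugging Face implementation; it is a plain-Python illustration of the common fixed sine/cosine scheme, assuming the visual tokens come from a row-major grid of patches. The function names (`sincos_1d`, `sincos_2d`, `add_position`) and the grid size are hypothetical.

```python
import math


def sincos_1d(pos, dim):
    """Sinusoidal encoding of one integer position; dim must be even."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc


def sincos_2d(grid_h, grid_w, dim):
    """Return grid_h * grid_w encodings of length dim, in row-major order.

    The first half of each vector encodes the row index and the
    second half encodes the column index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        sincos_1d(row, half) + sincos_1d(col, half)
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def add_position(visual_tokens, pos_enc):
    """Element-wise sum of token features and their positional encodings."""
    return [
        [v + p for v, p in zip(token, enc)]
        for token, enc in zip(visual_tokens, pos_enc)
    ]


# Hypothetical example: a 2 x 3 grid of patches with 8-dimensional features.
tokens = [[0.0] * 8 for _ in range(2 * 3)]
tokens_with_pos = add_position(tokens, sincos_2d(2, 3, 8))
```

In a real model the encoding would be added to the projected visual features before they enter the transformer, and a learned embedding table could be used in place of the fixed sinusoids.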

Have a good day.