wangxw5 / wikiDiverse

Missing data #4

Open starreeze opened 1 year ago

starreeze commented 1 year ago

Hello, we're following your work, but unfortunately we've run into some difficulties: we found that the images and the entity2img file you provided do not cover all the mentions with 10 candidates. We don't know if we made some mistakes when using your datasets. I'd appreciate it if you could offer some help. Thanks in advance.

  1. Missing entity2img entries. For example, the first mention in test_w_10cands.json has the candidate "https://en.wikipedia.org/wiki/Superior_court_(Canada)", but searching wikipedia_entity2imgs.tsv for "Superior_court_(Canada)" returns nothing, even though the Wikipedia page actually contains many images. (A quick way to count such gaps is sketched after this list.)

  2. Corrupted images. Some images were not downloaded correctly and cannot be decoded.
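For the first issue, here is a minimal sketch of the coverage check we ran. The "candidates" field name and the TSV layout (entity title in the first tab-separated column) are assumptions about the released files; adjust to the actual schema:

```python
import json

# Assumed layout: one entity per line, title in the first tab-separated column.
with open("wikipedia_entity2imgs.tsv", encoding="utf-8") as f:
    covered = {line.rstrip("\n").split("\t", 1)[0] for line in f}

with open("test_w_10cands.json", encoding="utf-8") as f:
    mentions = json.load(f)

missing = set()
for mention in mentions:
    for url in mention["candidates"]:  # "candidates" is an assumed field name
        # Derive the entity title from the Wikipedia URL, e.g.
        # ".../wiki/Superior_court_(Canada)" -> "Superior_court_(Canada)"
        title = url.rsplit("/wiki/", 1)[-1]
        if title not in covered:
            missing.add(title)

print(f"{len(missing)} candidate entities lack entity2img entries")
```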

wangxw5 commented 1 year ago

Hi, thank you for your question.

For the first question: the entity2img mapping table does not contain every Wikipedia entity, for reasons such as network failures during crawling, but it does contain all the image information we were able to obtain. When no image can be found for an entity, an all-zero matrix is used as its image tensor in our work.
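In code terms, that fallback looks roughly like the sketch below; the tensor shape is only an illustrative placeholder, and `load_fn` stands in for whatever decode/resize/normalize pipeline is actually used:

```python
import numpy as np

IMG_SHAPE = (3, 224, 224)  # assumed input shape; the model's actual size may differ

def load_entity_image(entity, entity2img, load_fn):
    """Return the image tensor for `entity`, or an all-zero tensor
    when no image is available (the fallback described above)."""
    path = entity2img.get(entity)
    if path is None:
        return np.zeros(IMG_SHAPE, dtype=np.float32)
    return load_fn(path)  # user-supplied loader
```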

For the second question, your findings are correct. I ran statistical and manual checks on the availability of the WikiNews images. I no longer remember the exact ratio, since it has been a long time, but the fraction of unreadable images on the mention side is very small, and manual verification confirmed that they are genuinely impossible to obtain (e.g., see https://en.wikinews.org/wiki/Huge_Gay_Pride_parade_held_in_Brazil).
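A check along these lines can reproduce that statistic; the directory name is a placeholder:

```python
from pathlib import Path
from PIL import Image

def find_unreadable(img_dir):
    """Yield image files that cannot be fully decoded (truncated/corrupted downloads)."""
    for path in Path(img_dir).iterdir():
        if not path.is_file():
            continue
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode; Image.open alone is lazy
        except OSError:
            yield path

bad = list(find_unreadable("wikinews_imgs"))  # placeholder directory name
print(f"{len(bad)} unreadable images")
```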

starreeze commented 1 year ago

Thank you very much for your reply.

Yes, you're right. According to our own statistics, fewer than 10 images are unreadable. As for the first problem, we'll try to work around it.

Thanks again for your contribution.

starreeze commented 1 year ago

Hi, I have another question. I read through your paper and found that in the Entity Disambiguation stage, the probability P(e_i|m) calculated in the first stage is used. However, train_w_10cands.json does not include this probability, which makes it hard for us to reproduce your results. Could you release P(e_i|m) for each candidate, or provide the code to calculate it? Thanks in advance.

wangxw5 commented 1 year ago

Hi, thank you again for your question. We will release the raw Wikipedia data in the near future, which can be used to calculate this statistic (because of its huge scale, it is currently being uploaded in shards). However, because we used internal distributed-database tools for the calculation, we cannot share that part of the code for the time being. We will also work toward open-sourcing the pre-calculated offline P(e|m) data.
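Until then, the standard anchor-count prior from the entity-linking literature can be computed from the raw dump. A minimal sketch follows; the function name and the (mention, entity) pair format are assumptions, it may not match our internal pipeline exactly, and extracting the anchor pairs from the dump is the heavy part we did with internal tools:

```python
from collections import Counter, defaultdict

def build_pem(anchor_pairs):
    """Estimate P(e|m) as count(m -> e) / count(m) over Wikipedia anchor links.
    `anchor_pairs` is an iterable of (mention_surface, entity_title) tuples."""
    counts = defaultdict(Counter)
    for mention, entity in anchor_pairs:
        counts[mention.lower()][entity] += 1
    pem = {}
    for mention, entity_counts in counts.items():
        total = sum(entity_counts.values())
        pem[mention] = {e: c / total for e, c in entity_counts.items()}
    return pem

# Toy example: two anchors of "superior court" split across two entities
pem = build_pem([("superior court", "Superior_court_(Canada)"),
                 ("superior court", "Superior_court")])
print(pem["superior court"])  # {'Superior_court_(Canada)': 0.5, 'Superior_court': 0.5}
```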

starreeze commented 1 year ago

Thanks a lot for your reply. One last question (hopefully). Reading your paper, I don't quite understand how the metrics such as F1, precision, and recall are calculated for this task, e.g., in the results table you report in the appendix. Do you calculate these metrics on the top-1 prediction? Does R@5 mean top-5 accuracy? If not, how are they calculated? I'd appreciate it if you could offer some help.

wangxw5 commented 1 year ago

Sorry for the late reply. "Do you calculate these metrics on top-1? R@5 means top-5 accuracy?" → Yes, they are calculated on the top-1 prediction, and R@5 means the top-5 recall.
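Concretely, that evaluation convention can be sketched as follows. The abstention handling (an empty candidate list counts as no prediction, which is what makes precision and recall differ) is an assumption about edge cases, not something stated above:

```python
def top1_metrics(ranked, gold, k=5):
    """ranked: {mention_id: [candidates, best first]}; gold: {mention_id: entity}.
    Precision/recall/F1 are computed on the top-1 prediction; R@k checks
    whether the gold entity appears among the top-k candidates."""
    tp = n_pred = hits_at_k = 0
    for mid, entity in gold.items():
        cands = ranked.get(mid, [])
        if cands:  # assumed: an empty candidate list means no prediction
            n_pred += 1
            if cands[0] == entity:
                tp += 1
        if entity in cands[:k]:
            hits_at_k += 1
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, hits_at_k / len(gold)
```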