openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

YFCC100M Caption Preprocessing #103

normster commented 3 years ago

Hi,

I'm trying to reproduce the YFCC100M results and would like to know how image captions were preprocessed during training. For instance, how was the caption for the following sample extracted?

Title: denise%27s+peanut+chicken

Description: recipe+here%3A+%3Ca+href%3D%22http%3A%2F%2Fallrecipes.com%2FRecipe%2FDenises-Peanut-Chicken%2FDetail.aspx%22%3Eallrecipes.com%2FRecipe%2FDenises-Peanut-Chicken%2FDetail.aspx%3C%2Fa%3E%0Ai+added+1%2F2+teaspoon+of+sriracha+hot+chili+sauce+and+used+1+TBS+of+chunky+peanut+butter+in+place+of+the+2+cups+of+peanuts.

Best, Norman

naveenkumarmarri commented 3 years ago

You can decode it as follows:

import urllib.parse

title = "denise%27s+peanut+chicken"
# unquote_plus undoes percent-encoding and turns '+' back into spaces
print(urllib.parse.unquote_plus(title))
# prints: denise's peanut chicken
TonyLianLong commented 3 years ago

@jongwook In addition to the exact parsing method, could you tell us how a title is merged with the corresponding description text?
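
This question is not answered in the thread. One plausible approach, shown purely as an assumption, is to decode both fields and join the non-empty ones with a space; the helper names below are invented for illustration and nothing here is confirmed as what CLIP actually did:

import re
import urllib.parse

def decode_field(raw: str) -> str:
    # Undo YFCC100M's percent/plus encoding, then strip HTML tags.
    text = urllib.parse.unquote_plus(raw)
    return re.sub(r"<[^>]+>", " ", text).strip()

def make_caption(title: str, description: str) -> str:
    # Assumed merge rule: decode whichever fields are non-empty
    # and join them with a single space.
    parts = [decode_field(f) for f in (title, description) if f]
    return " ".join(p for p in parts if p)

print(make_caption("denise%27s+peanut+chicken", ""))
# prints: denise's peanut chicken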