openimages / dataset

The Open Images dataset
https://storage.googleapis.com/openimages/web/index.html
Apache License 2.0

Distinct labels with same description in dict.csv #31

Open chcomin opened 7 years ago

chcomin commented 7 years ago

Hi, I noticed that some labels have the same description in the file dict.csv. Is that expected? Should these cases be treated as distinct entities or is it better to merge them into a single label?

The list of repeated descriptions is:

/m/018w8, /m/0frqg3, basketball
/m/0449p, /m/0h5wslk, jaguar
/m/07ptj3n, /m/08gqpm, cup
/m/03_mn6, /m/0j67plv, lotus
/m/02g0fy, /m/0f5mx3, fiat 500
/m/019v__, /m/0j6n39d, tvr
/g/121sl9wl, /m/02jcyv, jaguar s-type
/m/01tv9, /m/08p92x, cream
/m/02m57w, /m/0ds4250, groom
/m/080cpdy, /m/0fq0fnf, plank
/m/01l849, /m/025rs2z, gold
/m/04bct1, /m/0dzdr, chest
/m/01txr2, /m/01ty0n, spring
/m/05b1swx, /m/0633h, polar bear
/m/03c_kl, /m/0l_yv, snowshoe
/m/028ygt, /m/02jwq3, punch
/m/01d380, /m/02hhhb, drill
/m/015zzv, /m/03bxt6z, runway
/m/018xm, /m/0dpm1v, ball
/m/0jqjp, /m/0lxkm, iris
/m/0319l, /m/04lmyz, horn
/m/025rw19, /m/03cld36, iron
/m/07bg4p, /m/095_n, heart
/m/017cc, /m/04n0b__, brain
/m/091410, /m/09141t, collar
/m/01z9v6, /m/054fyh, pitcher
/m/01443y, /m/0cphhk, headgear
/m/02tcwp, /m/031vtq, /m/03hqlh, trunk
/m/0879r3, /m/09gys, squid
/m/02g387, /m/033cnk, egg
/m/011_f4, /m/0d8lm, string instrument
/m/01fpbm, /m/04c38s, daisy
/m/01jh3, /m/0h5wwjv, subaru
/m/02g7g2, /m/0cjs7, asparagus
/m/04mtl, /m/0h5x4j3, lamborghini
/m/09xp, /m/09xqv, cricket
/m/03xr7y, /m/04gth, lavender
/m/0266skk, /m/0m775, tilapia
/m/01qk4t, /m/0l14v3, conch
/m/020lf, /m/04rmv, mouse
/m/05h2v35, /m/0by3w, jumping
/m/06wrt, /m/0cc6_9k, sailing
/m/03qsdpk, /m/05npqn, theatre
/m/02519, /m/07s6bqg, cable car
/m/031n9j, /m/04tdh, marble
/m/02qsq1, /m/03bx3wh, corn on the cob
/m/01cjsf, /m/02zt3, kite
/m/013y0j, /m/013y1f, organ
/m/06g1w2, /m/0hwky, pattern
/m/0cyhj, /m/0jc_p, orange
/m/0gqbt, /m/0j3gthp, shrub
/m/06ff5p, /m/0b209p, rolls-royce corniche
/m/0fsg8, /m/0m150, harrier
/m/01lbxg, /m/039hvj, nut
/m/026y54h, /m/02823g9, /m/0bzfym, alfa romeo giulietta
/m/04f6rz, /m/0fgkh, turquoise
/m/027y004, /m/0cqdf, sponge
/m/01226z, /m/02vx4, football
/m/01m0p1, /m/0jwr9, cardinal
/m/07_l0f, /m/0gzznm, powder
/m/03clckp, /m/083vt, wood
/m/04d01f, /m/0pbc, amber
/m/07pbfj, /m/0ch_cf, fish
/m/0gccln, /m/0gccmf, ford model a
/m/01b7b, /m/027k49j, bishop
/m/0151b0, /m/07jx7, triangle
/m/01brf, /m/04_10ss, bronze
/m/014sg5, /m/07_l6, viola
/m/08g_yr, /m/0cx45, temple
/m/01c43w, /m/01v50j, crane
/m/03r18y, /m/0dj6p, peach
/m/03wfhdl, /m/0y8r, armored car
/m/04ffcj, /m/0k354, lilac
/m/06s7q8, /m/0k2jq, sabre
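For anyone who wants to reproduce a list like the one above, here is a minimal sketch. It assumes dict.csv is a headerless CSV whose first two columns are the label id and its description; that layout is an assumption, and the function name `find_repeated_descriptions` is made up for illustration. The sample rows are taken from the list above rather than read from the real file.

```python
import csv
import io
from collections import defaultdict


def find_repeated_descriptions(csv_file):
    """Return {description: [label_id, ...]} for descriptions shared by 2+ ids."""
    ids_by_description = defaultdict(list)
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        label_id, description = row[0].strip(), row[1].strip().lower()
        ids_by_description[description].append(label_id)
    # Keep only the descriptions that more than one id maps to.
    return {d: sorted(ids) for d, ids in ids_by_description.items() if len(ids) > 1}


# Stand-in for open("dict.csv", newline=""); assumed layout: label_id,description
sample = io.StringIO(
    "/m/020lf,mouse\n"
    "/m/04rmv,mouse\n"
    "/m/026y54h,alfa romeo giulietta\n"
    "/m/02823g9,alfa romeo giulietta\n"
    "/m/0bzfym,alfa romeo giulietta\n"
    "/m/0449p,jaguar\n"
)
repeated = find_repeated_descriptions(sample)
for description, ids in sorted(repeated.items()):
    print(", ".join(ids) + ", " + description)
```

Note that the descriptions are lower-cased before grouping, so descriptions that differ only in case would also be reported as collisions; drop the `.lower()` call if exact matches are wanted.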

chcomin commented 7 years ago

Actually, it seems that some of the descriptions (e.g., 'basketball') are indeed related to distinct concepts while others (e.g., 'alfa romeo giulietta') seem to describe the same thing.

rkrasin commented 7 years ago

Hi @chcomin,

thank you for the find. You're correct, there are two classes of errors:

  1. Different entities, same description. For example, /m/020lf, /m/04rmv, mouse: in one case it's a computer mouse, and in the other it's an animal.

    For these kinds of collisions, I would propose making the descriptions more verbose, e.g. "mouse" -> "computer mouse" and "mouse" -> "mouse (animal)". Feel free to make a pull request, and I will try to advocate for its acceptance.

  2. Same real entities, same description, different ids (like "alfa romeo giulietta"). A short-term fix would be to modify the labels so that these entities also have the same images attached. Eventually a winner should be chosen, but I don't have enough information to give informed advice here.
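The two fixes described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the `img_a`/`img_b` image ids, and the in-memory dict layout are hypothetical, and which "mouse" id is the device versus the animal would need to be confirmed against the images.

```python
def apply_description_overrides(descriptions, overrides):
    """Case 1: give colliding ids more verbose, distinct descriptions."""
    return {label_id: overrides.get(label_id, text)
            for label_id, text in descriptions.items()}


def share_images_across_aliases(image_labels, alias_groups):
    """Case 2 (short term): an image tagged with any id in a group gets all of them."""
    expanded = {}
    for image_id, labels in image_labels.items():
        merged = set(labels)
        for group in alias_groups:
            if merged & group:  # image carries at least one alias
                merged |= group
        expanded[image_id] = merged
    return expanded


descriptions = {"/m/020lf": "mouse", "/m/04rmv": "mouse"}
# Assumption: this id-to-meaning assignment is illustrative and must be
# verified by looking at the images attached to each id.
overrides = {"/m/020lf": "computer mouse", "/m/04rmv": "mouse (animal)"}
renamed = apply_description_overrides(descriptions, overrides)

# The three ids listed for "alfa romeo giulietta" in this issue.
alias_groups = [{"/m/026y54h", "/m/02823g9", "/m/0bzfym"}]
# Hypothetical image ids, for illustration only.
image_labels = {"img_a": {"/m/026y54h"}, "img_b": {"/m/04rmv"}}
shared = share_images_across_aliases(image_labels, alias_groups)
```

Keeping all alias ids on each image, rather than deleting any, leaves the later choice of a "winner" id open.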

SlipknotTN commented 7 years ago

Checking the test images here http://openimages.oldjpg.com/, I see that sometimes the duplicate classes are actually the same (e.g. egg) but other times they are not. For example "mouse", as already said, but also "fish" (one is the animal and the other is food).

Please notice that the 3 "alfa romeo giulietta" are:

So resolving all the duplicates would be useful work, but we have to check every class; a simple merge could be wrong.

rkrasin commented 7 years ago

I agree. Let me check with the Google team whether it's a good time to do this.