unsplash / datasets

🎁 5,400,000+ Unsplash images made available for research and machine learning
https://unsplash.com/data

Explanation on the ai_service_2_confidence column in keywords.tsv000 (range seems weird) #39

Closed jeanmidevacc closed 3 years ago

jeanmidevacc commented 3 years ago

Describe the bug

Hello,

I was looking at the data from the Lite dataset this morning and noticed something odd in the 'ai_service_2_confidence' column of the keywords.tsv000 file.

When I computed some statistics on the ai_service columns, 'ai_service_2_confidence' turned out to contain extreme values above 100, which I would expect to be the maximum (taking 'ai_service_1_confidence' as the reference, for example).


To Reproduce

Here is the code to build the stats:

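import pandas as pd

# Load the keywords table (tab-separated, header in the first row)
dfp_keywords_raw = pd.read_csv('keywords.tsv000', sep='\t', header=0)
# Summary statistics for both confidence columns
print(dfp_keywords_raw[['ai_service_1_confidence', 'ai_service_2_confidence']].describe())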

Steps to reproduce the behavior: run the code above in a Python environment (3.6.13) with pandas 1.1.5 installed.

Expected behavior

I expect the values in the 'ai_service_2_confidence' column of the keywords.tsv000 file to be between 0 and 100. If that is not the case, the documentation should describe the column more precisely (its range, for example).

Additional context

I have a list of the keywords that seem to be affected by these extreme values: unsplash_extreme_value.zip
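
For anyone who wants to build a similar list, here is a minimal sketch. It uses a small made-up frame in place of keywords.tsv000: the `keyword` column name, the sample values, and the helper `find_out_of_range` are assumptions for illustration. Only the two confidence column names come from the file discussed here.

```python
import pandas as pd

# Toy stand-in for keywords.tsv000; all values are invented for illustration.
sample = pd.DataFrame({
    'keyword': ['dog', 'beach', 'tree', 'car'],
    'ai_service_1_confidence': [98.2, 45.0, None, 77.1],
    'ai_service_2_confidence': [9657.65, 88.3, 12.5, 4321.0],
})

def find_out_of_range(df, column='ai_service_2_confidence', upper=100):
    """Return the rows whose value in `column` is strictly greater than `upper`."""
    return df[df[column] > upper]

# Hypothetical helper applied to the toy data
extreme = find_out_of_range(sample)
print(extreme[['keyword', 'ai_service_2_confidence']])
```

With the real file, `sample` would be replaced by the frame loaded with `pd.read_csv('keywords.tsv000', sep='\t', header=0)`.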

I hope this helps with your investigation 🕵️‍♀️ (and I hope it is not just me missing something).

PS: your dataset is great, by the way (I really hope to have access to the full version soon) 👍

TimmyCarbone commented 3 years ago

@jeanmidevacc I've looked into it and it looks like you can divide the values that are > 100 by 100. For example, if you see confidence = 9657.65, the actual confidence in a range 0-100 is 96.5765.
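
Until the fixed release is out, the workaround described above could be applied locally along these lines. This is only a sketch under that description, not the fix shipped in the dataset; the helper name `rescale_confidence` and the sample values are hypothetical.

```python
import pandas as pd

def rescale_confidence(series, upper=100):
    """Divide values strictly above `upper` by 100; leave the rest (and NaN) unchanged."""
    # Series.mask replaces entries where the condition is True.
    return series.mask(series > upper, series / 100)

# Made-up values, including the 9657.65 example from the comment above
raw = pd.Series([9657.65, 88.3, None, 100.0])
fixed = rescale_confidence(raw)
print(fixed)
```

On the real data this would be used as `df['ai_service_2_confidence'] = rescale_confidence(df['ai_service_2_confidence'])`. Values at or below 100 are left untouched, so the sketch assumes that only values above 100 were affected.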

This is obviously an issue in the dataset and I'm adding this fix to the next release that's coming up this week.

Thank you for catching it and for describing the issue the way you did!

jeanmidevacc commented 3 years ago

Great, thanks for the update (and for handling the issue so quickly).