How was the LITE dataset sampled?

unsplash / datasets

🎁 5,400,000+ Unsplash images made available for research and machine learning

https://unsplash.com/data


How was the LITE dataset sampled? #55

Closed cpeukert closed 9 months ago

cpeukert commented 9 months ago

First off, thanks a lot for making the data available - it's a tremendous service to the research community!

@TimmyCarbone, I have a question regarding the relationship between LITE and FULL. From what I understand, the LITE dataset is a subset of the FULL dataset. How were the 25k images in the first release of the LITE dataset selected? And how did you select the images that were added to replace removed images in subsequent releases? Thanks!

TimmyCarbone commented 9 months ago

Hi @cpeukert.

Really glad to see it being used and worked with! 🙏

The Lite Dataset is a subset of the Full Dataset that contains 25k curated photos (photos that were manually reviewed and published on the Unsplash homepage at some point in the past), mainly focused on nature content.

For each release of the dataset, we drop the photos in the Lite Dataset that have been removed from Unsplash and replace them with new ones. Older releases of the dataset won't be edited, so they are more likely to contain removed photos.
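As a rough illustration of that refresh step, here is a minimal sketch, not Unsplash's actual code. The function name `refresh_lite` and its arguments are hypothetical, and drawing the replacements at random from the eligible pool is an assumption; the thread does not say how replacements are picked.

```python
import random


def refresh_lite(previous_ids, removed_ids, eligible_ids, seed=None):
    """Return a new Lite ID list: keep surviving photos, top up with new ones.

    previous_ids: list of photo IDs in the previous Lite release (hypothetical input)
    removed_ids:  IDs of photos that have since been removed
    eligible_ids: IDs of all photos currently eligible for Lite
    """
    removed = set(removed_ids)
    # Keep every photo from the previous release that has not been removed.
    kept = [pid for pid in previous_ids if pid not in removed]

    # Candidates are eligible photos that are neither removed nor already kept.
    # Sorting makes the draw reproducible for a given seed.
    candidates = sorted(set(eligible_ids) - removed - set(kept))

    missing = len(previous_ids) - len(kept)
    if missing > len(candidates):
        raise ValueError("not enough eligible photos to restore the original size")

    # ASSUMPTION: replacements are drawn uniformly at random.
    rng = random.Random(seed)
    return kept + rng.sample(candidates, missing)
```

Under these assumptions the release keeps its size, and older releases stay untouched because the function returns a new list instead of modifying `previous_ids`.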

A new release should be coming by the end of March.

Hope this answers your questions!

cpeukert commented 9 months ago

Thanks a lot! This is very helpful.

Just a follow-up question. I understand that there are about 290k curated photos in the Full dataset. Can you tell me specifically how you chose the 25k curated photos to be put into the Lite dataset? For example, did you first filter on all curated photos, then on nature content (how exactly? based on keywords?), and did you then take a random set of 25k? I'm asking because I want to study the performance differences of ML algorithms trained on Lite versus Full, and whether any differences can be explained by fundamental differences in the training data (Lite vs Full).

TimmyCarbone commented 9 months ago

It's a random set of 25k curated nature photos (selected by keyword; I believe the nature keyword has to be present). Both filters (curated and nature) are applied at the same time.
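For anyone who wants to mimic this selection on the Full dataset, here is a minimal sketch of the procedure as described, not the code Unsplash runs. The record fields `curated` and `keywords`, the function name `sample_lite`, and the exact keyword match on `"nature"` are all assumptions made for illustration.

```python
import random


def sample_lite(photos, k=25_000, seed=None):
    """Draw k photos at random from those that are curated AND tagged 'nature'.

    photos: iterable of dicts with hypothetical keys
            'photo_id' (str), 'curated' (bool), 'keywords' (set of str)
    """
    # Both filters are applied in a single pass, matching the description
    # that "curated" and "nature" are checked at the same time.
    eligible = [
        p for p in photos
        if p["curated"] and "nature" in p["keywords"]
    ]
    if len(eligible) < k:
        raise ValueError(f"only {len(eligible)} eligible photos, need {k}")

    # Uniform sampling without replacement; a fixed seed makes it repeatable.
    rng = random.Random(seed)
    return rng.sample(eligible, k)
```

Sampling with a fixed `seed` makes it possible to redraw the same hypothetical subset when comparing models trained on a Lite-like sample against models trained on the Full dataset.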

cpeukert commented 9 months ago

Thanks!