hendrycks / emodiversity

Wellbeing and Emotion Prediction (NeurIPS 2022)
MIT License

Create a zip or tar of the dataset #1

Closed xksteven closed 2 years ago

xksteven commented 2 years ago

We need to zip the current version of the dataset and find some place to host it.

The data is currently stored on rainbowquartz. Specifically, this CSV lists all of the training data: /data/erictang000/video_wellbeing_beta/emotions_data/train_60k_filtered.csv

And this CSV lists all of the test data: /data/erictang000/video_wellbeing_beta/emotions_data/test_60k_filtered.csv

xksteven commented 2 years ago

The labels are stored here: /data/erictang/emotions/data_downsampled_256/results.csv

xksteven commented 2 years ago

Also, double-check that the labels still make sense. I'm asking because you might have changed the numbers.

Here's how they were being preprocessed:

    import csv

    # label_csv_path is assumed to point at the results.csv mentioned above
    labels = {}
    with open(label_csv_path, newline="") as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # skip the header row
        for row in reader:
            # columns 2..28 hold the ratings; divide by 10 to convert to probability
            label = [float(x) / 10 for x in row[2:29]]
            labels[int(row[1])] = label  # index by the number in the video name
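
One way to sanity-check the resulting `labels` dict, as a minimal sketch. It assumes the raw ratings lie between 0 and 10 (so the scaled values should fall in [0, 1]) and that each row carries 27 values, matching `row[2:29]`. The helper name `check_labels` and its parameters are hypothetical, not something from the repo.

```python
def check_labels(labels, num_values=27, low=0.0, high=1.0):
    """Return the video IDs whose label vector looks malformed."""
    bad = []
    for video_id, label in labels.items():
        wrong_length = len(label) != num_values
        # `not (low <= x <= high)` also flags NaN values.
        out_of_range = any(not (low <= x <= high) for x in label)
        if wrong_length or out_of_range:
            bad.append(video_id)
    return sorted(bad)


# Toy data standing in for the real `labels` dict.
toy = {
    1: [0.5] * 27,           # fine
    2: [0.5] * 26,           # one value missing
    3: [1.2] + [0.0] * 26,   # a raw rating of 12 would land outside [0, 1]
}
print(check_labels(toy))  # [2, 3]
```

If the numbers were changed (for example, a different rating scale), the out-of-range list should make that obvious quickly.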
xksteven commented 2 years ago

The labels for V2V are located here:

/data/erictang000/video_wellbeing_beta/wellbeing_data/listwise2_20k.csv

It would probably be good to just duplicate the videos and have a separate train/test split for the V2V dataset as well. My thinking is that some users will want to test on only one dataset or the other.

For users who want to download both datasets together, we can just provide a separate link to the V2V labels.

Feel free to make suggestions on how you think it could be best organized.

JunShern commented 2 years ago

Ok here's my plan for the two datasets:

VCE

vce_dataset/
├── metadata.json
├── train_labels.json
├── test_labels.json
├── train/00000.mp4
├── train/00001.mp4
...
├── train/49999.mp4
├── test/50000.mp4
├── test/50001.mp4
...

where

V2V

v2v_dataset/
├── metadata.json
├── train_labels.json
├── test_labels.json
├── train/00000.mp4
├── train/00001.mp4
...
├── train/49999.mp4
├── test/50000.mp4
├── test/50001.mp4
...

where everything has the same form as the VCE dataset, except that

xksteven commented 2 years ago

As a note, we'll want to recollect the numbers for the paper, such as how many examples are in each dataset, how many example pairs, etc.

xksteven commented 2 years ago

Possibilities for test set:

/data/erictang000/video_wellbeing_beta/wellbeing_data/harder_test2.csv or /data/erictang000/video_wellbeing_beta/wellbeing_data/harder_test3.csv

The train set seems to be:

/data/erictang000/video_wellbeing_beta/wellbeing_data/harder_train2.csv

This seems to be the listwise path:

/data/erictang000/video_wellbeing_beta/wellbeing_data/listwise2_20k.csv

It would be good to double-check that no test videos overlap with either the train or listwise sets.
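
A minimal sketch of such an overlap check. It assumes the video IDs for each split have already been read out of the CSVs into plain collections; the column layout of those files isn't described here, so the loading step is left out. `find_overlaps` is a hypothetical helper name.

```python
from itertools import combinations


def find_overlaps(splits):
    """Map each pair of split names to the video IDs they share."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy IDs standing in for the real train/test/listwise video IDs.
toy_splits = {
    "train": [1, 2, 3],
    "test": [3, 4],
    "listwise": [5, 6],
}
print(find_overlaps(toy_splits))  # {('train', 'test'): {3}}
```

An empty result means the splits are pairwise disjoint.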

xksteven commented 2 years ago

Eric responded:

"looks like it was harder_train4 and harder_test4"

Info about listwise:

"The videos in the listwise dataset don't appear in either the train or test it's like a separate test set since we don't train on listwise comparisons."

JunShern commented 2 years ago

I can't find harder_train4 and harder_test4 in the directory though; it only goes up to harder_train3 and harder_test3.

JunShern commented 2 years ago

Alright so I've got both the VCE and V2V datasets ready and I've also checked a bunch of labels manually to make sure the exports are right.

VCE has

train: 50000 labelled videos
test: 11046 labelled videos
metadata: 61406 items

Total videos: 61046

V2V has

train: 11115 comparisons on 16279 unique videos
test: 3112 comparisons on 6069 unique videos
listwise: 1758 comparisons on 4322 unique videos
{train}, {test}, {listwise} videos are mutually exclusive.

Total videos: 26670
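
For reference, counts like these can be reproduced with something along the following lines. This is a sketch only: it assumes each comparison is available as a tuple of video IDs, which may not match how the CSVs are actually laid out, and `split_stats` is a hypothetical name. The last check just confirms that the per-split unique-video counts above add up to the reported total, which is expected when the splits are mutually exclusive.

```python
def split_stats(comparisons):
    """Return (number of comparisons, number of unique videos) for one split."""
    videos = set()
    for comparison in comparisons:
        videos.update(comparison)
    return len(comparisons), len(videos)


# Toy comparisons: two pairs and one longer (listwise-style) comparison.
toy_comparisons = [(1, 2), (2, 3), (4, 5, 6)]
print(split_stats(toy_comparisons))  # (3, 6)

# Unique-video counts reported above; disjoint splits should sum to the total.
reported_unique = {"train": 16279, "test": 6069, "listwise": 4322}
total_videos = sum(reported_unique.values())
print(total_videos)  # 26670
```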

Two things to note:

xksteven commented 2 years ago

harder_train4.csv harder_test4.csv

JunShern commented 2 years ago

Cool! Updated V2V to use these two new files:

train: 11038 comparisons on 16125 unique videos
test: 3189 comparisons on 6223 unique videos
listwise: 1758 comparisons on 4322 unique videos

Total videos: 26670

JunShern commented 2 years ago

Both VCE and V2V datasets have been uploaded here (by Dan): https://drive.google.com/drive/folders/1sRKitbXpLZ4pwXTONjiA-X0Y1z4I2o4X

(Documentation of those datasets is already in our README https://github.com/hendrycks/emodiversity#readme)

Marking this issue as closed.