xksteven commented 2 years ago

We need to zip the current version of the dataset and find some place to host it.

The data is currently stored on rainbowquartz. Specifically, this CSV lists all of the training data: /data/erictang000/video_wellbeing_beta/emotions_data/train_60k_filtered.csv

And this CSV lists all of the test data: /data/erictang000/video_wellbeing_beta/emotions_data/test_60k_filtered.csv

xksteven commented 2 years ago

The labels are stored here: /data/erictang/emotions/data_downsampled_256/results.csv

xksteven commented 2 years ago

Also double-check that the labels still make sense. I'm asking because you might have changed the numbers.

Here's how they were being preprocessed:

    import csv

    # label_csv_path is the path to the labels CSV (results.csv above)
    labels = {}
    with open(label_csv_path, newline="") as csvfile:
        reader = csv.reader(csvfile)
        for i, row in enumerate(reader):
            if i == 0:
                continue  # skip first row
            # columns 2..28 hold the 27 emotion scores
            label = [float(x) / 10 for x in row[2:29]]  # divide by 10 to convert to probability
            labels[int(row[1])] = label  # index by number in video name
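
One way to double-check the preprocessed labels is sketched below. It assumes that `labels` has the shape built above (video number mapped to a list of 27 floats) and that, after dividing by 10, every value should fall in [0, 1]. The helper name `find_suspect_labels` is hypothetical.

```python
def find_suspect_labels(labels, num_emotions=27):
    """Return the video ids whose label vector has the wrong length
    or contains a value outside [0, 1]."""
    suspect = []
    for video_id, label in labels.items():
        wrong_length = len(label) != num_emotions
        out_of_range = any(not (0.0 <= value <= 1.0) for value in label)
        if wrong_length or out_of_range:
            suspect.append(video_id)
    return suspect


# Tiny hand-made example (not real data): id 2 contains a value above 1.0.
example = {1: [0.0] * 27, 2: [0.5] * 26 + [1.2]}
print(find_suspect_labels(example))
```

An empty result on the real file would suggest the raw scores are still on a 0-10 scale; anything else is worth inspecting by hand.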

xksteven commented 2 years ago

The labels for V2V are located here:

/data/erictang000/video_wellbeing_beta/wellbeing_data/listwise2_20k.csv

It would probably be good to just duplicate the videos and have a separate train/test split for the V2V dataset as well. My thinking is that some users will want to test on only one dataset or the other.

For users who want to download both datasets together, we can just provide a separate link to the V2V labels.

Feel free to make suggestions on how you think it could be best organized.

JunShern commented 2 years ago

Ok here's my plan for the two datasets:

VCE

vce_dataset/
├── metadata.json
├── train_labels.json
├── test_labels.json
├── train/00000.mp4
├── train/00001.mp4
...
├── train/49999.mp4
├── test/50000.mp4
├── test/50001.mp4
...

where

train/ and test/ directories contain the MP4 video files

train_labels.json and test_labels.json contain labels for their respective videos like

{
"00001": {
    "emotions": {
        "Admiration": 1.4166666666666667, # This is the average "intensity" score (rated 1-10) given by annotators who selected this emotion
        "Adoration": 0.0,
        "Aesthetic Appreciation": 5.583333333333333,
        "Amusement": 1.4166666666666667,
        "Anger": 0.0,
        "Anxiety": 0.0,
        "Awe (or Wonder)": 1.1666666666666667,
        "Awkwardness": 0.0,
        "Boredom": 0.0,
        "Calmness": 0.0,
        "Confusion": 0.0,
        "Craving": 0.8333333333333334,
        "Disgust": 0.0,
        "Empathic Pain": 0.0,
        "Entrancement": 0.0,
        "Excitement": 0.0,
        "Fear": 0.0,
        "Horror": 0.0,
        "Interest": 0.8333333333333334,
        "Joy": 0.0,
        "Nostalgia": 0.0,
        "Relief": 0.0,
        "Romance": 0.0,
        "Sadness": 0.0,
        "Satisfaction": 1.5,
        "Sexual Desire": 0.0,
        "Surprise": 0.0
    },
    "topK": [
        "Aesthetic Appreciation",
        "Satisfaction",
        "Admiration"
    ]
},
"00002": {
...
}
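
As a rough sketch of how the `topK` field could relate to the `emotions` scores: the example entry above is consistent with taking the three highest-scoring emotions and breaking ties by their order in the dictionary. That rule is an assumption inferred from this one example, not a confirmed description of the export, and `top_k_emotions` is a hypothetical helper.

```python
def top_k_emotions(emotions, k=3):
    """Return the k emotion names with the highest scores.

    sorted() is stable, so emotions with equal scores keep the order
    in which they appear in the dictionary.
    """
    ranked = sorted(emotions, key=lambda name: -emotions[name])
    return ranked[:k]


# Non-zero scores copied from the "00001" example entry above.
emotions = {
    "Admiration": 1.4166666666666667,
    "Aesthetic Appreciation": 5.583333333333333,
    "Amusement": 1.4166666666666667,
    "Awe (or Wonder)": 1.1666666666666667,
    "Craving": 0.8333333333333334,
    "Interest": 0.8333333333333334,
    "Satisfaction": 1.5,
}
print(top_k_emotions(emotions))
```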

metadata.json contains metadata for all videos, e.g.

{
"00000": {
    "codec_name": "h264",
    "duration": "12.000000",
    "file": "train/00000.mp4",
    "frame_rate": "30/1",
    "height": 320,
    "number_of_frames": "360",
    "width": 256
},
"00001": {
...
}
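
The metadata fields can be cross-checked against each other. The sketch below assumes that `frame_rate` is a "numerator/denominator" string and that `duration` is in seconds, both of which match the example entry but are not otherwise confirmed. `expected_duration` is a hypothetical helper.

```python
from fractions import Fraction


def expected_duration(entry):
    """Duration in seconds implied by the frame count and frame rate."""
    frame_rate = Fraction(entry["frame_rate"])  # "30/1" -> 30 frames per second
    return float(int(entry["number_of_frames"]) / frame_rate)


# Values copied from the "00000" example entry above.
entry = {
    "codec_name": "h264",
    "duration": "12.000000",
    "file": "train/00000.mp4",
    "frame_rate": "30/1",
    "height": 320,
    "number_of_frames": "360",
    "width": 256,
}
is_consistent = abs(expected_duration(entry) - float(entry["duration"])) < 1e-6
print(expected_duration(entry), is_consistent)
```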

V2V

v2v_dataset/
├── metadata.json
├── train_labels.json
├── test_labels.json
├── train/00000.mp4
├── train/00001.mp4
...
├── train/49999.mp4
├── test/50000.mp4
├── test/50001.mp4
...

where everything has the same form as the VCE dataset, except that

train_labels.json and test_labels.json contain a list of preference-ordered comparisons (most-preferred to least-preferred):

{
"comparisons": [
    ["08711", "00842", "22249"],
    ["25894", "58217", "22029", "22249"],
    ["53147", "02989", "11888"],
    ["32206", "06875", "61492"],
    ["26382", "31415", "25377", "07105"],
    ...
]
}

And train/ and test/ only contain a subset of the videos from VCE that have V2V labels.
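
For counting example pairs, each ordered comparison can be expanded into pairwise preferences. This is only a sketch: it assumes the list ordering described above (most-preferred first), and the `most_preferred_first` flag is a hypothetical switch in case the stored order turns out to be the reverse.

```python
from itertools import combinations


def to_pairs(comparison, most_preferred_first=True):
    """Expand one ordered comparison into (preferred, less_preferred) pairs."""
    ordered = list(comparison) if most_preferred_first else list(reversed(comparison))
    # combinations() keeps the input order, so the first element of each
    # pair always ranks above the second.
    return list(combinations(ordered, 2))


# First comparison from the example above.
pairs = to_pairs(["08711", "00842", "22249"])
print(pairs)
```

A comparison of n videos yields n * (n - 1) / 2 pairs, so a four-video comparison contributes six.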

xksteven commented 2 years ago

As a note, we'll want to recollect the numbers for the paper, such as how many examples are in each dataset, how many example pairs, etc.

xksteven commented 2 years ago

Possibilities for test set:

/data/erictang000/video_wellbeing_beta/wellbeing_data/harder_test2.csv or /data/erictang000/video_wellbeing_beta/wellbeing_data/harder_test3.csv

train set seems to be:

/data/erictang000/video_wellbeing_beta/wellbeing_data/harder_train2.csv

This seems to be the listwise path:

/data/erictang000/video_wellbeing_beta/wellbeing_data/listwise2_20k.csv

It would be good to double-check that no videos overlap between the train, test, and listwise sets.
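
A minimal sketch of such an overlap check follows. It assumes each CSV has already been loaded into a list of comparisons, where each comparison is a list of video ids; the column layout of the CSV files is not shown here, so loading is left out. `videos_in` and `find_overlaps` are hypothetical helpers.

```python
from itertools import combinations


def videos_in(comparisons):
    """Collect the set of video ids that appear in a list of comparisons."""
    return {video_id for comparison in comparisons for video_id in comparison}


def find_overlaps(splits):
    """Map each pair of split names to the video ids they share.

    Only pairs with a non-empty overlap are included.
    """
    videos = {name: videos_in(comparisons) for name, comparisons in splits.items()}
    overlaps = {}
    for first, second in combinations(sorted(videos), 2):
        shared = videos[first] & videos[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


# Made-up ids for illustration only; "00003" is deliberately shared by train and test.
splits = {
    "train": [["00001", "00002"], ["00002", "00003"]],
    "test": [["00003", "00004"]],
    "listwise": [["00005", "00006", "00007"]],
}
print(find_overlaps(splits))
```

An empty dictionary means the splits are mutually exclusive; `len(videos_in(...))` also gives the unique-video count for a split.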

xksteven commented 2 years ago

Eric responded:

"looks like it was harder_train4 and harder_test4"

Info about listwise:

"The videos in the listwise dataset don't appear in either the train or test it's like a separate test set since we don't train on listwise comparisons."

JunShern commented 2 years ago

I cannot find harder_train4 and harder_test4 in the directory though, there's only up to train3 and test3.

JunShern commented 2 years ago

Alright so I've got both the VCE and V2V datasets ready and I've also checked a bunch of labels manually to make sure the exports are right.

VCE has

train: 50000 labelled videos
test: 11046 labelled videos
metadata: 61406 items

Total videos: 61046

V2V has

train: 11115 comparisons on 16279 unique videos
test: 3112 comparisons on 6069 unique videos
listwise: 1758 comparisons on 4322 unique videos
{train}, {test}, {listwise} videos are mutually exclusive.

Total videos: 26670

Two things to note:

1. From looking through the labels manually, I'm 90% sure that the V2V comparisons for train/test/listwise are ordered from least-preferred to most-preferred. I checked over 20 comparisons and this was true for all of them. But we should double-check with Eric to confirm.
2. The train/test/listwise splits of V2V do not correspond to the same train/test of VCE. This is fine, we don't need them to be the same. But I think it's helpful to allow people to reuse the same videos for V2V and VCE if they want to, so for both datasets I'll simply lump all the videos into a videos/ folder instead of train/ and test/, and the labels.json files will be used to identify which videos belong to which splits.

xksteven commented 2 years ago

harder_train4.csv harder_test4.csv

JunShern commented 2 years ago

Cool! Updated V2V to use these two new files:

train: 11038 comparisons on 16125 unique videos
test: 3189 comparisons on 6223 unique videos
listwise: 1758 comparisons on 4322 unique videos

Total videos: 26670

JunShern commented 2 years ago

Both VCE and V2V datasets have been uploaded here (by Dan): https://drive.google.com/drive/folders/1sRKitbXpLZ4pwXTONjiA-X0Y1z4I2o4X

(Documentation of those datasets is already in our README https://github.com/hendrycks/emodiversity#readme)

Marking this issue as closed.

hendrycks / emodiversity

Create a zip or tar of the dataset #1
