izmailovpavel / neurips_bdl_starter_kit


HMC predictions on CIFAR-C, MedMNIST, and Diabetic Retinopathy Dataset #4

Closed: PaulScemama closed this issue 11 months ago

PaulScemama commented 1 year ago

Hi!

Are predictions made available for the CIFAR-C, MedMNIST, and Diabetic Retinopathy Datasets?

izmailovpavel commented 1 year ago

Hi @PaulScemama! I just added the HMC predictions and the competition scoring script here: https://github.com/izmailovpavel/neurips_bdl_starter_kit/tree/main/eval-phase. Please let me know if you run into any issues using it.
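
(Not the official scoring script -- just a rough sketch of the kind of comparison it makes, assuming your model's predicted probabilities and the HMC reference probabilities are both (N, num_classes) NumPy arrays:)

import numpy as np

def top1_agreement(model_probs, hmc_probs):
    """Fraction of test examples where the model's top class matches HMC's top class."""
    return (model_probs.argmax(axis=1) == hmc_probs.argmax(axis=1)).mean()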

Note that we ended up dropping the Diabetic Retinopathy dataset from the competition, so we have not released the scripts for it.

PaulScemama commented 1 year ago

Amazing! Thank you so much. I will keep you updated if I run into any issues. Thanks again.

PaulScemama commented 1 year ago

@izmailovpavel the cifar10 HMC predictions in cifar_probs.csv have shape (50000, 10). I believe cifar10 has 60,000 samples -- 10,000 of which are for the test set. So are these the HMC predictions on the training set?

If so, where would I find HMC predictions on the test set?

Best, Paul

Edit: I figured it out -- got the 10,000 HMC predictions for the test set. Thank you! I will reach out if I run into any more trouble.

izmailovpavel commented 1 year ago

Hey @PaulScemama, the CIFAR predictions are for the test data, but we also apply corruptions from CIFAR-10-C. You can get the exact copy of the evaluation data we used following this notebook.

The first 10k examples should represent the original CIFAR-10 test set. In the next 40k examples, we have 4 copies of the test set with various corruptions.
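
(For concreteness, a minimal sketch of that layout, assuming cifar_probs.csv has been downloaded locally and is a plain comma-separated file with no header:)

import numpy as np

probs = np.loadtxt("cifar_probs.csv", delimiter=",")  # shape (50000, 10)
blocks = [probs[i * 10000:(i + 1) * 10000] for i in range(5)]
# blocks[0] -> original CIFAR-10 test set; blocks[1:] -> the 4 corrupted copies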

Note that if you want the results to be comparable with the competition results, you should use the data we provided, and evaluate on all 50k points.

If you are just generally interested in HMC samples on CIFAR-10, we also have the data for the development phase here with instructions in this notebook

PaulScemama commented 1 year ago

Hi @izmailovpavel, thanks again for discussing this with me. It is really a great help.

This is my current understanding:

"If you are just generally interested in HMC samples on CIFAR-10, we also have the data for the development phase here with instructions in this notebook"

This link and notebook contain/use the original CIFAR-10 trainset and testset. The HMC sampler was trained using this trainset. The probs.csv in the link ("here") contains the HMC predictions for the original CIFAR-10 testset.

"the CIFAR predictions are for the test data, but we also apply corruptions from CIFAR-10-C. You can get the exact copy of the evaluation data we used following this notebook.

The first 10k examples should represent the original CIFAR-10 test set. In the next 40k examples, we have 4 copies of the test set with various corruptions."

The link and notebook here contain/use the original CIFAR-10 trainset, but the testset is augmented with 40k more examples that come from CIFAR-10-C (corruptions). The HMC predictions on all 50k (the original testset augmented with the 40k corrupted examples) are found here.

Let me know if I'm getting any of this wrong. And thanks again!

izmailovpavel commented 1 year ago

Hey @PaulScemama, yes, that is correct!

PaulScemama commented 1 year ago

@izmailovpavel great - thank you so much! I will keep this issue open for the time being in case anything else comes up, but you've been a great help.

I'd also like to say I've read many of your papers and admire the work you've done! I'm currently working as an ML engineer after finishing undergrad last year, but I'm fairly convinced I will pursue more school in the coming years, and one of the reasons for feeling that way is reading papers like yours and others from your group.

Have a good weekend

izmailovpavel commented 1 year ago

@PaulScemama sounds good!

Thank you so much for your kind words, I really appreciate it! Best of luck with your career and research!

PaulScemama commented 1 year ago

I'm back for an update / new question @izmailovpavel :)

To summarize the data I have:

The Cifar10 data:

  1. 50k training data and 10k original test data -- found under the "Development Phase" section on the main site
  2. 50k training data and 50k test data, where the training data is the same as in (1.), the first 10k of the test data are the original test data from (1.), and the next 40k are 4 different corrupted versions of the test data. These are all found in the cifar_anon.npz file, which can be found under the "Evaluation Phase" section on the main site

I've checked that the first 10k of the 50k test data in (2.) match the 10k original test data in (1.). So we're all good there.
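
(A rough sketch of that check, with hypothetical local paths for the two downloads and assuming the development-phase file also stores its test images under an "x_test" key:)

import numpy as np

dev = np.load("cifar10_dev_phase.npz")  # development-phase data, item (1.) -- hypothetical filename
ev = np.load("cifar_anon.npz")          # evaluation-phase data, item (2.)

# The first 10k evaluation-phase test images should equal the original 10k test images.
assert np.allclose(dev["x_test"], ev["x_test"][:10000])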

The HMC predictions:

  1. There are 10k HMC outputs found here under probs.csv
  2. There are 50k HMC outputs found here under cifar_probs.csv

My assumption was that the 10k HMC outputs in (1.) would match the first 10k HMC outputs in (2.), but this does not appear to be the case. Instead, combining (1.) and (2.), it appears we have 60k unique HMC outputs, even though we only have 50k unique test inputs.

Where do these extra 10k HMC outputs come from?

Note: I looked at the accuracy of each set of HMC predictions (60k in total) on the test-set labels (accuracy table not reproduced here).
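
(A minimal sketch of that per-block accuracy check, assuming the prediction CSVs are plain comma-separated files with no header and the labels come from cifar10_test_y.csv:)

import numpy as np

probs = np.loadtxt("cifar_probs.csv", delimiter=",")    # (50000, 10) evaluation-phase HMC probs
dev_probs = np.loadtxt("probs.csv", delimiter=",")      # (10000, 10) development-phase HMC probs
y_test = np.loadtxt("cifar10_test_y.csv").astype(int)   # 10k labels, shared by every 10k block

for i in range(5):
    block = probs[i * 10000:(i + 1) * 10000]
    acc = (block.argmax(axis=1) == y_test).mean()
    print(f"evaluation-phase block {i + 1}: accuracy {acc:.3f}")

print("development-phase probs accuracy:", (dev_probs.argmax(axis=1) == y_test).mean())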

Additionally, it would be helpful to know the mapping between the types of corruptions and their corresponding indices in the 50k test data (e.g. "brightness v1 corruption": indices 10000:20000).

Do you know the answer to this, or alternatively is there a document that explains it?

FYI -- once I get everything together I am willing to package everything up nicely in one .npz file with a description of the contents so that anyone who is in the same position as me can just download one thing.

izmailovpavel commented 1 year ago

Hey @PaulScemama!

The two sets of HMC probs both represent predictions on the test data: (1) for the development phase and (2) for the evaluation phase. The reason the first 10k don't match is that the predictions correspond to different models. Specifically, we used an AlexNet in the evaluation phase (corresponding to probs 2) and a ResNet-20 in the development phase (corresponding to probs 1).

Please let me know if you have questions!

For the mapping between the data and corruptions — I unfortunately don't have it. I will see if I can recover it.

Packaging everything in an .npz sounds great, thank you for doing it!

PaulScemama commented 1 year ago

That makes sense! Thank you. And thank you for looking into the corruptions.

I'll send over the .npz soon.

Thanks!

PaulScemama commented 1 year ago

Edit: found a bug in my code but may still have a question -- I'll get back to you tomorrow. Sorry about that!

PaulScemama commented 1 year ago

@izmailovpavel Okay I'm back 😅 -- so my (revised) question is:

  1. I can't seem to get even close to 80% val/test accuracy with the AlexNet on Cifar10 (without any data augmentation). The only things I use are a cosine decay scheduler with SGD, and I consistently get ~65% val/test accuracy. This is in stark contrast to the ~87% we've agreed the HMC sampler got on the uncorrupted testset (see above). I just want to make sure that the HMC outputs that get ~87% test accuracy really do come from the AlexNet backbone -- I'm somewhat skeptical because I can't find anything online that reports AlexNet getting above the 70s on Cifar10. I could definitely be wrong, but if you could check that the HMC outputs, backbones, and testsets match up properly, that would be great!

izmailovpavel commented 1 year ago

Hey @PaulScemama! Sorry, I was quite busy over the last two weeks.

I did a test run here: https://gist.github.com/izmailovpavel/438ed2fcf46b2ea5f8a8e7fac3daffc3

With our architecture on CIFAR-10 I am getting about 78% test accuracy with default SGD and pretty much zero parameter tuning. I also tried ensembling 3 solutions, which gets about 81.3%. Note that our HMC predictions correspond to an ensemble of thousands of HMC samples.
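
(A minimal sketch of that kind of ensembling, not the exact code from the gist: just average the softmax probabilities of several independently trained copies and predict the argmax.)

import numpy as np

def ensemble_probs(prob_list):
    """Average the (N, num_classes) softmax outputs of several trained copies of the model."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# Example: accuracy of a 3-model ensemble, given per-model probabilities p1, p2, p3
# and integer test labels y_test of shape (N,):
# acc = (ensemble_probs([p1, p2, p3]).argmax(axis=1) == y_test).mean()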

Similarly, in our original HMC BNN paper (here, Table 3), tuned SGD gets about 83% accuracy, while HMC gets 89.6%. Note that a single HMC run corresponds to thousands of regular SGD runs in terms of compute.

So to sum up, to the best of my knowledge, the first 10000 samples correspond to the HMC AlexNet (here) predicted class probabilities on the CIFAR-10 test set.

PaulScemama commented 1 year ago

@izmailovpavel sorry I totally forgot to respond to this! Yes, you are right. Thank you for helping me with the example. I have the .npz file that I mentioned above, but it is a few megabytes. What is the best way to send it to you?

pavel-izmailov commented 1 year ago

Hey @PaulScemama, thank you for looking into this! Do you think you could email the npz to me? My nyu email is pi390@nyu.edu

PaulScemama commented 1 year ago

Sure! @pavel-izmailov. Also could you please let me know about the types of corruptions used for evaluating the HMC predictions? In particular, we have 50k HMC outputs found here under cifar_probs.csv.

The first 10k, as we've discussed, are on the original Cifar10 testset (no corruptions). Can you let me know which datasets were used (there's a list of them here) to produce the second, third, fourth, and fifth 10k, to get those corresponding accuracies?

Thanks so much! It is very much appreciated.

PaulScemama commented 1 year ago

@izmailovpavel I will email you later today when I get home from work so I can use Gmail/Google Drive to send a large file. In the meantime I thought I'd share the script for creating what I'm going to send you. If you follow the directions in the docstring you will (hopefully) manage to create the data quickly. Let me know if there are any issues! And apologies for the delay.

import numpy as np

# =============== Creating the new .npz file from Neurips Approx Bayes Competition =========== #
def create_npz_with_test_labels(
    original_npz_path: str, cifar10_test_labels_path: str, new_npz_path: str
) -> None:
    """
    The original "cifar_anon.npz" found at this link
    https://storage.googleapis.com/neurips2021_bdl_competition/evaluation_phase/cifar_anon.npz
    contains all 0s for `y_test`.

    This function replaces the (all 0s) `y_test` with the appropriate Cifar10 labels found at this link
    https://storage.googleapis.com/neurips2021_bdl_competition/cifar10_test_y.csv
    and then saves a new .npz file.

    Parameters
    ----------
    original_npz_path : str
        Path to the original cifar_anon.npz file
    cifar10_test_labels_path : str
        Path to the cifar10 labels
    new_npz_path : str
        Where to save the new .npz file

    Returns
    -------
        None
            Saves a new .npz file with the following files:
                - 'x_train': 50k training images for Cifar10.
                - 'y_train': 50k training labels for Cifar10.
                - 'x_test_v1': 10k test images from the Original Cifar10
                - 'x_test_v2': 10k test images from a corrupted version of Cifar10
                - 'x_test_v3': 10k test images from a corrupted version of Cifar10
                - 'x_test_v4': 10k test images from a corrupted version of Cifar10
                - 'x_test_v5': 10k test images from a corrupted version of Cifar10
                - 'y_test': 10k test labels for the test images of Cifar10 (all the same for each version of images).

    Usage
    -----
        To use this function,

        - download the "cifar_anon.npz" file at the first link provided above.
        - download the Cifar10 labels at the second link provided above.
        - provide the paths to where you downloaded to `original_npz_path`
          and `cifar10_test_labels_path` respectively.
        - provide a path to store the new .npz file as a result of running the function.
        - run function.

        The new .npz file created will contain `y_test` that corresponds to the correct labels
        for `x_test`.
    """
    original_data = np.load(original_npz_path)
    x_train = original_data["x_train"]
    y_train = original_data["y_train"]

    x_test = original_data["x_test"]
    # y_test is all ZEROS right now
    x_test_v1 = x_test[0:10000]
    x_test_v2 = x_test[10000:20000]
    x_test_v3 = x_test[20000:30000]
    x_test_v4 = x_test[30000:40000]
    x_test_v5 = x_test[40000:50000]

    # Load the 10k test labels (shared by all five versions of the test images)
    y_test = np.loadtxt(cifar10_test_labels_path)

    # Save as npz
    # NOTE that the labels for the test set are all the same
    # for each corrupted version of the test set.
    np.savez(
        new_npz_path,
        x_train=x_train,
        y_train=y_train,
        x_test_v1=x_test_v1,
        x_test_v2=x_test_v2,
        x_test_v3=x_test_v3,
        x_test_v4=x_test_v4,
        x_test_v5=x_test_v5,
        y_test=y_test,
    )

# EXAMPLE USAGE
if __name__ == "__main__":
    create_npz_with_test_labels(
        original_npz_path="./cifar_anon.npz",
        cifar10_test_labels_path="./cifar10_test_y.csv",
        new_npz_path="new_data.npz",
    )

    data = np.load("./new_data.npz")

    x_train = data["x_train"]
    y_train = data["y_train"]

    x_test_v1 = data["x_test_v1"]
    x_test_v2 = data["x_test_v2"]
    x_test_v3 = data["x_test_v3"]
    x_test_v4 = data["x_test_v4"]
    x_test_v5 = data["x_test_v5"]

    y_test = data["y_test"]

    print(x_test_v1.shape)
    print(x_test_v2.shape)
    print(x_test_v3.shape)
    print(x_test_v4.shape)
    print(x_test_v5.shape)

izmailovpavel commented 1 year ago

Hey @PaulScemama , thank you so much for looking into it!

Also could you please let me know about the types of corruptions used for evaluating the HMC predictions?

Unfortunately, I no longer have this information. We have the dataset of actual preprocessed features (the x_test arrays) which could in theory possibly be matched to the original corrupted test data. But I currently don't have the information about what corruptions were used. Sorry about that!

Thank you so much for sharing the script, it looks great!

PaulScemama commented 1 year ago

@izmailovpavel you're right, I didn't think of that! I've found out what each "version" is:

How to check:


from typing import Tuple

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# This is the file created by the script I sent before
# (it saves the test images as five separate 10k arrays, one per version)
data = np.load("./data/train_test.npz")

x_test_v1 = data["x_test_v1"]
x_test_v2 = data["x_test_v2"]
x_test_v3 = data["x_test_v3"]
x_test_v4 = data["x_test_v4"]
x_test_v5 = data["x_test_v5"]

def load_image_dataset(
    dataset_name: str,
    split: str,
    as_supervised: bool = True,
) -> Tuple[tf.data.Dataset, int]:
    """Load in a tensorflow image dataset.

    Parameters
    ----------
    dataset_name : str
        name of dataset -- must be one of ["cifar10", "cifar100", "mnist"]
    split : str
        split name -- must be one of ["test", "train"]
    as_supervised : bool, optional
        if True returns x,y as tuple. Otherwise returns as dict, by default True

    Returns
    -------
    Tuple[tf.data.Dataset, int]
        the loaded dataset, and the number of classes
    """
    dataset, dataset_info = tfds.load(
        dataset_name,
        split=split,
        as_supervised=as_supervised,
        with_info=True,
    )
    num_classes = dataset_info.features["label"].num_classes
    num_examples = dataset_info.splits[split].num_examples
    num_channels = dataset_info.features["image"].shape[-1]
    return (
        dataset,
        {
            "name": dataset_name,
            "num_classes": num_classes,
            "num_examples": num_examples,
            "num_channels": num_channels,
        },
    )

# Convert image dtype to float32
num_channels = 3
dataset_stats = ((0.49, 0.48, 0.44), (0.2, 0.2, 0.2))

# From https://github.com/google-research/google-research/blob/master/bnn_hmc/utils/data_utils.py
def img_to_float32(image, label):
    return tf.image.convert_image_dtype(image, tf.float32), label

def img_normalize(image, label):
    """Normalize the image to zero mean and unit variance."""
    mean, std = dataset_stats
    image -= tf.constant(mean, shape=[1, 1, num_channels], dtype=image.dtype)
    image /= tf.constant(std, shape=[1, 1, num_channels], dtype=image.dtype)
    return image, label

# Check which TFDS corrupted split corresponds to each test version by comparing first examples
for name in ["gaussian_noise_2", "brightness_3", "pixelate_4", "zoom_blur_5"]:
    cifar10, info = load_image_dataset(f"cifar10_corrupted/{name}", "test")
    cifar10 = cifar10.map(img_to_float32).cache()
    cifar10 = cifar10.map(img_normalize)
    x, y = iter(cifar10).get_next()

    def any_hits(x):
        if (x.numpy() == x_test_v2[0]).all():
            return "v2"
        if (x.numpy() == x_test_v3[0]).all():
            return "v3"
        if (x.numpy() == x_test_v4[0]).all():
            return "v4"
        if (x.numpy() == x_test_v5[0]).all():
            return "v5"
    print(f"{name} is equivalent to {any_hits(x)}")
izmailovpavel commented 1 year ago

@PaulScemama That's awesome, thank you so much for doing this! I really appreciate you looking into this!

Btw, do you still plan on sending the .npz to me by email? Or should I just run your script above?

Thank you!

PaulScemama commented 1 year ago

@pavel-izmailov sure thing! I had been procrastinating on sending it 😅. I just tried to share it with you via Google Drive. Let me know if you've received it and whether there are any issues!

izmailovpavel commented 11 months ago

Hey @PaulScemama, sorry for taking so long to respond, I received the file! Added it to the readme here. Thank you so much for digging into it, I really appreciate your help!

PaulScemama commented 11 months ago

@izmailovpavel no worries! Great, thanks so much -- I just took a look. Maybe it would be helpful to add what each version corresponds to?

v2: cifar10_corrupted/gaussian_noise_2
v3: cifar10_corrupted/brightness_3
v4: cifar10_corrupted/pixelate_4
v5: cifar10_corrupted/zoom_blur_5

So for example, instead of

'x_test_v2': 10k test images from a corrupted version of Cifar10

Have

'x_test_v2': 10k test images from cifar10_corrupted/gaussian_noise_2.

I saw you referenced this discussion so maybe it is not needed though. Thanks again!

izmailovpavel commented 11 months ago

Hey @PaulScemama, good point, thank you! I updated the Readme to mention the datasets used for each split!

PaulScemama commented 11 months ago

Sounds good @izmailovpavel ! Feel free to close the issue whenever. I appreciate all the help :)

izmailovpavel commented 11 months ago

@PaulScemama sounds good! I really appreciate you digging into it and putting the effort to make the data more accessible to people :) Thank you!

PaulScemama commented 11 months ago

@izmailovpavel of course! I've benefitted so much from your code repositories in my own research journey, so it's the least I could do. Last question: do you mind if I email you some questions about your PhD experience, preparing for it, etc.? I think I'll seriously consider applying in the next cycle (fall 2024). I'm quite interested in Bayesian machine learning, Bayesian experimental design (active learning, optimization), and approximate inference, and maybe you'll have some good recommendations for professors who are interested in the same things.

izmailovpavel commented 11 months ago

Hey @PaulScemama, definitely, feel free to email me (pi390@nyu.edu) and we can schedule a meeting to chat about it! Glad to hear you are interested in doing a PhD :)

PaulScemama commented 10 months ago

@izmailovpavel thank you so much! I will send you an email shortly