attila-dusnoki-htec commented 9 months ago

To properly test model accuracy, random data is not enough, since it might not cover the realistic range of possible input values.

We should collect candidates for datasets, and assign models to them.

The idea is to use public datasets. HuggingFace hosts datasets and also provides helpers to load them in Python. Downloading alone is not enough, since each model has its own pre- and post-processing steps, which can vary from model to model.

The whole process should be automatic and deterministic.
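One way to keep sample selection deterministic is to draw dataset indices from a locally seeded random generator, so every run picks the same inputs. This is only a minimal sketch: the helper name `pick_sample_indices`, the seed, and the sizes are illustrative assumptions, not part of test_runner.py.

```python
import random


def pick_sample_indices(dataset_size, sample_count, seed=42):
    """Return a reproducible, sorted list of dataset indices (hypothetical helper)."""
    if sample_count > dataset_size:
        raise ValueError("sample_count exceeds dataset_size")
    # A local Random instance keeps the result independent of global RNG state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), sample_count))


first = pick_sample_indices(1000, 7)
second = pick_sample_indices(1000, 7)
print(first == second)  # the same seed yields the same indices
```

Sorting the indices is a design choice here: it makes the generated test cases appear in dataset order, which is easier to diff between runs.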

attila-dusnoki-htec commented 9 months ago

We should make this work with test_runner.py

A good start would be to enable 2-3 datasets with 1-2 models.

attila-dusnoki-htec commented 9 months ago

The current work is here

It covers all 3 major data sources: ImageNet2012 (images), SQuADv1.1 (text), and LibriSpeechASR (sound), with single-model cases that use them with the proper preprocessing steps. The result is in a format that test_runner.py can run.
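To illustrate what a text preprocessing step involves, here is a rough sketch of fitting a tokenized sample to a fixed sequence length and building the matching attention mask. In practice the model's own tokenizer would do this; the helper name `pad_to_length`, the default length of 384, and the padding id of 0 are assumptions made for the example.

```python
def pad_to_length(token_ids, max_length=384, pad_id=0):
    """Truncate or right-pad token ids; return (input_ids, attention_mask).

    max_length and pad_id are illustrative defaults, not values taken
    from any specific tokenizer.
    """
    kept = list(token_ids[:max_length])
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


ids, mask = pad_to_length([101, 2054, 2003, 102], max_length=8)
print(ids)   # [101, 2054, 2003, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```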

Next step: Enable multi-model scenarios e.g. Whisper (encoder-decoder), Stable Diffusion (text-encoder, unet, vae-decoder), etc.

attila-dusnoki-htec commented 7 months ago

Current state

Code link to reproduce

Note: The default atol/rtol values were used. For fp16, judging by the logs, the numbers look correct but are less precise than the tolerance allows. It might be better to report the difference as well, to make the comparison easier.
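Reporting the difference could look roughly like the sketch below. It uses the same inequality that NumPy documents for `allclose`, `|actual - expected| <= atol + rtol * |expected|`, and returns the largest absolute and relative differences alongside the verdict. The function name `compare` and the default tolerances (NumPy's `allclose` defaults) are assumptions; the defaults used by test_runner.py may differ.

```python
def compare(actual, expected, rtol=1e-5, atol=1e-8):
    """Closeness check that also reports the largest differences (hypothetical helper)."""
    max_abs = 0.0
    max_rel = 0.0
    passed = True
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # Written as "not <=" so that a NaN difference counts as a failure.
        if not abs_diff <= atol + rtol * abs(e):
            passed = False
        max_abs = max(max_abs, abs_diff)
        if e != 0:
            max_rel = max(max_rel, abs_diff / abs(e))
    return {"passed": passed, "max_abs_diff": max_abs, "max_rel_diff": max_rel}


expected = [1.0, 2.0, 4.0]
actual = [1.0005, 1.999, 4.002]  # made-up values with roughly fp16-sized errors
report = compare(actual, expected)
print(report)
print(compare(actual, expected, rtol=1e-3, atol=1e-3)["passed"])
```

With the tight defaults the check fails, while the looser tolerances accept the same values; the reported maxima show how far off the result is, which a bare pass/fail count hides.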

We can download and generate actual test data for the following models, each with its dataset:

Imagenet dataset (image)

resnet50_v1.5_fp16.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "resnet50_v1.5" has 7 cases:
     Passed: 0
     Failed: 7

resnet50_v1.5_fp32.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "resnet50_v1.5" has 7 cases:
     Passed: 7
     Failed: 0

resnet50_v1_fp16.log

Input: input_tensor:0, shape: (1, 3, 224, 224)

Test "resnet50_v1" has 7 cases:
     Passed: 1
     Failed: 6

resnet50_v1_fp32.log

Input: input_tensor:0, shape: (1, 3, 224, 224)

Test "resnet50_v1" has 7 cases:
     Passed: 7
     Failed: 0

timm-mobilenetv3-large_fp16.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "timm-mobilenetv3-large" has 7 cases:
     Passed: 0
     Failed: 7

timm-mobilenetv3-large_fp32.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "timm-mobilenetv3-large" has 7 cases:
     Passed: 7
     Failed: 0

vit-base-patch16-224_fp16.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "vit-base-patch16-224" has 7 cases:
     Passed: 0
     Failed: 7

vit-base-patch16-224_fp32.log

Input: pixel_values, shape: (1, 3, 224, 224)

Test "vit-base-patch16-224" has 7 cases:
     Passed: 7
     Failed: 0

SQuAD dataset (text)

distilbert-base-cased-distilled-squad_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "distilbert-base-cased-distilled-squad" has 7 cases:
     Passed: 0
     Failed: 7

distilbert-base-cased-distilled-squad_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "distilbert-base-cased-distilled-squad" has 7 cases:
     Passed: 7
     Failed: 0

gpt-j_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "gpt-j" has 42 cases:
     Passed: 0
     Failed: 42

gpt-j_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "gpt-j" has 42 cases:
     Passed: 42
     Failed: 0

roberta-base-squad2_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "roberta-base-squad2" has 7 cases:
     Passed: 0
     Failed: 7

roberta-base-squad2_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "roberta-base-squad2" has 7 cases:
     Passed: 7
     Failed: 0

LibriSpeech dataset (audio)

wav2vec2-base-960h_fp16.log

Input: input_values, shape: (1, 105440)

Test "wav2vec2-base-960h" has 7 cases:
     Passed: 0
     Failed: 7

wav2vec2-base-960h_fp32.log

Input: input_values, shape: (1, 105440)

Test "wav2vec2-base-960h" has 7 cases:
     Passed: 1
     Failed: 6

whisper-small-en_fp16.log

Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)

Test "whisper-small-en" has 21 cases:
     Passed: 0
     Failed: 21

whisper-small-en_fp32.log

Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)

Test "whisper-small-en" has 21 cases:
     Passed: 21
     Failed: 0

attila-dusnoki-htec commented 7 months ago

Current State pt2

Imagenet dataset (image)

clip_vit_fp16.log

Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)

Test "clip-vit-large-patch14" has 7 cases:
     Passed: 0
     Failed: 7

clip_vit_fp32.log

Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)

Test "clip-vit-large-patch14" has 7 cases:
     Passed: 7
     Failed: 0

SQuAD dataset (text)

gemma_2b_it_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "gemma-2b-it" has 30 cases:
     Passed: 0
     Failed: 30

gemma_2b_it_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "gemma-2b-it" has 30 cases:
     Passed: 30
     Failed: 0

t5_base_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)

Test "t5-base" has 30 cases:
     Passed: 0
     Failed: 30

Note: All actual values are zero!

t5_base_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)

Test "t5-base" has 30 cases:
     Passed: 30
     Failed: 0

attila-dusnoki-htec commented 7 months ago

Current State pt3

SQuAD dataset (text)

bert-large-uncased_fp16.log

Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "bert-large-uncased" has 5 cases:
     Passed: 0
     Failed: 5

bert-large-uncased_fp32.log

Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)

Test "bert-large-uncased" has 5 cases:
     Passed: 4
     Failed: 1

llama2-7b-chat-hf_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "llama2-7b-chat-hf" has 17 cases:
     Passed: 0
     Failed: 17

Note: Major difference, not just a simple tolerance issue.

llama2-7b-chat-hf_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "llama2-7b-chat-hf" has 17 cases:
     Passed: 17
     Failed: 0

llama3-8b-instruct_fp16.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "llama3-8b-instruct" has 25 cases:
     Passed: 0
     Failed: 25

Note: Major difference, not just a simple tolerance issue.

llama3-8b-instruct_fp32.log

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)

Test "llama3-8b-instruct" has 25 cases:
     Passed: 25
     Failed: 0

attila-dusnoki-htec commented 7 months ago

Add the following models:

DLRM-DCNv2
Wide and Deep
SD 1.5/XL
RNNT
BERT-L
Resnet50 1.5

attila-dusnoki-htec commented 7 months ago

DLRM-DCNv2

Can't be exported to ONNX due to KeyedJaggedTensor.

But it could be created with these: dlrm_model.py, export_dlrm.py

To create the dataset: https://github.com/facebookresearch/dlrm/blob/main/torchrec_dlrm/scripts/process_Criteo_1TB_Click_Logs_dataset.sh
These can be changed to use only the day_23 file.

attila-dusnoki-htec commented 6 months ago

Current State pt4

COCO dataset (image) + Style prompts (text)

stable-diffusion-2-1_text_encoder_fp16.log

Input: input_ids, shape: (2, 77)

Test "text_encoder" has 5 cases:
     Passed: 0
     Failed: 5

stable-diffusion-2-1_text_encoder_fp32.log

Input: input_ids, shape: (2, 77)

Test "text_encoder" has 5 cases:
     Passed: 5
     Failed: 0

stable-diffusion-2-1_unet_fp16.log

Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)

Test "unet" has 25 cases:
     Passed: 0
     Failed: 25

Note: Outputs are NaNs
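A NaN output can never satisfy a tolerance check, so it may help to flag such cases separately from ordinary precision failures. A minimal sketch, using a hypothetical helper `diagnose_output` that is not part of test_runner.py and operates on a flat list of output values:

```python
import math


def diagnose_output(values):
    """Classify a flat list of output values before any tolerance check."""
    if not values:
        return "empty"
    nan_count = sum(1 for v in values if math.isnan(v))
    if nan_count == len(values):
        return "all NaN"
    if nan_count > 0:
        return f"{nan_count} of {len(values)} values are NaN"
    if all(v == 0 for v in values):
        return "all zero"
    return "ok"


print(diagnose_output([float("nan"), float("nan")]))  # all NaN
print(diagnose_output([0.5, float("nan"), 1.0]))      # 1 of 3 values are NaN
print(diagnose_output([0.0, 0.0]))                    # all zero
```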

stable-diffusion-2-1_unet_fp32.log

Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)

Test "unet" has 25 cases:
     Passed: 25
     Failed: 0

stable-diffusion-2-1_vae_decoder_fp16.log

Input: latent_sample, shape: (1, 4, 64, 64)

Test "vae_decoder" has 5 cases:
     Passed: 0
     Failed: 5

stable-diffusion-2-1_vae_decoder_fp32.log

Input: latent_sample, shape: (1, 4, 64, 64)

Test "vae_decoder" has 5 cases:
     Passed: 5
     Failed: 0

stable-diffusion-2-1_vae_encoder_fp16.log

Input: sample, shape: (1, 3, 512, 512)

Test "vae_encoder" has 5 cases:
     Passed: 0
     Failed: 5

Note: Some of the outputs are NaNs

stable-diffusion-2-1_vae_encoder_fp32.log

Input: sample, shape: (1, 3, 512, 512)

Test "vae_encoder" has 5 cases:
     Passed: 0
     Failed: 5

stable-diffusion-xl_text_encoder_2_fp16.log

Input: input_ids, shape: (2, 77)

Test "text_encoder_2" has 5 cases:
     Passed: 0
     Failed: 5

stable-diffusion-xl_text_encoder_2_fp32.log

Input: input_ids, shape: (2, 77)

Test "text_encoder_2" has 5 cases:
     Passed: 5
     Failed: 0

stable-diffusion-xl_text_encoder_fp16.log

Input: input_ids, shape: (2, 77)

Test "text_encoder" has 5 cases:
     Passed: 0
     Failed: 5

stable-diffusion-xl_text_encoder_fp32.log

Input: input_ids, shape: (2, 77)

Test "text_encoder" has 5 cases:
     Passed: 5
     Failed: 0

stable-diffusion-xl_unet_fp16.log

Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)

Test "unet" has 25 cases:
     Passed: 0
     Failed: 25

stable-diffusion-xl_unet_fp32.log

Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)

Test "unet" has 25 cases:
     Passed: 25
     Failed: 0

stable-diffusion-xl_vae_decoder_fp16.log

Input: latent_sample, shape: (1, 4, 128, 128)

Test "vae_decoder" has 5 cases:
     Passed: 0
     Failed: 5

Note: Outputs are NaNs

stable-diffusion-xl_vae_decoder_fp32.log

Input: latent_sample, shape: (1, 4, 128, 128)

Test "vae_decoder" has 5 cases:
     Passed: 5
     Failed: 0

stable-diffusion-xl_vae_encoder_fp16.log

Input: sample, shape: (1, 3, 1024, 1024)

Test "vae_encoder" has 5 cases:
     Passed: 0
     Failed: 5

Note: Outputs are NaNs

stable-diffusion-xl_vae_encoder_fp32.log

Input: sample, shape: (1, 3, 1024, 1024)

Test "vae_encoder" has 5 cases:
     Passed: 0
     Failed: 5

migraphx-benchmark / AMDMIGraphX

Model testing with Datasets #168
