Open attila-dusnoki-htec opened 9 months ago
We should make this work with test_runner.py
A good start would be to enable 2-3 datasets with 1-2 models.
The current work is here
Covering all 3 major data sources: ImageNet2012 (images), SQuADv1.1 (text), and LibriSpeechASR (sound), with single models that use them with the proper preprocessing steps. The result is in a format that test_runner.py can run.
Next step: enable multi-model scenarios, e.g. Whisper (encoder-decoder), Stable Diffusion (text-encoder, unet, vae-decoder), etc.
Note: The default atol/rtol values were used. For fp16, judging by the logs, the numbers look correct but are less precise than the tolerance allows. It might be better to report the difference as well, to make the comparison easier.
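One way to report the difference could look like the sketch below. It assumes the outputs are available as NumPy arrays; `report_difference` is a hypothetical helper, not part of test_runner.py, and the `rtol`/`atol` defaults are placeholders rather than the values test_runner.py actually uses. The pass criterion is the same one `np.allclose` applies.

```python
import numpy as np


def report_difference(actual, expected, rtol=1e-3, atol=1e-3):
    """Summarize how far `actual` is from `expected`.

    Hypothetical helper: the rtol/atol defaults are placeholders,
    not necessarily the defaults used by test_runner.py.
    """
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    abs_diff = np.abs(actual - expected)
    # Same criterion as np.allclose: |actual - expected| <= atol + rtol * |expected|
    allowed = atol + rtol * np.abs(expected)
    # Written as a negation so that NaN entries count as mismatches.
    mismatch = ~(abs_diff <= allowed)
    # Avoid division by zero where the expected value is exactly 0.
    denom = np.maximum(np.abs(expected), np.finfo(np.float64).tiny)
    return {
        "passed": not bool(mismatch.any()),
        "max_abs_diff": float(abs_diff.max()),
        "max_rel_diff": float((abs_diff / denom).max()),
        "mismatched": int(np.count_nonzero(mismatch)),
    }


if __name__ == "__main__":
    print(report_difference([1.0, 2.1], [1.0, 2.0]))
```

Printing the maximum absolute and relative differences next to the pass/fail verdict would show at a glance whether an fp16 failure is just outside the tolerance or far off.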
We can download and generate actual test data for the following models with their datasets:
Input: pixel_values, shape: (1, 3, 224, 224)
Test "resnet50_v1.5" has 7 cases:
Passed: 0
Failed: 7
Input: pixel_values, shape: (1, 3, 224, 224)
Test "resnet50_v1.5" has 7 cases:
Passed: 7
Failed: 0
Input: input_tensor:0, shape: (1, 3, 224, 224)
Test "resnet50_v1" has 7 cases:
Passed: 1
Failed: 6
Input: input_tensor:0, shape: (1, 3, 224, 224)
Test "resnet50_v1" has 7 cases:
Passed: 7
Failed: 0
timm-mobilenetv3-large_fp16.log
Input: pixel_values, shape: (1, 3, 224, 224)
Test "timm-mobilenetv3-large" has 7 cases:
Passed: 0
Failed: 7
timm-mobilenetv3-large_fp32.log
Input: pixel_values, shape: (1, 3, 224, 224)
Test "timm-mobilenetv3-large" has 7 cases:
Passed: 7
Failed: 0
Input: pixel_values, shape: (1, 3, 224, 224)
Test "vit-base-patch16-224" has 7 cases:
Passed: 0
Failed: 7
Input: pixel_values, shape: (1, 3, 224, 224)
Test "vit-base-patch16-224" has 7 cases:
Passed: 7
Failed: 0
distilbert-base-cased-distilled-squad_fp16.log
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "distilbert-base-cased-distilled-squad" has 7 cases:
Passed: 0
Failed: 7
distilbert-base-cased-distilled-squad_fp32.log
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "distilbert-base-cased-distilled-squad" has 7 cases:
Passed: 7
Failed: 0
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gpt-j" has 42 cases:
Passed: 0
Failed: 42
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gpt-j" has 42 cases:
Passed: 42
Failed: 0
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "roberta-base-squad2" has 7 cases:
Passed: 0
Failed: 7
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "roberta-base-squad2" has 7 cases:
Passed: 7
Failed: 0
Input: input_values, shape: (1, 105440)
Test "wav2vec2-base-960h" has 7 cases:
Passed: 0
Failed: 7
Input: input_values, shape: (1, 105440)
Test "wav2vec2-base-960h" has 7 cases:
Passed: 1
Failed: 6
Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)
Test "whisper-small-en" has 21 cases:
Passed: 0
Failed: 21
Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)
Test "whisper-small-en" has 21 cases:
Passed: 21
Failed: 0
Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)
Test "clip-vit-large-patch14" has 7 cases:
Passed: 0
Failed: 7
Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)
Test "clip-vit-large-patch14" has 7 cases:
Passed: 7
Failed: 0
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gemma-2b-it" has 30 cases:
Passed: 0
Failed: 30
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gemma-2b-it" has 30 cases:
Passed: 30
Failed: 0
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)
Test "t5-base" has 30 cases:
Passed: 0
Failed: 30
Note: All Actual values are zero!
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)
Test "t5-base" has 30 cases:
Passed: 30
Failed: 0
Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "bert-large-uncased" has 5 cases:
Passed: 0
Failed: 5
Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "bert-large-uncased" has 5 cases:
Passed: 4
Failed: 1
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama2-7b-chat-hf" has 17 cases:
Passed: 0
Failed: 17
Note: Major difference, not just a simple tolerance issue.
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama2-7b-chat-hf" has 17 cases:
Passed: 17
Failed: 0
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama3-8b-instruct" has 25 cases:
Passed: 0
Failed: 25
Note: Major difference, not just a simple tolerance issue.
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama3-8b-instruct" has 25 cases:
Passed: 25
Failed: 0
Add the following models:
Can't be exported to ONNX due to KeyedJaggedTensor
But it could be created with these: dlrm_model.py export_dlrm.py
To create the dataset: https://github.com/facebookresearch/dlrm/blob/main/torchrec_dlrm/scripts/process_Criteo_1TB_Click_Logs_dataset.sh These can be changed to only use the day_23 file.
stable-diffusion-2-1_text_encoder_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 0
Failed: 5
stable-diffusion-2-1_text_encoder_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 5
Failed: 0
stable-diffusion-2-1_unet_fp16.log
Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)
Test "unet" has 25 cases:
Passed: 0
Failed: 25
Note: Outputs are NaNs
stable-diffusion-2-1_unet_fp32.log
Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)
Test "unet" has 25 cases:
Passed: 25
Failed: 0
stable-diffusion-2-1_vae_decoder_fp16.log
Input: latent_sample, shape: (1, 4, 64, 64)
Test "vae_decoder" has 5 cases:
Passed: 0
Failed: 5
stable-diffusion-2-1_vae_decoder_fp32.log
Input: latent_sample, shape: (1, 4, 64, 64)
Test "vae_decoder" has 5 cases:
Passed: 5
Failed: 0
stable-diffusion-2-1_vae_encoder_fp16.log
Input: sample, shape: (1, 3, 512, 512)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5
Note: Some of the outputs are NaNs
stable-diffusion-2-1_vae_encoder_fp32.log
Input: sample, shape: (1, 3, 512, 512)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5
stable-diffusion-xl_text_encoder_2_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder_2" has 5 cases:
Passed: 0
Failed: 5
stable-diffusion-xl_text_encoder_2_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder_2" has 5 cases:
Passed: 5
Failed: 0
stable-diffusion-xl_text_encoder_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 0
Failed: 5
stable-diffusion-xl_text_encoder_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 5
Failed: 0
stable-diffusion-xl_unet_fp16.log
Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)
Test "unet" has 25 cases:
Passed: 0
Failed: 25
stable-diffusion-xl_unet_fp32.log
Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)
Test "unet" has 25 cases:
Passed: 25
Failed: 0
stable-diffusion-xl_vae_decoder_fp16.log
Input: latent_sample, shape: (1, 4, 128, 128)
Test "vae_decoder" has 5 cases:
Passed: 0
Failed: 5
Note: Outputs are NaNs
stable-diffusion-xl_vae_decoder_fp32.log
Input: latent_sample, shape: (1, 4, 128, 128)
Test "vae_decoder" has 5 cases:
Passed: 5
Failed: 0
stable-diffusion-xl_vae_encoder_fp16.log
Input: sample, shape: (1, 3, 1024, 1024)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5
Note: Outputs are NaNs
stable-diffusion-xl_vae_encoder_fp32.log
Input: sample, shape: (1, 3, 1024, 1024)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5
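The notes in the logs above describe several distinct kinds of failure: all-zero outputs, NaN outputs, major differences, and results that look correct but miss the tolerance. A minimal sketch of how these could be told apart automatically is shown below. `classify_output` and its `loose_factor` parameter are hypothetical names introduced for illustration, and the tolerance defaults are placeholders, not the values used by test_runner.py.

```python
import numpy as np


def classify_output(actual, expected, rtol=1e-3, atol=1e-3, loose_factor=100.0):
    """Put one output tensor into a coarse failure category.

    Hypothetical helper; `loose_factor` is an arbitrary illustrative
    multiplier separating "slightly imprecise" from "clearly wrong".
    """
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    if np.isnan(actual).any():
        return "nan"
    # Every actual value is zero while the reference is not.
    if not actual.any() and expected.any():
        return "all_zero"
    if np.allclose(actual, expected, rtol=rtol, atol=atol):
        return "ok"
    # Passing with a looser bound suggests a precision issue only.
    if np.allclose(actual, expected,
                   rtol=loose_factor * rtol, atol=loose_factor * atol):
        return "tolerance"
    return "major_difference"


if __name__ == "__main__":
    print(classify_output([1.01, 2.0], [1.0, 2.0]))
    print(classify_output([5.0, 2.0], [1.0, 2.0]))
```

Reporting such a category per test case would make it easier to separate fp16 precision issues from real bugs.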
To properly test model accuracy, random data is not enough, since it might not cover the range of values that real inputs take.
We should collect candidates for datasets, and assign models to them.
The idea is to use public datasets. HuggingFace provides datasets, along with helpers to load them in Python. Downloading is not enough, since each model has pre- and post-processing steps, which can vary from model to model.
The whole process should be automatic and deterministic.
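For the deterministic part, one option is to pick the dataset samples with a seeded, local random generator, so every run selects the same inputs. The sketch below is an assumption about how this could be done, not the existing implementation. `pick_sample_indices` is a hypothetical helper, and the dataset size and sample count in the usage line are purely illustrative.

```python
import random


def pick_sample_indices(dataset_size, num_samples, seed=0):
    """Return a reproducible, sorted list of distinct sample indices.

    Hypothetical helper: a local Random instance is used so the result
    does not depend on (or disturb) the global random state.
    """
    if num_samples > dataset_size:
        raise ValueError("num_samples exceeds dataset_size")
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), num_samples))


if __name__ == "__main__":
    # Illustrative numbers only.
    print(pick_sample_indices(dataset_size=50000, num_samples=7, seed=42))
```

The resulting indices could then be used to select rows from the loaded dataset before running the model-specific preprocessing, which keeps the generated test data identical across runs.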