I am trying to develop materials for mobile developers that let them compare BERT-based models designed specifically for mobile deployment. Currently, I have chosen MobileBERT and DistilBERT for that (repository).
Here's what I have done so far:
Fine-tune DistilBERT on the SST-2 dataset (text classification) (Kaggle Kernel).
Generate dynamic-range and float16 quantized TensorFlow Lite models (Kaggle Kernel).
Evaluate the TensorFlow Lite models on the SST-2 development set (Notebook).
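For context, the quantized conversion step (step 2) boils down to the standard `TFLiteConverter` flags. This is a minimal sketch using a tiny stand-in Keras model; in the actual kernel the input is the fine-tuned DistilBERT model:

```python
import tensorflow as tf

# Tiny stand-in model; the real input is the fine-tuned DistilBERT SavedModel/Keras model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Dynamic-range quantization: weights stored as int8, activations kept in float.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dr_model = converter.convert()

# float16 quantization: weights stored as float16.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

print(len(dr_model), len(fp16_model))  # serialized FlatBuffer sizes
```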
Surprisingly, the TensorFlow Lite models achieve chance-level performance (~50% accuracy) on the development set. This is in sharp contrast with the original fine-tuned model, whose accuracy is about ~90%.
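For reference, my evaluation path (step 3) is essentially the standard `tf.lite.Interpreter` loop. A minimal sketch with a stand-in model (the real `.tflite` file and tokenized SST-2 inputs are in the notebook); one caveat worth noting in general is that with multiple inputs (`input_ids`, `attention_mask`) the tensors must be matched by name from `get_input_details()`, not fed in an assumed order:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the converted DistilBERT model; the real file would be loaded
# via tf.lite.Interpreter(model_path="...tflite").
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(2)])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# With multiple inputs, inspect the names here and set each tensor by the
# matching index; a swapped feeding order can drop accuracy to chance.
for d in input_details:
    print(d["name"], d["shape"], d["dtype"])

x = np.random.rand(1, 4).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
logits = interpreter.get_tensor(output_details[0]["index"])
pred = int(np.argmax(logits, axis=-1)[0])
print(pred)
```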
I am wondering if I am missing something. Any pointers would be really helpful.
Cc: @khanhlvg