huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPT2-large ONNX export and int8 quantization #16195

Open lucasjinreal opened 2 years ago

lucasjinreal commented 2 years ago

Hi, for a model as big as 7GB, does transformers support export to ONNX? Is there any tutorial about exporting big models?

lewtun commented 2 years ago

Hi @jinfagang, as far as I know the transformers.onnx package should work for exporting large models. The only difference is that you'll typically see a number of additional files created, because ONNX uses Protobuf and this can only serialise files in 2GB chunks.
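
For illustration, the export path looks roughly like this (a minimal sketch based on the transformers.onnx Python API; the gpt2-large checkpoint and output path are placeholders rather than the actual CPM checkpoint discussed below):

from pathlib import Path

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.onnx import FeaturesManager, export

checkpoint = "gpt2-large"  # placeholder; any causal-lm checkpoint follows the same path
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Look up the ONNX config for this architecture/feature pair
model_kind, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="causal-lm")
onnx_config = onnx_config_cls(model.config)

# Export; for models above the 2GB Protobuf limit the weights end up as
# additional files written next to model.onnx
onnx_path = Path("onnx/model.onnx")
onnx_path.parent.mkdir(parents=True, exist_ok=True)
onnx_inputs, onnx_outputs = export(tokenizer, model, onnx_config, onnx_config.default_onnx_opset, onnx_path)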

It would be helpful to have a reproducible code example if you are having trouble exporting a particular checkpoint.

lucasjinreal commented 2 years ago

@lewtun Hi, actually I am just trying to use the same technique as in the BERT quantization example inside onnxruntime. BERT can be optimized, and I can shrink the model size from 1.6GB to 400MB in int8. But when I try it on the 7GB model, it fails. So I don't know whether the optimization doesn't support such a big model, or whether the huge model can't be properly loaded by ORT. It just returns an error like "read onnx model failed".
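
For context, the dynamic int8 quantization from the onnxruntime BERT example boils down to something like this (a sketch, not the exact script used here; the file names are placeholders and the use_external_data_format flag is my assumption for the >2GB case):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (weight-only) int8 quantization, as in the onnxruntime BERT example.
quantize_dynamic(
    model_input="gpt2_large.onnx",        # exported (or graph-optimized) model
    model_output="gpt2_large_int8.onnx",
    weight_type=QuantType.QInt8,
    use_external_data_format=True,        # assumption: needed to write a >2GB result
)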

lewtun commented 2 years ago

Hey @jinfagang can you please share a code snippet that shows which checkpoint you're using and an example of how you're loading the exported model?

lucasjinreal commented 2 years ago

@lewtun Hi, I assume you're familiar with GPT2 and Megatron. Let me explain how I did it.

  1. First, I downloaded a pretrained giant Chinese GPT2-large model, which was probably trained with Megatron-LM. It is so huge that the weights are split into 2 parts and have to be loaded onto 2 cards. You can download it from https://huggingface.co/TsinghuaAI/CPM-Generate (oh, I just realized it is already on the Hugging Face Hub);
  2. then I converted the model to ONNX with the Hugging Face tooling you must be familiar with, via python -m transformers.onnx ./models;
  3. that gives you a huge ONNX model, split into several smaller weight files (this is the way ONNX stores huge models; see the loading sketch after this list);
  4. then I ran the same optimization from onnxruntime to optimize it, and it fails to produce an int8 model.
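
For what it's worth, a quick way to check whether a multi-file export is readable at all (a sketch; the path is a placeholder, and the external weight files must sit next to the main .onnx file, since they are referenced by relative path):

import onnx
import onnxruntime as ort

onnx_path = "onnx/model.onnx"  # placeholder; run from the directory layout produced by the export

model = onnx.load(onnx_path)           # also pulls in the external data files
onnx.checker.check_model(onnx_path)    # pass the path, not the proto, for models over 2GB

session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
print([i.name for i in session.get_inputs()])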

The quantization step looks like this:

import logging

from onnxruntime.transformers.fusion_options import FusionOptions
from onnxruntime.transformers.optimizer import optimize_model

logger = logging.getLogger(__name__)

model_type = "gpt2"
optimization_options = FusionOptions(model_type)
optimization_options.enable_gelu = True
optimization_options.enable_layer_norm = True
optimization_options.enable_attention = True
optimization_options.enable_skip_layer_norm = True
optimization_options.enable_embed_layer_norm = True
optimization_options.enable_bias_skip_layer_norm = True
optimization_options.enable_bias_gelu = True
optimization_options.enable_gelu_approximation = False

logger.warning(f">>>>>>> Start optimizing ONNX graph on: {model_type}")
# for Megatron GPT2
optimizer = optimize_model(
    onnx_model_f,  # path to the exported ONNX model file
    model_type=model_type,
    # num_heads=16,      # smaller GPT2 (works)
    num_heads=32,        # Megatron GPT2-large
    # hidden_size=1024,  # smaller GPT2 (works)
    hidden_size=2560,    # Megatron GPT2-large
    optimization_options=optimization_options,
    opt_level=0,
    use_gpu=False,
    only_onnxruntime=False,
)
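
After optimize_model, the graph still has to be written back to disk before quantization; a model this size cannot fit in a single Protobuf, so something along these lines would be needed (a sketch; the use_external_data_format flag on save_model_to_file is my assumption about the installed onnxruntime version):

# Write the optimized graph back, keeping the external data format for the 7GB model
optimizer.save_model_to_file("gpt2_large_opt.onnx", use_external_data_format=True)

# The int8 step would then run on this saved file, e.g. with quantize_dynamic
# as sketched earlier in the thread.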

As you can see, the commented-out values are for the smaller GPT2 model, which can be quantized successfully. But the huge model fails. If you are interested, try the same 7.0GB model quantization; that would be great, and I'd really like to see whether you can manage to convert it.