tlc-pack / TLCBench

Benchmark scripts for TVM

Add benchmark models that are not easily accessible #5

Closed masahi closed 2 years ago

masahi commented 2 years ago

The motivation is that there has been growing interest in testing and benchmarking int8 BERT. But BERT and other transformer models are hard to quantize or import into TVM properly, so until recently they were not available to us for benchmarking.

Recently I found that the NVIDIA FasterTransformer repo has an example of quantizing BERT via PTQ or QAT using TensorRT's pytorch_quantization tool. And thanks to the recent work in https://github.com/apache/tvm/pull/10239, we can now import those "fake-quantized" QAT BERT models into Relay and convert them into fully integer models.

Example usages will be provided soon in the tvm repo under python/tvm/meta_schedule/testing/XXX.py. Also see https://github.com/tlc-pack/TLCBench/pull/5#issuecomment-1067431710
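For reference, the import path looks roughly like this. This is only a minimal sketch, not the final script: the ONNX file name, input names, and shapes are placeholders, and the conversion pass is the FQ2I pass from the PR above.

```python
import onnx
from tvm import relay

# Load a "fake-quantized" BERT exported to ONNX (QDQ-style QuantizeLinear /
# DequantizeLinear nodes produced by pytorch_quantization).
# File name, input names, and shapes below are placeholders.
onnx_model = onnx.load("bert_qat.onnx")
shape_dict = {
    "input_ids": (1, 384),
    "segment_ids": (1, 384),
    "input_mask": (1, 384),
}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict, freeze_params=True)

# Rewrite the QDQ (fake-quantized) graph into a fully integer Relay model.
mod = relay.transform.InferType()(mod)
mod = relay.transform.FakeQuantizationToInteger()(mod)
```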

I wonder if we need to worry about license issues?

cc @junrushao1994 @areusch @tqchen @comaniac

areusch commented 2 years ago

cc @driazati as there's some overlap with tlc-pack/ci-data

comaniac commented 2 years ago

IIUC, you want to add the serialized binary files of these models to this repo. In terms of the license, I'd like to confirm: did you generate these binary files yourself based on FasterTransformer, or are they cloned directly from somewhere in that repo?

I checked the FasterTransformer repo and it is under the Apache-2.0 license, so it is fine for us to use any code from that repo. We only need to add a line saying this is modified from the FasterTransformer repo. In the case of binary files, I think a separate README.md under the same directory also works.

masahi commented 2 years ago

Yes, I generated both of them myself. I added an ugly hack to one of the scripts in FasterTransformer to manually export the model to ONNX. A new README is there already; I can add more details on the export process if desired.
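
The hack boils down to something like the following. This is a rough sketch, not the actual FasterTransformer patch; the model object, example inputs, input names, and output file name are placeholders.

```python
import torch
from pytorch_quantization import nn as quant_nn


def export_fake_quant_onnx(model, dummy_inputs, out_path="bert_qat.onnx"):
    # Make TensorQuantizer export as QuantizeLinear/DequantizeLinear (QDQ)
    # nodes instead of its custom fake-quant ops.
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    model.eval()
    torch.onnx.export(
        model,             # the QAT-fine-tuned BERT (placeholder)
        dummy_inputs,      # tuple of example input tensors (placeholder)
        out_path,
        opset_version=13,
        do_constant_folding=True,
        input_names=["input_ids", "segment_ids", "input_mask"],
    )
```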

I was not aware of ci-data, but these models are not for CI. So I think this repo is better.

comaniac commented 2 years ago

Got it. Then IMHO we just need to explicitly say something like "generated from XXX under the Apache-2.0 license" in the README.

masahi commented 2 years ago

OK, I added more details on the export process and a "licensed under Apache-2.0" blurb. The README is here: https://github.com/masahi/TLCBench/blob/bench-models/models/README.md

I think it is good to go.

masahi commented 2 years ago

An example of how to use quantized BERT (running it requires https://github.com/apache/tvm/pull/10596):
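
Roughly, the script builds the converted Relay module for each target and times it with the graph executor's benchmark helper. Below is a minimal sketch, not the exact script: `mod`, `params`, and `shape_dict` come from an import like the one sketched above, and the input dtypes are an assumption about the export.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def benchmark_bert(mod, params, shape_dict, target):
    # mod/params: the FQ2I-converted BERT from the import sketch above.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    dev = tvm.cuda(0)
    runtime = graph_executor.GraphModule(lib["default"](dev))
    # Feed random token ids; int64 inputs are an assumption about the export.
    for name, shape in shape_dict.items():
        runtime.set_input(name, np.random.randint(0, 100, size=shape).astype("int64"))
    print("Evaluate inference time cost with target %s ..." % target)
    print(runtime.benchmark(dev, number=1, repeat=50))


# mod, params, shape_dict as defined in the import sketch above.
for target in ["cuda", "cuda -libs=cublas"]:
    benchmark_bert(mod, params, shape_dict, target)
```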

Example output:

One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Evaluate inference time cost with target cuda ... 
Execution time summary:                     
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
  18.4024      18.2218      19.6602      18.1031       0.4598   

One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Evaluate inference time cost with target cuda -libs=cublas ...
Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
   9.2887       9.2200       9.7776       9.1559       0.2160