intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ImageNet ResNet50 training pipeline #2798

Open jason-dai opened 5 years ago

jason-dai commented 5 years ago

Need an end-to-end pipeline for large-scale ImageNet ResNet50 training (16 or 32 nodes) using the MKL-DNN backend.

wzhongyuan commented 5 years ago

This is the ResNet-50 training on ImageNet. Is this sufficient, and if not, what additional functionality is required?

jason-dai commented 5 years ago

Do you have the details (node count, hyper-parameters, etc.)?

wzhongyuan commented 5 years ago

Yes, the link below contains the parameters that reproduce the result.

https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/models/resnet

where you can find:

spark-submit \
--verbose \
--master spark://xxx.xxx.xxx.xxx:xxxx \
--driver-memory 200g \
--conf "spark.serializer=org.apache.spark.serializer.JavaSerializer" \
--conf "spark.network.timeout=1000000" \
--executor-memory 200g \
--executor-cores 32 \
--total-executor-cores 2048 \
--class com.intel.analytics.bigdl.models.resnet.TrainImageNet \
dist/lib/bigdl-VERSION-jar-with-dependencies.jar \
-f hdfs://xxx.xxx.xxx.xxx:xxxx/imagenet \
--batchSize 8192 --nEpochs 90 --learningRate 0.1 --warmupEpoch 5 \
--maxLr 3.2 --cache /cache --depth 50 --classes 1000
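
As a sanity check, these hyper-parameters are consistent with the linear scaling rule for large-batch SGD (Goyal et al., 2017): a base learning rate of 0.1 per 256 images scaled linearly with the batch size gives 0.1 × 8192 / 256 = 3.2, which matches --maxLr 3.2, with --warmupEpoch 5 presumably ramping the rate from --learningRate 0.1 up to the maximum over the first five epochs.
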
wzhongyuan commented 5 years ago

The pictures are raw images (not resized). We trained on 64 nodes with the above hyper-parameters and got 76.12% Top-1 accuracy.

jason-dai commented 5 years ago

Is it based on the MKL-DNN backend?

wzhongyuan commented 5 years ago

No, it's based on MKL-BLAS, but the time to train has been reduced to ~40 hours.

jason-dai commented 5 years ago

OK - then we need an end-to-end pipeline for large-scale ImageNet ResNet50 training (16 or 32 nodes) using the MKL-DNN backend :smiley:
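
For reference, a minimal sketch of what that submission might look like, assuming BigDL selects its math backend via the bigdl.engineType JVM property (mklblas by default, mkldnn for the MKL-DNN engine), which has to reach both the driver and the executor JVMs:

# Assumption: -Dbigdl.engineType=mkldnn switches BigDL from the
# MKL-BLAS engine to the MKL-DNN engine on every JVM in the job.
spark-submit \
--verbose \
--master spark://xxx.xxx.xxx.xxx:xxxx \
--driver-memory 200g \
--conf "spark.serializer=org.apache.spark.serializer.JavaSerializer" \
--conf "spark.network.timeout=1000000" \
--conf "spark.driver.extraJavaOptions=-Dbigdl.engineType=mkldnn" \
--conf "spark.executor.extraJavaOptions=-Dbigdl.engineType=mkldnn" \
--executor-memory 200g \
--executor-cores 32 \
--total-executor-cores 2048 \
--class com.intel.analytics.bigdl.models.resnet.TrainImageNet \
dist/lib/bigdl-VERSION-jar-with-dependencies.jar \
-f hdfs://xxx.xxx.xxx.xxx:xxxx/imagenet \
--batchSize 8192 --nEpochs 90 --learningRate 0.1 --warmupEpoch 5 \
--maxLr 3.2 --cache /cache --depth 50 --classes 1000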