smpanaro / more-ane-transformers

Run transformers (incl. LLMs) on the Apple Neural Engine.

Pythia 2.8b converted by convert.py script will not call ANE #1

Open adonishong opened 1 year ago

adonishong commented 1 year ago

I appreciate your work on this.

My test machine is an M2 Max with 64GB of memory.

With the generate.py script, the Pythia 2.8b mlpackage from the GitHub release uses the ANE with either --compute_unit="All" or --compute_unit="CPUAndANE". However, if I convert Pythia 2.8b with convert.py, the resulting mlpackage does not use the ANE: with --compute_unit="All", only the CPU and GPU are used; with --compute_unit="CPUAndANE", only the CPU is used. Pythia-410m behaves differently: both the mlpackage downloaded from the GitHub release and the one converted with the convert.py script use the ANE.

BTW, Pythia-6.9b converts fine with the convert.py script, and it works well with generate.py and --compute_unit="CPUAndGPU", but it does not use the ANE either.

smpanaro commented 1 year ago

👋 Hey!

The 2.8b in the GitHub release is not directly the result you get by running convert.py. For larger models you need to do two more steps:

  1. First split the model into chunks. You will probably need to edit this line -- it looks like I used 670 for the 2.8b model.
    python -m src.experiments.chunk_model --mlpackage-path pythia-1.4b_2023_04_11-20_54_12.mlpackage -o .
  2. This should give you multiple files that end in _chunk{1,2,3,...}.mlpackage.
  3. Next you need to join the chunks into a single pipeline model (see the sketch after this list). The argument can be any of the _chunk{1,2,3,...}.mlpackage files.
    python -m src.experiments.make_pipeline pythia-1.4b_2023_04_11-20_54_12_chunk1.mlpackage
  4. You can tell that it worked by choosing Show Package Contents and seeing that the Data > com.apple.CoreML > weights folder contains many files (one per chunk).
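
In case it helps, I believe recent coremltools (7+) ships a pipeline helper that does essentially the same join as the src.experiments.make_pipeline script. A minimal sketch, with illustrative file names:

    import coremltools as ct

    # Illustrative chunk file names -- use the _chunk{1,2,...}.mlpackage
    # files produced by the chunking step.
    chunks = [
        ct.models.MLModel(f"pythia-2.8b_chunk{i}.mlpackage")
        for i in (1, 2, 3)
    ]

    # Join the chunks into one model; each chunk's outputs feed the next
    # chunk's inputs, and the weights stay split (one blob per chunk).
    pipeline = ct.models.utils.make_pipeline(*chunks)
    pipeline.save("pythia-2.8b_pipeline.mlpackage")

The repo script is still the canonical path; this is just the same idea using the library helper.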

That should allow you to recreate the 2.8b model that runs on ANE. Two more things that might be helpful:

Measuring ANE

I'm not sure how you are checking to see if the model runs on the ANE, but I would recommend using the --wait flag and attaching the CoreML tool from Instruments. Xcode really struggles with these larger models.

python generate.py --model_path gpt2-medium.mlmodelc --compute_unit CPUAndANE --wait
[Screenshots of the Instruments CoreML trace omitted.]
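
If you'd rather do this from a Python REPL than through generate.py, the idea behind --wait is presumably just to pause before the first prediction so Instruments can attach. A rough sketch (the model path and input names are illustrative, not the real model's):

    import os
    import numpy as np
    import coremltools as ct

    # compute_units mirrors the --compute_unit flag.
    model = ct.models.MLModel(
        "gpt2-medium.mlpackage",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

    # Pause so the Instruments CoreML template can attach to this process.
    print(f"pid {os.getpid()} -- attach Instruments, then press Enter")
    input()

    # Illustrative input; check your model's spec for the real names/shapes.
    model.predict({"input_ids": np.zeros((1, 512), dtype=np.int32)})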

For the chunked models you should see one "Neural Engine Prediction" block for each chunk of the model -- it will be obvious if some chunks run on the ANE and some do not. (The screenshot below is not of a chunked model.) There will be a small gap between blocks where execution returns to the CPU, but it should be very brief.

[Screenshot of the Instruments CoreML trace omitted.]

6.9b Model

I only have an M1, but I think there is a chance you can get the 6.9b running on the M2's ANE. You will definitely need to use the chunk_model and make_pipeline tools. I would start with a chunk size of 670 (like the 2.8b) and try smaller if that doesn't work. Let me know if you try it -- I'd be happy to help figure out how to get it working!
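
One quick way to sanity-check placement without Instruments is to compare prediction latency across compute units: if CPU_AND_NE is no faster than CPU_ONLY, the chunks have probably fallen back to the CPU. A rough sketch (path and input are illustrative):

    import time
    import numpy as np
    import coremltools as ct

    def mean_latency(compute_units, runs=5):
        model = ct.models.MLModel(
            "pythia-6.9b_pipeline.mlpackage", compute_units=compute_units
        )
        x = {"input_ids": np.zeros((1, 512), dtype=np.int32)}
        model.predict(x)  # warm-up; the first call includes load/compile
        start = time.perf_counter()
        for _ in range(runs):
            model.predict(x)
        return (time.perf_counter() - start) / runs

    print("CPU only:", mean_latency(ct.ComputeUnit.CPU_ONLY))
    print("CPU+ANE: ", mean_latency(ct.ComputeUnit.CPU_AND_NE))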

Sorry for the slow response and also that all of this is missing from the documentation.