Open · olegklimov opened this issue 1 year ago
/bounty $2000
$2,000 bounty created by olegklimov
If you start working on this, comment /attempt #77 to notify everyone
To claim this bounty, submit a pull request that includes the text /claim #77 somewhere in its body
Before proceeding, please make sure you can receive payouts in your country
Payment arrives in your account 2-5 days after the bounty is rewarded
You keep 100% of the bounty award
Thank you for contributing to smallcloudai/refact!
Attempt | Started (GMT+0) | Solution
---|---|---
@Akshay-Patel-dev | Aug 25, 2023, 11:44:51 PM | WIP
@shobhit9957 | Aug 26, 2023, 10:38:57 AM | WIP
@benxh1995 | Sep 4, 2023, 11:51:23 PM | WIP
@ds5t5 | Sep 25, 2023, 1:52:54 AM | #122
/attempt #77
/attempt #77 Hey @olegklimov, I would like to contribute. Can you please provide some more description of this project? I'm a beginner here...
Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution.
> I'm a beginner here...
You can start by installing it and trying it out.
But unless you are already familiar with CPU inference libraries and LLMs in general, it might take you quite a long time to research.
I forked the project and performed the steps in the contributing.md file, but I'm getting errors and am unable to run it locally.
Because of the error I encountered, I added install_requires=[ "triton>=12 0.0.3", ] to setup.py. Do you think adding this to the main branch is necessary?
CPU project names: ggml, ctransformers
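To make the ctransformers route concrete, a minimal sketch of CPU inference with it is below; the model file and model_type are placeholders (at this point in the thread no GGML/GGUF conversion of Refact-1.6B existed, so a StarCoder-family file stands in).

```python
# Minimal CPU-inference sketch with ctransformers; the model file below is a
# placeholder -- any GGML/GGUF StarCoder-family file can stand in, since no
# Refact-1.6B conversion existed yet at this point.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./starcoder-1b-q8_0.bin",   # hypothetical local GGML file
    model_type="starcoder",      # tells the ggml backend which architecture to use
)

# Stream tokens as they are generated.
for token in llm("def hello_world():\n", max_new_tokens=50, stream=True):
    print(token, end="", flush=True)
```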
/attempt #77
I've got a preliminary version working with ctransformers. Inference on my M1 Mac for Starcoder is almost impossibly slow. The Refact-1.6b model still doesn't have GGUF or GGML versions available, and my attempts to make my own quants with the official quantization scripts have failed.
I can have a codellama FIM 7B demo up and running soon.
Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution.
An interesting link: https://github.com/ggerganov/llama.cpp/discussions/2948 -- how to convert HuggingFace model to GGUF format
Example of GGUFs of all sizes: https://huggingface.co/TheBloke/Llama-2-7B-GGUF
@olegklimov
If this is still open, I might try it out.
Would the bounty claim still count for model conversion to GGUF format?
I understand it's first come, first served. I'm just wondering whether you're looking for a conversion script or general CPU support.
Quantization is a bit different from CPU inference, and I'm just looking for clarity on the scope.
If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open.
Hi @teleprint-me
Someone is already doing the heavy lifting here: https://github.com/ggerganov/llama.cpp/issues/3061
@olegklimov
Yes, I saw that. That's why I'm asking.
I know that in order to do it, one would need to use the GGUF library to convert the tensors.
It would require a custom script, like the others that already exist in the llama.cpp repository.
Your original request was in reference to the inference_hf.py script, which is why I was asking for clarification.
@teleprint-me We are moving away from server-side scratchpads in favor of client-side scratchpads. The plugins that can do this should land next week or the week after. There still has to be a script that takes the tasks to do, using completions_wait_batch() (in inference_worker.py), and streams the results, but soon only simple left-to-right completion will be required.
In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work".
Script to test:
curl http://127.0.0.1:8008/v1/completions -k \
-H 'Content-Type: application/json' \
-d '{
"model": "smallcloudai/Refact-1_6b-fim",
"prompt": "def hello_world():\n \"\"\"\n This function prints \"Hello World!!!\" and brews coffee.\n \"\"\"",
"stream": true,
"echo": false,
"stop": ["\n\n"],
"temperature": 0.8,
"max_tokens": 50
}'
Both streaming and non-streaming should work, and the CPU output should match the current GPU output -- that sounds like a well-defined criterion.
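The same test can be driven from Python instead of curl; the sketch below assumes the server streams OpenAI-style `data: {...}` lines (server-sent events), which may differ from the actual wire format.

```python
# Rough Python equivalent of the curl test above; assumes OpenAI-style
# "data: {...}" server-sent-event lines, which may not match the real format.
import json
import requests

payload = {
    "model": "smallcloudai/Refact-1_6b-fim",
    "prompt": 'def hello_world():\n    """\n    This function prints "Hello World!!!" and brews coffee.\n    """',
    "stream": True,
    "echo": False,
    "stop": ["\n\n"],
    "temperature": 0.8,
    "max_tokens": 50,
}

with requests.post("http://127.0.0.1:8008/v1/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)
```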
@olegklimov
That's exactly what I was looking for, thank you for the update.
I'll be reviewing the other open bounties in the coming days as well.
Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant.
If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR.
Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively.
/attempt https://github.com/smallcloudai/refact/issues/77
@ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward.
@ds5t5: To receive payouts, sign up on Algora, link your GitHub account and connect with Stripe on your dashboard.
Testing this:
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python" --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0
I see speed:
Xeon 5315Y | Threads (-t N) | Speed (tokens/s)
---|---|---
 | -t 2 | 6
 | -t 4 | 11
 | -t 8 | 11
 | -t 16 | 4
On the M1, speed doesn't depend on the number of threads.
Time to first token, with a 551-token prompt:
I'd say that's the main problem for adoption. A 551-token prompt isn't even that big; normally we have about 1950 tokens.
I tried Starcoder 1b, converted by TabbyML:
https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml
"-m", "starcoder-1b-q8_0.gguf",
897.71 ms / 557 tokens ( 1.61 ms per token, 620.47 tokens per second)
1334.68 ms / 49 runs ( 27.24 ms per token, 36.71 tokens per second)
"-m", "./starcoder-1b-f16.gguf",
841.99 ms / 557 tokens ( 1.51 ms per token, 661.53 tokens per second)
243.18 ms / 49 runs ( 45.78 ms per token, 21.84 tokens per second)
"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
175.27 ms / 557 tokens ( 2.11 ms per token, 473.93 tokens per second)
962.51 ms / 49 runs ( 60.46 ms per token, 16.54 tokens per second)
@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp.
@olegklimov
- MacBook Air M1
Try the 4-bit model; you should see a performance boost compared to the 16-bit model.
4-bit
llama_print_timings: load time = 45.88 ms
llama_print_timings: sample time = 3.91 ms / 300 runs ( 0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time = 56.82 ms / 9 tokens ( 6.31 ms per token, 158.38 tokens per second)
llama_print_timings: eval time = 6762.85 ms / 299 runs ( 22.62 ms per token, 44.21 tokens per second)
llama_print_timings: total time = 6933.22 ms
8-bit
llama_print_timings: load time = 71.79 ms
llama_print_timings: sample time = 3.72 ms / 300 runs ( 0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time = 54.23 ms / 9 tokens ( 6.03 ms per token, 165.94 tokens per second)
llama_print_timings: eval time = 11387.12 ms / 299 runs ( 38.08 ms per token, 26.26 tokens per second)
llama_print_timings: total time = 11553.91 ms
16-bit
llama_print_timings: load time = 5828.46 ms
llama_print_timings: sample time = 4.17 ms / 300 runs ( 0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time = 72.36 ms / 9 tokens ( 8.04 ms per token, 124.38 tokens per second)
llama_print_timings: eval time = 20573.06 ms / 299 runs ( 68.81 ms per token, 14.53 tokens per second)
llama_print_timings: total time = 20760.76 ms
The 16-bit and 32-bit converted tensor formats perform about the same on lower-end hardware.
Also, llama.cpp is still working on its FIM implementation.
In case you aren't too familiar with the library or quant types: quants range from 2-bit to 16-bit, and k-quant variants are supported.
OK it works nicely! So all the credit goes to @ds5t5, right?
@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)
@ds5t5 Hi there!
We are going to slightly change the modelling code and the weights on HF. The changes will include merging:
- attn.k and attn.v into attn.kv
- mlp.linear_1 and mlp.linear_3 into mlp.gate_up_proj
Guess we need to update https://github.com/ggerganov/llama.cpp/pull/3329 as well
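At the state-dict level the fusion amounts to concatenating the split projections; a hypothetical sketch is below (parameter names, shapes, and layer count are illustrative, not taken from the actual refact modelling code).

```python
# Hypothetical sketch of the weight fusion described above; parameter names,
# shapes and the layer count are illustrative, not from the refact codebase.
import torch

def fuse_refact_weights(sd: dict, n_layers: int = 32) -> dict:
    fused = dict(sd)
    for i in range(n_layers):
        k, v = f"transformer.h.{i}.attn.k.weight", f"transformer.h.{i}.attn.v.weight"
        if k in fused and v in fused:
            # attn.k + attn.v -> attn.kv, concatenated along the output dimension
            fused[f"transformer.h.{i}.attn.kv.weight"] = torch.cat([fused.pop(k), fused.pop(v)], dim=0)
        g, u = f"transformer.h.{i}.mlp.linear_1.weight", f"transformer.h.{i}.mlp.linear_3.weight"
        if g in fused and u in fused:
            # mlp.linear_1 + mlp.linear_3 -> mlp.gate_up_proj
            fused[f"transformer.h.{i}.mlp.gate_up_proj.weight"] = torch.cat([fused.pop(g), fused.pop(u)], dim=0)
    return fused
```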
Thanks. Let me know when the model weights are ready; I will rebase my llama.cpp PR onto the latest branch of llama.cpp.
@JegernOUTT Can I ask why we decided to make the weight change? It seems not quite aligned with other popular models: they (Falcon, LLaMA) usually keep mlp.linear_1 and mlp.linear_3 separate, while for attention it is usually a fused qkv or separate q/k/v. Only the original GPT-2 model fuses kv into one tensor.
@ds5t5 We've updated the weights
We are using different inference backends in refact, and when we train LoRA models we struggle with modelling differences. So we've decided to make these changes to the model and synchronize the implementation everywhere, rather than keep some "hacks".
@JegernOUTT it seems like the latest push breaks tokenizer = AutoTokenizer.from_pretrained("smallcloudai/Refact-1_6B-fim")
@ds5t5 what problem do you have? I've just checked it and found no issues
Never mind, I removed my cache and it works.
I'm working on a mod to get the HF Refact model to run on CPU, since I don't have a working GPU backend at the moment. There aren't too many changes either; I just need to get the server running.
I'm also working on a Refact template for llama-cpp-python for inference in refact, so it would just be plug and play. This won't work until @ds5t5's downstream changes make it into llama-cpp-python, though.
Hopefully I'll have it done by the end of this weekend.
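As a rough illustration of what such a template could look like (not the actual implementation), the sketch below assumes a local GGUF conversion of Refact-1.6B and StarCoder-style FIM special tokens; both the path and the token names are assumptions.

```python
# Rough FIM sketch with llama-cpp-python; the GGUF path and the StarCoder-style
# FIM tokens (<fim_prefix>/<fim_suffix>/<fim_middle>) are assumptions.
from llama_cpp import Llama

llm = Llama(model_path="./Refact-1_6B-fim/ggml-model-q4_0.gguf", n_ctx=2048)

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

out = llm(prompt, max_tokens=32, temperature=0.2, stop=["<|endoftext|>"])
print(out["choices"][0]["text"])
```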
@teleprint-me We were thinking more along the lines of bundling llama.cpp with our Rust binary, linked together. The Rust binary ships with our next-gen plugins, such as the one for VS Code. This might allow for a much lower cost of installation for the end user: no Docker, nothing to install, no strange packages in local Python, nothing to run separately or care about.
The largest problem is prompt prefill, about 4 seconds for 2048 tokens, on Apple M1. That's a bit too long for interactive use.
So I asked in llama.cpp what people think about an architecture more suitable for CPU or M1, here: https://github.com/ggerganov/llama.cpp/discussions/3395 . We can train a new model so it prefills the prompt faster; we have the data and the GPUs!
Or maybe the M2 will fix the speed :joy: (I haven't tried it yet).
@olegklimov
Alright, no worries! After reviewing the code and attempting to come up with a minimalistic solution, this sounds like a better path forward, if I'm being honest. You should probably mark this as solved. @ds5t5 definitely got this one.
I have updated the converter in the llama.cpp PR based on the latest revision on the Hugging Face hub. It looks like the llama.cpp community wants to wait for a few PRs to be merged before the Refact PR is officially merged. I see another 5-10% performance boost after rebasing my change onto the latest commit of llama.cpp. @olegklimov
@ds5t5: Your claim has been rewarded! We'll notify you once it is processed.
@ds5t5 has been awarded $2,000!
The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?
> The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm?

Any updates?
Yes, we'll release bring-your-own-key in a few days
> Yes, we'll release bring-your-own-key in a few days
Bring your own key is there, but the docker container still doesn't work on an M1.
You are right, it doesn't. Other servers do work, though; you can help us if you test them!
There are several projects aiming to make inference on CPU efficient.
The first part is research: inference_hf.py does it (needs a callback that streams output and allows to stop). Please finish the first part, get a "go-ahead" for the second part.
The second part is implementation: inference_hf.py,
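The issue text breaks off above. As one possible reading of the "callback that streams output and allows to stop" requirement, here is a hypothetical interface sketch (names are made up, not from the repo):

```python
# Hypothetical shape for the "callback that streams output and allows to stop"
# requirement; names are made up and do not come from the refact codebase.
from typing import Callable, Iterable

# The callback receives each new text fragment and returns False to stop early.
StreamCallback = Callable[[str], bool]

def generate_with_callback(token_stream: Iterable[str], on_token: StreamCallback) -> str:
    produced = []
    for fragment in token_stream:
        produced.append(fragment)
        if not on_token(fragment):  # e.g. client disconnected or stop sequence hit
            break
    return "".join(produced)

# Usage sketch: stop once a blank line appears, like the "stop": ["\n\n"] field above.
text = generate_with_callback(
    iter(["def hello():", "\n", "    pass", "\n\n", "unused"]),
    lambda fragment: "\n\n" not in fragment,
)
```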