randaller / llama-chat

Chat with Meta's LLaMA models at home made easy
GNU General Public License v3.0

Hello, how can the output be trimmed? #2

Closed lucasjinreal closed 1 year ago

lucasjinreal commented 1 year ago
[image]

Currently it outputs all inference steps and intermediate results. Is there a way to print only the answer?

randaller commented 1 year ago

There is a way, it is open source! :) I decided to leave the output this way because I sometimes run inference with the 65B model and want to watch the generation process, so I can terminate it if needed, as it goes very slowly.
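If you only want the final answer, one option is to skip the per-step printing and strip the prompt from the decoded output at the end. The sketch below is illustrative only and does not use the repo's actual functions; `generate` and the assumption that the decoded text starts with the prompt are hypothetical.

```python
# Minimal sketch (not the repo's real API): print only the model's answer,
# assuming the full decoded text begins with the original prompt.

def extract_answer(prompt: str, generated: str) -> str:
    """Return only the text the model produced after the prompt."""
    # Hypothetical assumption: the decoded output starts with the prompt itself.
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()

# Illustrative usage (function name is an assumption, not the repo's):
# full_text = generate(prompt)            # run inference without per-step prints
# print(extract_answer(prompt, full_text))
```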

lucasjinreal commented 1 year ago

@randaller How much CPU memory is needed to run inference with the 30B model? I adapted this to run inference on a GPU with 16 GB of memory; it seems fast, but the prompt results are not very good.

randaller commented 1 year ago

How much CPU memory is needed to run inference with the 30B model?

@jinfagang About 70 gigabytes of RAM is needed for the 30B model. It should work with any amount of RAM, just very slowly, as the system will rely heavily on the swap file.
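For context, the ~70 GB figure roughly matches a back-of-the-envelope estimate: 30B parameters at 2 bytes each (fp16) is about 60 GB for the weights alone, plus headroom for activations and runtime buffers. A rough sketch (the overhead value is an assumption, not a measured number):

```python
# Rough RAM estimate for 30B CPU inference (illustrative, not exact).
params = 30e9            # 30 billion parameters
bytes_per_param = 2      # fp16 weights
weights_gb = params * bytes_per_param / 1e9   # ~60 GB of weights
overhead_gb = 10         # assumed headroom for activations and buffers
print(f"~{weights_gb + overhead_gb:.0f} GB total")  # prints "~70 GB total"
```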