turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Add Huggingface tokenizer support #189

Closed DOGEwbx closed 9 months ago

DOGEwbx commented 9 months ago

Add logic to decide whether to use the Huggingface tokenizer or the SentencePiece tokenizer. This adds support for models that use a Huggingface tokenizer, such as Falcon and Deepseek Coder.

CyberTimon commented 9 months ago

Thank you so much for this work @DOGEwbx. I've been waiting for Deepseek Coder support for a while. This is very helpful.

CyberTimon commented 9 months ago

Is creating Exl2 quants with this also possible?

DOGEwbx commented 9 months ago

@CyberTimon Thanks for your interest in our work. I haven't run tests on EXL2 quant files, but since all the modifications are in the tokenizer, I don't expect problems with that specific data format. I'm not very familiar with model quantization techniques, so it would help if you could give me some model checkpoints or conversion scripts so that I can run some tests.

turboderp commented 9 months ago

At the very least, this is going to take some time to review.

Transformers is a massive dependency to include just to support one model (Falcon still wouldn't work as there are other architectural differences).

As for remote code, my guess would be that 90% of users are unaware of the risks involved, so it should at least be opt-in.

I'll need a moment to think about it, to test that this doesn't break functionality like constrained sampling, and make sure there really isn't a better way.

DOGEwbx commented 9 months ago

Thanks for your reply. For the Transformers dependency issue, using the Tokenizers library instead would be a solution (though it only supports fast tokenizers). And you're right about trust_remote_code; I wasn't fully aware of the risks either. We could disable it, since the model should already be local.
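
As a rough illustration of what that could look like (not the code in this PR; the file path is hypothetical), the standalone Tokenizers library can load a model's tokenizer.json directly, without pulling in all of Transformers:

    from tokenizers import Tokenizer

    # Load the fast tokenizer definition shipped with the model (hypothetical path).
    tok = Tokenizer.from_file("models/deepseek-coder-1.3b-instruct/tokenizer.json")

    # Encode without letting the tokenizer inject special tokens, so BOS/EOS
    # handling stays with the caller.
    ids = tok.encode("def hello():", add_special_tokens=False).ids
    print(ids, repr(tok.decode(ids)))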

SinanAkkoyun commented 9 months ago

Is there any specific way to use the fork? Even with transformers installed via pip, the fork does not work for me:

python examples/chat.py -m../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/ -mode deepseek
 -- Model: ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: How can I tell the time in python?

���ııiiiii iii iv i i ii io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io io

The 6.7B model shows very similar behaviour, but most of the time it results in an invisible output loop in the chat example.

I get the same behaviour no matter which prompt format I use (I also tested the Deepseek instruct format).

Maybe I'm just doing something wrong; I'd appreciate any help.

turboderp commented 9 months ago

@SinanAkkoyun The model seems to use a linear RoPE scaling factor of 4. I've been able to get coherent output out of the 1.3B model, at least, using that.

@DOGEwbx The Tokenizers library seems like a more reasonable dependency, especially if it's optional. It largely mirrors Transformers, so it should be possible to adapt it to the code in this PR. There are still a few things I need to sort out and verify, like how control symbols are encoded, optional BOS/EOS tokens, that the vocabulary is preprocessed correctly, how UTF-8 characters are emitted and so on. I'll get to that in a few hours.

It's definitely not a trivial issue. I see over on the llama.cpp repo a whole bunch of people have been working on it for some weeks now.

As for remote code, the issue is that with the option enabled, AutoModel.from_pretrained and apparently also AutoTokenizer.from_pretrained will import and run architectures distributed with models. The architectures are defined as Python code, and with no sandboxing this code has access to your entire userspace. So it can steal browser cookies, take screenshots, read the clipboard, etc. Crucially, users usually don't think of downloading and trying out a new model as downloading and installing software, especially in UIs like text-generation-webui that let you download and run models with a couple of clicks.
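
To make the opt-in nature concrete, here is a minimal sketch of the Transformers behaviour described above (the model id is a placeholder):

    from transformers import AutoTokenizer

    # With trust_remote_code=False, Transformers refuses to execute custom Python
    # code shipped inside the model repo, and raises an error for models that require it.
    tok = AutoTokenizer.from_pretrained("some-org/custom-arch-model",
                                        trust_remote_code=False)

    # Only an explicit opt-in allows that code to run, with full user privileges:
    # tok = AutoTokenizer.from_pretrained("some-org/custom-arch-model",
    #                                     trust_remote_code=True)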

SinanAkkoyun commented 9 months ago

@turboderp Thank youu, 6.7B is working coherently :)

SinanAkkoyun commented 9 months ago

@turboderp However, I can't seem to get 1.3B to output coherent responses. What params did you use?

EXL2 GPTQ:

python examples/chat.py -m../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/ -mode deepseek -rs 4
 -- Model: ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/
 -- Options: ['rope_scale 4.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: How can you tell the time in python but backwards?

In Python 3, there is no builtin function that directly provides a date/time object's creation timestamp (like UNIX_TIMESTAMP or UTCINFO) back towards present as it would if we used `datetime` module from standard library like this :

   from datetime import datetime     #importing necessary libraries         .           print(f"Current Timebackwards {now - timed  (secon
   elta(seconds=1)}")       exit()        else
                                                    ____         ^______      / \              |         |           |
               TIME               0                   6947825                      tic ()                   def now():
      return                                  .....                                nt                           ~                         ~                             ...                           __maintainable..>                                        PEP
   ....                  stderr                       errno                        sys            │
 ├───�                                               �                                           <-- OSError                              EOFError                               ModuleNotFoundError                                   tracebacks for debugging
                 eeeffttsssyyyy yrry rrrty jjjk kkk lll mmnn ssst ddmmm MMDDYYY hhMMSSXZtz AABBCCXXV VVARAARAAAPPPPZZQTTJKLLLAACCUURRNNOOEEEENNNNSRRRLLEEEEEWROOOORWWWMERLFHGDOIAMAEEDAIEMANESANTTATOMTAASIISOTUTULISIONETIMELIBLIOWEITIVIRTHLYDECKSHAKINREIDSMARTADMINISTERTIANTECHNIQUEANDINTERACTIVEFUNCTIONSLACKMELODYSBERGINSONSOCIALSCIENTIFICCOMPLETECOMMONCONTRASTNEWSPIPELINECUSTOMECREATIONPROCESSORDATAVERSIENTIDEFIABLEPREDICTPERCEPTORYOURSPACEUSERSAVESSLHAILTONLINEPARTIALDEREFINEFORCEPSIZEXPARAMERRONEIGHTSUMMARYDISPLOSIONSHELPFAKEIDENTITYKEYERRORVALUESUCURIZEMPTYTOSTRINGFORMATTERUNAVAILVALUEATTRNAMEOUTPUTTYPEJSONBUNDLESINGLETOPATHSYNERRAISEIOFFILEPATHREADFROMCURRENTDIRWRITEFILECREATESESSIONSFMLFLASKJOINTFRAMEGRAPHDATABASECONFIGUREPYTESTCAUSEMOVEALLCAPSUCLUBNETMARKSIGNALEMENTSUBSCRIPTVisualizationDataFrameGROUPPOSITIONDATASTRATEFETYPEINDIAUTHORMACHIEFPROMPTUSERGETHOURLONGTEXTCOLORADDHEADMANAGMENTOFFSETCOHERENCEFERMIEXTENDPRIMARYSOURCEWHICHMODELEDGELOADDBSQLPARSERQLIMITPROPDRUIDOWNREFCOUNTSELECTRETURNCONTENTSIGNPOSTFIXMSGSODDAHLTFULLDEBUGLOGGERMAINSTAPIREGRESSORSUPPORTSTATUSBORDERBOUNDARYCRASHOPTIMIZEEXTENDPASSWORDCHECKPOSTMULTITHREADARGUENCESOLCALLBACKLISTENABLEDFREEFORESPECTORTICKMAXFEATUREDSIMPUTFIRSTROWINDEXCOLLECTPROJECTOBJECTSKILLSTATELOWESTGLBLIGHTMONOCLEARNBASECOMICSCHARTDRILLHTMLTAGCLASSIFICTYPECPythonBaseExceptionFileExistsErr@Python Base Exception Fatal Error In File error

Or is this just due to 4-bit quantization? The bf16 model responds with great answers for its 1.3B size.

turboderp commented 9 months ago

I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.

turboderp commented 9 months ago

There. I rewrote it to use the Tokenizers library instead, as an optional dependency, and it seems to run okay now. It seems to consistently encode and decode the same as a HF AutoTokenizer. Encoding seems to work correctly during quantization as well.
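
A rough sketch of the kind of parity check meant here, assuming a local model directory that contains both a tokenizer.json and the HF tokenizer config (the path and sample strings are made up):

    from tokenizers import Tokenizer
    from transformers import AutoTokenizer

    model_dir = "models/deepseek-coder-1.3b-instruct"
    fast = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
    hf = AutoTokenizer.from_pretrained(model_dir)

    # Compare token ids and round-tripped text for a few samples.
    for s in ["def hello():\n    print('hi')", "多语言 UTF-8 test ✓"]:
        a = fast.encode(s, add_special_tokens=False).ids
        b = hf.encode(s, add_special_tokens=False)
        assert a == b, (s, a, b)
        assert fast.decode(a) == hf.decode(b)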

I also added a workaround for the Tokenizer bug where some added tokens would decode incorrectly. Still need to test it with some of the other models that lack a SentencePiece tokenizer model.
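
For reference, one way such a decode workaround can look (a sketch of the general idea, not the actual commit): splice added-token strings in by id instead of trusting Tokenizer.decode() for them.

    def decode_with_added_tokens(tok, ids, added_token_ids):
        # Decode ordinary ids in runs, but substitute the literal string for any
        # added-token id, since Tokenizer.decode() can mangle some added tokens.
        pieces, run = [], []
        for i in ids:
            if i in added_token_ids:
                if run:
                    pieces.append(tok.decode(run))
                    run = []
                pieces.append(tok.id_to_token(i))
            else:
                run.append(i)
        if run:
            pieces.append(tok.decode(run))
        return "".join(pieces)

The added-token id set could be derived with something like set(tok.get_vocab(with_added_tokens=True).values()) - set(tok.get_vocab(with_added_tokens=False).values()).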

CyberTimon commented 9 months ago

Thank you

SinanAkkoyun commented 9 months ago

I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.

Yes, that's what puzzled me: the FP16 model ran perfectly fine and handled most basic coding tasks easily. It would be cool to have a super fast model capable of basic coding, but perhaps 4-bit is just not enough. I just wanted to make sure it has nothing to do with hyperparameters.

There. I rewrote it to use the Tokenizers library instead

Thank you so much!