stochasticai / xTuring

Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
https://xturing.stochastic.ai
Apache License 2.0

LLaMA LoRA INT4 generates empty strings even after finetuning #167

Closed · shreyansh26 closed this issue 1 year ago

shreyansh26 commented 1 year ago

Referencing #161 again.

I tried finetuning the model on the Alpaca data. The script I used is below.

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
import sys

# Initializes the model
model = BaseModel.create("llama_lora_int4")

finetuning_config = model.finetuning_config()

print(finetuning_config)

finetuning_config.batch_size = 2
finetuning_config.gradient_accumulation_steps = 2
finetuning_config.num_train_epochs = 1
finetuning_config.learning_rate = 1e-5
finetuning_config.weight_decay = 0.01
finetuning_config.optimizer_name = "adamw"

print(model.finetuning_config())

instruction_dataset = InstructionDataset("data/alpaca_data")

model.finetune(dataset=instruction_dataset)
# Save the model
model.save("./llama_weights_alpaca")

Now, when I run inference on the saved model using the script below,

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
from datasets import load_from_disk

dataset = load_from_disk('data/alpaca_data')
model = BaseModel.load('llama_weights_alpaca')
output = model.generate(texts=["Describe the structure of an atom."])
print("Generated output by the model: {}".format(output))

it again outputs an empty string (see attached screenshot).
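One sanity check worth doing when generations come back empty is the generation settings themselves (for example, a max_new_tokens of 0). A minimal sketch, assuming the model exposes a generation_config() analogous to the finetuning_config() used above; that method name is an assumption here:

from xturing.models import BaseModel

# Load the finetuned weights and inspect the generation settings before blaming them.
# generation_config() is assumed to mirror finetuning_config(); adjust if the API differs.
model = BaseModel.load("./llama_weights_alpaca")

generation_config = model.generation_config()
print(generation_config)  # e.g. check that max_new_tokens is greater than 0

output = model.generate(texts=["Describe the structure of an atom."])
print("Generated output by the model: {}".format(output))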

cnbeining commented 1 year ago

It seems the loss does not really converge with INT4 fine-tuning: 12% into the data, but the loss remains at 10.40?

shreyansh26 commented 1 year ago

That's an issue as well. I had raised it here: https://github.com/stochasticai/xturing/issues/162

StochasticRomanAgeev commented 1 year ago

Hi @shreyansh26, can you share your GPU model and system specs? We will try to resolve this issue ASAP.

cnbeining commented 1 year ago

@StochasticRomanAgeev For me -

Platform: Paperspace Gradient
CPU: 8 cores
RAM: 45 GB
GPU: A4000 16 GB

shreyansh26 commented 1 year ago

I am using an n1-standard-8 VM on GCP with a single T4 16 GB.

StochasticRomanAgeev commented 1 year ago

Are you using the latest version of our library?

shreyansh26 commented 1 year ago

I think so, yes. My version is 0.1.1.
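A quick way to confirm the installed version from Python, using only the standard library (importlib.metadata is available in Python 3.8+):

# Minimal sketch to confirm which xturing version is installed in this environment.
from importlib.metadata import version, PackageNotFoundError

try:
    print("xturing version:", version("xturing"))
except PackageNotFoundError:
    print("xturing is not installed in this environment")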

cnbeining commented 1 year ago

Same here.


cloudcoder2 commented 1 year ago

The same happened to me. I tried with a single RTX 4090 GPU on vast.ai.

s0yabean commented 1 year ago

I'm using Google Colab Pro with an NVIDIA A100-SXM4-40GB, and I got this issue with llama_lora_int8.

sarthaklangde commented 1 year ago

@shreyansh26 We just released v0.1.2, which should fix these issues. Can you please try it and let us know?

joaopalotti commented 1 year ago

I had the same problem everybody reported with v0.1.1. The new version, v0.1.2, solved it. Thanks!

shreyansh26 commented 1 year ago

Hey @sarthaklangde, thank you. After upgrading, I no longer see empty strings. The loss values are also no longer constant, although they fluctuate heavily, which may be due to the hyperparameters; that's okay. But there is a different issue I am experiencing.

I finetuned llama_lora_int4 on Alpaca for just one epoch as a test. The script is below.

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
import sys

# Initializes the model
model = BaseModel.create("llama_lora_int4")

finetuning_config = model.finetuning_config()

print(finetuning_config)

finetuning_config.batch_size = 4
finetuning_config.gradient_accumulation_steps = 2
finetuning_config.num_train_epochs = 1
finetuning_config.learning_rate = 1e-7
finetuning_config.weight_decay = 0.01
finetuning_config.optimizer_name = "adamw"

print(model.finetuning_config())

instruction_dataset = InstructionDataset("data/alpaca_data")

model.finetune(dataset=instruction_dataset)
# Save the model
model.save("./llama_weights_alpaca")

However, when testing, the original model and the finetuned model both return the same strings. I have tried multiple inputs.

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
from datasets import load_from_disk
from tqdm import tqdm
import gc

model = BaseModel.create('llama_lora_int4')
output = model.generate(texts=["Explain Newton's laws of motion."])
print("Generated output by the model: {}".format(output))

del model
gc.collect()

model = BaseModel.load('./llama_weights_alpaca')
output = model.generate(texts=["Explain Newton's laws of motion."])
print("Generated output by the model: {}".format(output))

The output is shown in the attached screenshot.

I will do another finetuning run with the default learning rate to check whether the issue depends on the learning rate. Maybe it is a trivial issue; I'll update this thread.

shreyansh26 commented 1 year ago

I did another run, this time with the default learning rate of 1e-4. The loss curve looked better, but during inference I still see the same issue: the output from the finetuned model is identical to the one from the base llama_lora_int4.

These were my hyperparameters.

learning_rate=0.0001 
gradient_accumulation_steps=1 
batch_size=8 
weight_decay=0.01 
warmup_steps=50 
eval_steps=5000 
save_steps=5000 
max_length=256 
num_train_epochs=3 
logging_steps=10 
max_grad_norm=2.0 
save_total_limit=4 
optimizer_name='adamw' 
output_dir='saved_model'

StochasticRomanAgeev commented 1 year ago

Hi @shreyansh26, how many num_train_epochs did you set during training?

shreyansh26 commented 1 year ago

One epoch. But the loss went down from about 10.4 to 0.9, so the model was learning something; probably something is wrong with the loading or saving part. Not sure.
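A quick way to narrow this down is to inspect what model.save() wrote. A minimal sketch, assuming (consistent with the traceback later in this thread) that the saved directory contains a pytorch_model.bin and that the LoRA tensors have "lora" in their names; both are assumptions:

import torch

# Load the saved checkpoint on CPU and look for the LoRA adapter tensors.
state_dict = torch.load("./llama_weights_alpaca/pytorch_model.bin", map_location="cpu")

lora_keys = [k for k in state_dict if "lora" in k.lower()]
print(f"{len(state_dict)} tensors saved, {len(lora_keys)} look like LoRA weights")

# An all-zero or near-zero adapter would explain unchanged generations.
for k in lora_keys[:10]:
    t = state_dict[k]
    print(k, tuple(t.shape), f"norm={t.float().norm().item():.4f}")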

StochasticRomanAgeev commented 1 year ago

Hi @shreyansh26, we are testing this. We need to run the finetuning loop from scratch to check whether it is a saving problem.

StochasticRomanAgeev commented 1 year ago

Hi @shreyansh26, we released a fix for this issue. You should also consider a smaller weight_decay to see changes in generation. Please run pip install xturing -U.

shreyansh26 commented 1 year ago

Hey @StochasticRomanAgeev, did you test the new changes at your end as well? I ask because I am now experiencing something very different.

I reduced the weight decay as you suggested; this is my training script now:

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
import sys

# Initializes the model
model = BaseModel.create("llama_lora_int4")

finetuning_config = model.finetuning_config()

print(finetuning_config)

finetuning_config.batch_size = 4
finetuning_config.gradient_accumulation_steps = 2
finetuning_config.num_train_epochs = 1
finetuning_config.weight_decay = 0.0001
finetuning_config.optimizer_name = "adamw"

print(model.finetuning_config())

instruction_dataset = InstructionDataset("../data/alpaca_data")

model.finetune(dataset=instruction_dataset)
# Save the model
model.save("./llama_weights_alpaca")

And the same test script:

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
from datasets import load_from_disk
from tqdm import tqdm
import gc

model = BaseModel.create('llama_lora_int4')
output = model.generate(texts=["Explain Newton's laws of motion."])
print("Generated output by the model: {}".format(output))

del model
gc.collect()

model = BaseModel.load('./llama_weights_alpaca')
output = model.generate(texts=["Explain Newton's laws of motion."])
print("Generated output by the model: {}".format(output))

The output from the new model seems to be garbage values (see attached screenshot).

This is the case with the other prompts I tried as well. The outputs are not identical to each other, but they are still a combination of '1', '\n', and a few letters in between.

The loss decreased similarly to last time, so some training did happen, but I am not sure why the outputs look like this.

codeloki15 commented 1 year ago

Just a question: can I train a model with a Kaggle GPU? As soon as the training starts I get the error below. Can anyone take a look and guide me?

Found 2 unique N values.
Warming up autotune cache ...   0%| | 0/12 [00:00<?, ?it/s]
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
  0%| | 0/12 [00:00<?, ?it/s]

KeyError                                  Traceback (most recent call last)
File <string>:21, in matmul_248_kernel(a_ptr, b_ptr, c_ptr, scales_ptr, zeros_ptr, g_ptr, M, N, K, bits, maxq, ...)

KeyError: ('2-.-0-.-0-ccf5c87a2f99551e7274ef3e5605df6e-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-3d2aedeb40d6d81c66a42791e268f98b-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (256, 64, 32, 8), (True, True, True, True, True, True, (False, True), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

CalledProcessError                        Traceback (most recent call last)
Cell In[9], line 6
----> 6 model = BaseModel.create("llama_lora_int4")

File /opt/conda/lib/python3.10/site-packages/xturing/registry.py:14, in BaseParent.create
File /opt/conda/lib/python3.10/site-packages/xturing/models/llama.py:72, in LlamaLoraInt4.__init__
File /opt/conda/lib/python3.10/site-packages/xturing/models/causal.py:215, in CausalLoraInt8Model.__init__
File /opt/conda/lib/python3.10/site-packages/xturing/models/causal.py:193, in CausalLoraModel.__init__
File /opt/conda/lib/python3.10/site-packages/xturing/models/causal.py:28, in CausalModel.__init__
File /opt/conda/lib/python3.10/site-packages/xturing/registry.py:14, in BaseParent.create
File /opt/conda/lib/python3.10/site-packages/xturing/engines/llama_engine.py:166, in LlamaLoraInt4Engine.__init__
--> 166 autotune_warmup(model)
File /opt/conda/lib/python3.10/site-packages/xturing/engines/quant_utils/quant.py:862, in autotune_warmup
--> 862 matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
File /opt/conda/lib/python3.10/site-packages/xturing/engines/quant_utils/quant.py:635, in matmul248
--> 635 matmul_248_kernel[grid](...)
File /opt/conda/lib/python3.10/site-packages/xturing/engines/quant_utils/custom_autotune.py:89, in Autotuner.run
File /opt/conda/lib/python3.10/site-packages/xturing/engines/quant_utils/custom_autotune.py:71, in Autotuner._bench
--> 71 return triton.testing.do_bench(kernel_call, rep=40)
File /opt/conda/lib/python3.10/site-packages/triton/testing.py:143, in do_bench
--> 143 fn()
File /opt/conda/lib/python3.10/site-packages/xturing/engines/quant_utils/custom_autotune.py:67, in Autotuner._bench.<locals>.kernel_call
File <string>:41, in matmul_248_kernel
File /opt/conda/lib/python3.10/site-packages/triton/compiler.py:1588, in compile
--> 1588 so_path = make_stub(name, signature, constants)
File /opt/conda/lib/python3.10/site-packages/triton/compiler.py:1477, in make_stub
--> 1477 so = _build(name, src_path, tmpdir)
File /opt/conda/lib/python3.10/site-packages/triton/compiler.py:1392, in _build
--> 1392 ret = subprocess.check_call(cc_cmd)
File /opt/conda/lib/python3.10/subprocess.py:369, in check_call
--> 369 raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpk73auo4o/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/include/python3.10', '-I/tmp/tmpk73auo4o', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpk73auo4o/matmul_248_kernel.cpython-310-x86_64-linux-gnu.so']' returned non-zero exit status 1.
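The relevant line is "/usr/bin/ld: cannot find -lcuda": Triton compiles a small C stub and links it with -lcuda, so the linker needs a libcuda.so on its search path. A minimal diagnostic sketch; the paths below are guesses for a Kaggle/conda environment, not something xturing provides:

# Diagnostic sketch for "/usr/bin/ld: cannot find -lcuda".
import ctypes.util
import glob
import os

# Can the dynamic loader see libcuda at all?
print("loader sees libcuda:", ctypes.util.find_library("cuda"))

# Common locations where the driver library or its stub might live (guesses).
candidates = []
for pattern in (
    "/usr/lib/x86_64-linux-gnu/libcuda.so*",
    "/usr/local/cuda/lib64/stubs/libcuda.so*",
    "/usr/local/nvidia/lib64/libcuda.so*",
):
    candidates += glob.glob(pattern)
print("libcuda candidates:", candidates)

# If only libcuda.so.1 is present, pointing the compile-time linker at its directory
# before importing xturing is one possible workaround (path is a guess, not verified):
# os.environ["LIBRARY_PATH"] = "/usr/lib/x86_64-linux-gnu"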

codeloki15 commented 1 year ago

This is the same "/usr/bin/ld: cannot find -lcuda" error and CalledProcessError as in my previous comment. The training script I am using is below.

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel
import sys

# Initializes the model
model = BaseModel.create("llama_lora_int4")

finetuning_config = model.finetuning_config()

print(finetuning_config)

finetuning_config.batch_size = 4
finetuning_config.gradient_accumulation_steps = 2
finetuning_config.num_train_epochs = 1
finetuning_config.learning_rate = 1e-7
finetuning_config.weight_decay = 0.01
finetuning_config.optimizer_name = "adamw"

print(model.finetuning_config())

instruction_dataset = InstructionDataset("/kaggle/working/alpaca_data")

model.finetune(dataset=instruction_dataset)

# Save the model
model.save("/kaggle/working/llama_weights_alpaca")

gredin commented 1 year ago

The output from the new model is some garbage values it seems

This is the case with other prompts I tried as well. The outputs are not the same but still a combination of '1', '\n' and some alphabets in between.

The loss did decrease similarly as last time, so some training did happen, but not sure why the outputs are like this.

I can confirm the description given by shreyansh26. Training on 10k input/output pairs (3 epochs, about two and a half hours on an RTX 4070) showed a fast loss decrease but inconsistent, weird outputs in the end. I used the latest version of xturing (by the way, I had to upgrade numpy and click to make it work).

StochasticRomanAgeev commented 1 year ago

Hi again, @gredin @cloudcoder2. We will replace the INT4 version with a newer implementation in some time. For now, you can try the INT8 version with LoRA.
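For reference, switching to the INT8 LoRA model only changes the model key in the same flow used in the scripts earlier in this thread; a minimal sketch (the output path is a placeholder):

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

# Same flow as the INT4 scripts above, just with the INT8 LoRA model key.
model = BaseModel.create("llama_lora_int8")

instruction_dataset = InstructionDataset("data/alpaca_data")
model.finetune(dataset=instruction_dataset)

# Save, reload, and generate to verify the finetuned weights change the output.
model.save("./llama_int8_weights_alpaca")
model = BaseModel.load("./llama_int8_weights_alpaca")
print(model.generate(texts=["Describe the structure of an atom."]))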

tushar2407 commented 1 year ago

Hi @gredin @cloudcoder2, we have added INT4 with the newer implementation in the latest update.
Hope it helps!