instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Generated data has a single quote or \n at the beginning of the sentence #181

Open tsailiming opened 1 month ago

tsailiming commented 1 month ago

Describe the bug

Looking at the file messages_merlinite-7b-lab-Q4_K_M_2024-07-21T05_02_22.jsonl, there are numerous entries like the following:

{"messages": [{"content": "What is the production period of the DeLorean DMC-12?", "role": "user"}, {"content": "'\nAnswer: The DeLorean DMC-12 was manufactured from 1981 to 1983.", "role": "assistant"}], "metadata": "{\"system\": \"You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\"}"} 
{"messages": [{"content": "What is the name of the first American to run for President with an openly socialist platform?", "role": "user"}, {"content": "'\nResponse: Eugene Victor Debs was the first American to run for President with an openly socialist platform.", "role": "assistant"}], "metadata": "{\"system\": \"You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\"}"}

The alignment process tends to cause the model to return a response that starts with a ' followed by a newline, or with a label such as "Answer:" or "Response:".

I am not sure whether this is a bug.
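To gauge how widespread this is in a generated file, a small script along these lines could count the affected records. This is only a sketch: the two artifact shapes it checks for (a stray quote followed by a newline, and a leading "Answer:" or "Response:" label) are assumptions drawn from the two samples above, and `has_artifact` and `count_affected` are hypothetical helper names, not part of instructlab-sdg.

```python
import json
import re

# Assumed artifact shapes, based only on the two sample records above:
#   1. a stray quote followed by a newline, or
#   2. a leading "Answer:" / "Response:" label.
ARTIFACT_RE = re.compile(r"^\s*(?:['\"]\s*\n|(?:Answer|Response)\s*:)")


def has_artifact(content: str) -> bool:
    """Return True if the text starts with one of the assumed artifacts."""
    return ARTIFACT_RE.match(content) is not None


def count_affected(lines):
    """Return (affected, total) over an iterable of JSONL lines."""
    affected = total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if any(
            m.get("role") == "assistant" and has_artifact(m.get("content", ""))
            for m in record.get("messages", [])
        ):
            affected += 1
    return affected, total
```

Passing an open handle to the generated .jsonl file as `lines` returns the number of records with an affected assistant message and the total number of records.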

To Reproduce

Steps to reproduce the behavior:

  1. I am using the following package versions:
    instructlab==0.18.0a4
    instructlab-dolomite==0.1.1
    instructlab-eval==0.1.0
    instructlab-quantize==0.1.0
    instructlab-schema==0.2.0
    instructlab-sdg==0.1.2
    instructlab-training==0.3.0

  2. A sample knowledge file at taxonomy/knowledge/parasol/overview/qna.yaml:

https://raw.githubusercontent.com/gshipley/backToTheFuture/main/qna.yaml

$ ilab -vvv generate --num-instructions 500 --num-cpus 4  \
--model='/home/instruct/.local/share/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf'

Expected behavior

Screenshots

Device Info (please complete the following information):

$ ilab sysinfo
You are using an aliased command, this will be deprecated in a future release. Please consider using `ilab system info` instead
sys.version: 3.11.7 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.24.1.el9_4.x86_64
platform.machine: x86_64
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
instructlab.version: 0.18.0a4
instructlab-dolomite.version: 0.1.1
instructlab-eval.version: 0.1.0
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.2.0
instructlab-sdg.version: 0.1.2
instructlab-training.version: 0.3.0
torch.version: 2.3.1+cu121
torch.backends.cpu.capability: AVX2
torch.version.cuda: 12.1
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current: 0
torch.cuda.0.name: NVIDIA A10G
torch.cuda.0.free: 21.8
torch.cuda.0.total: 22.1
torch.cuda.0.capability: 8.6
llama_cpp_python.version: 0.2.79
llama_cpp_python.supports_gpu_offload: True

Additional context

russellb commented 1 month ago

I moved this issue over to the sdg repo as that’s where the relevant code is.

Note to self / other devs: This is with the “simple” pipeline and the default merlinite model.

bbrowning commented 1 month ago

I believe this is generally resolved by using a larger teacher model, such as mixtral-8x7b instead of merlinite. At least in my case, I saw this with merlinite, but it went away after switching to mixtral. I know that's not a great answer, as mixtral takes a lot more resources to run for inference than merlinite. Perhaps we could optimize the simple pipeline used with merlinite to make this happen less often; that may be worth investigating.
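As a stopgap for anyone who has to stay on merlinite, the generated messages could be cleaned up after the fact. The following is a rough sketch of that idea, not something the sdg pipeline does today: `strip_leading_artifact` and `clean_record` are hypothetical helpers, and the pattern only covers the stray quote plus newline and the "Answer:" / "Response:" labels seen in the samples in this issue.

```python
import re

# Assumed pattern: an optional stray quote followed by a newline, then an
# optional "Answer:" / "Response:" label. A quote that is NOT followed by a
# newline is left alone so legitimate quoted text survives.
LEADING_RE = re.compile(
    r"^\s*(?:['\"]\s*\n\s*)?(?:(?:Answer|Response)\s*:\s*)?"
)


def strip_leading_artifact(content: str) -> str:
    """Remove the assumed leading artifact (and leading whitespace) from a response."""
    return LEADING_RE.sub("", content, count=1)


def clean_record(record: dict) -> dict:
    """Return a copy of a messages record with assistant contents cleaned."""
    cleaned = dict(record)
    cleaned["messages"] = [
        {**m, "content": strip_leading_artifact(m["content"])}
        if m.get("role") == "assistant"
        else m
        for m in record.get("messages", [])
    ]
    return cleaned
```

This only hides the symptom in the output file. It does nothing about why the smaller teacher model emits the prefix in the first place, so it is no substitute for tuning the simple pipeline.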