Quality issue when used with KoGPT3

BangDaeng commented 2 years ago

Hello. To check whether GPTJForCausalLM models are supported, I have been testing inference on KoGPT3 with the parallelformers library.

The code I ran is below.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM 
from parallelformers import parallelize

tokenizer = AutoTokenizer.from_pretrained(
  'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
  bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
)
model = AutoModelForCausalLM.from_pretrained(
  'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
  pad_token_id=tokenizer.eos_token_id,
  torch_dtype='auto'
)

parallelize(model, num_gpus=2, fp16=True, verbose='detail')

prompt = '''[공부, 학생, 힘들] => 힘들더라도 학생의 본분은 공부입니다
[시작, 떨림, 긴장] => 새로운 시작은 항상 떨리고 긴장되죠 파이팅!!
[방어, 제철, 겨울] => 겨울에는 방어가 제철이죠 방어회 어떠세요?
[겸손, 인생, 변화] => 인생은 어떻게 변할지 몰라요 항상 겸손한 태도를 갖춰야해요
[학교, 선생님, 은혜] => 학창시절 선생님의 은혜를 잊지 못해요 감사합니다.
[입사, 회사, 신입] =>'''

temperature = 0.8
max_length = 140
batch_size = 5

inputs = tokenizer([prompt]*batch_size, return_tensors="pt")
## when passing **inputs
gen_tokens = model.generate(**inputs, do_sample=True, temperature=temperature, max_length=max_length)
## when passing only input_ids and attention_mask
## gen_tokens = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, do_sample=True, temperature=temperature, max_length=max_length)
generated = tokenizer.batch_decode(gen_tokens)

The output is as follows.

Without parallelformers
With parallelformers (passing **inputs)
With parallelformers (passing only input_ids and attention_mask)

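As shown above, quality sometimes drops once the model is wrapped with parallelformers (the output is not even grammatical). I am opening this issue to ask whether I am using the library incorrectly or whether GPT3 is simply not supported :)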

hyunwoongko commented 2 years ago

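First, nothing in parallelformers should noticeably affect model quality. https://github.com/tunib-ai/parallelformers/blob/main/tests/gptj/test_gptj_for_causal_lm.sh For GPTJ, I tested with the file above and confirmed that the results are the same before and after parallelization. The model in that test is randomly initialized, so the output looks like nonsense, but it should be identical before and after parallelization. Could you try running it?

In the results below, the decimal values of the representations differ very slightly, but this is common with distributed communication and has almost no effect on the generated output.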
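As a rough illustration of why such last-digit differences appear, the sketch below sums the same three numbers in two different orders. It assumes that the cross-device reduction changes the order in which partial results are added, which is a common source of these differences; the values and the two-device split are hypothetical and are not taken from the parallelformers code.

```python
import math

# Three partial contributions to a single activation (hypothetical values).
parts = [0.1, 0.2, 0.3]

# Single-device order: accumulate left to right.
single_device = (parts[0] + parts[1]) + parts[2]

# Hypothetical two-device split: device 0 holds parts[0], device 1 holds
# parts[1:] and sums them locally before the cross-device reduction.
device0_partial = parts[0]
device1_partial = parts[1] + parts[2]
all_reduced = device0_partial + device1_partial

print(single_device)                  # 0.6000000000000001
print(all_reduced)                    # 0.6
print(single_device == all_reduced)   # False
print(math.isclose(single_device, all_reduced, rel_tol=1e-12))  # True
```

Floating-point addition is not associative, so a tolerance-based comparison is a safer equality check than `==` when comparing outputs before and after parallelization.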

hyunwoongko commented 2 years ago

root@58c512eec087:/home/kevin/parallelformers/tests/gptj# sh ./test_gptj_for_causal_lm.sh
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 922/922 [00:00<00:00, 875kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████| 325M/325M [00:17<00:00, 19.3MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 611/611 [00:00<00:00, 599kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████| 779k/779k [00:01<00:00, 798kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████| 446k/446k [00:00<00:00, 466kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████| 1.31M/1.31M [00:01<00:00, 1.09MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████| 3.94k/3.94k [00:00<00:00, 3.43MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 357/357 [00:00<00:00, 323kB/s]
Test Name: [GPTJForCausalLM]-[FP32 & Non-PF]

GPU 0 alloc: 340981760
GPU 0 cached: 350224384

GPU 1 alloc: 0
GPU 1 cached: 0

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
generate: Kevin is Definitivetsy Definitiveendaional sample�ional Ka Marxist Roses Dareelin belonged Katrina ToddAND bree angles 1400 sack sack sack sacklivithmetic coincidedة Wonders Buddha manoeuv meetingndum undermining Witches believe<|extratoken_136|> Kaplan

forward: tensor([[[ 0.1824, -0.1878, -0.1682,  ...,  0.4524,  0.0738, -0.6057],
         [ 0.1034,  0.2066, -0.2348,  ..., -0.2497,  0.1130, -0.8275]]],
       device='cuda:0')

=========================================================
Test Name: [GPTJForCausalLM]-[FP16 & Non-PF]

GPU 0 alloc: 188909056
GPU 0 cached: 207618048

GPU 1 alloc: 0
GPU 1 cached: 0

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
generate: Kevin is Definitivetsy Definitiveendaional sample�ional Ka Marxist Roses Dareelin belonged Katrina ToddAND bree angles 1400 sack sack sack sacklivithmetic coincidedة Wonders Buddha manoeuv meetingndum undermining Witches believe<|extratoken_136|> Kaplan

forward: tensor([[[ 0.1820, -0.1875, -0.1675,  ...,  0.4519,  0.0739, -0.6069],
         [ 0.1031,  0.2068, -0.2343,  ..., -0.2499,  0.1136, -0.8281]]],
       device='cuda:0')

=========================================================
Test Name: [GPTJForCausalLM]-[FP32 & PF]

GPU 0 alloc: 290616320
GPU 0 cached: 312475648

GPU 1 alloc: 290616320
GPU 1 cached: 312475648

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
generate: Kevin is Definitivetsy Definitiveendaional sample�ional Ka Marxist Roses Dareelin belonged Katrina ToddAND bree angles 1400 sack sack sack sacklivithmetic coincidedة Wonders Buddha manoeuv meetingndum undermining Witches believe<|extratoken_136|> Kaplan

forward: tensor([[[ 0.1824, -0.1878, -0.1682,  ...,  0.4524,  0.0738, -0.6057],
         [ 0.1034,  0.2066, -0.2348,  ..., -0.2497,  0.1129, -0.8275]]],
       device='cuda:0')

=========================================================
Test Name: [GPTJForCausalLM]-[FP16 & PF]

GPU 0 alloc: 163725824
GPU 0 cached: 174063616

GPU 1 alloc: 163725824
GPU 1 cached: 174063616

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
generate: Kevin is Definitivetsy Definitiveendaional sample�ional Ka Marxist Roses Dareelin belonged Katrina ToddAND bree angles 1400 sack sack sack sacklivithmetic coincidedة Wonders Buddha manoeuv meetingndum undermining Witches believe<|extratoken_136|> Kaplan

forward: tensor([[[ 0.1820, -0.1863, -0.1683,  ...,  0.4526,  0.0747, -0.6060],
         [ 0.1039,  0.2080, -0.2352,  ..., -0.2502,  0.1141, -0.8281]]],
       device='cuda:0')

=========================================================

hyunwoongko commented 2 years ago

It is hard to see why passing **inputs and passing input_ids and attention_mask would give such starkly different results. Could you print the contents of **inputs and check what they look like?

hyunwoongko commented 2 years ago

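If you have time, it would also be worth testing with EleutherAI's GPTJ instead of the Kakao Brain model. Before we can fix this, we first need to determine whether the problem depends on the model or lies in the Parallelformers GPTJ code itself.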

BangDaeng commented 2 years ago

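> It is hard to see why passing **inputs and passing input_ids and attention_mask would give such starkly different results. Could you print the contents of **inputs and check what they look like?

With the KoGPT3 tokenizer: [screenshot, 2021-12-27 5:37:03 PM]

With the EleutherAI tokenizer: [screenshot, 2021-12-27 5:50:55 PM]

The KoGPT3 tokenizer additionally returns token_type_ids, which probably acted as noise and caused the stark difference. Passing only input_ids and attention_mask seems to be the right approach.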
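The token_type_ids finding above can be sketched as a small filter applied to the tokenizer output before it is unpacked into generate(). This is a minimal sketch: `select_generate_inputs` and `GENERATE_KEYS` are hypothetical names, and a plain dict stands in for the tokenizer output so the snippet runs without downloading a model.

```python
from typing import Any, Dict, Mapping

# Keys forwarded to generate() in this sketch (an assumption based on the thread).
GENERATE_KEYS = ("input_ids", "attention_mask")


def select_generate_inputs(encoded: Mapping[str, Any]) -> Dict[str, Any]:
    """Keep only the tokenizer outputs that should be passed to generate()."""
    return {key: encoded[key] for key in GENERATE_KEYS if key in encoded}


# Stand-in for a tokenizer output; real values would be tensors.
encoded = {
    "input_ids": [[101, 7, 8]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

filtered = select_generate_inputs(encoded)
print(sorted(filtered))  # ['attention_mask', 'input_ids']
```

With the script from the first comment, the call would then look like `model.generate(**select_generate_inputs(inputs), do_sample=True, ...)`. Passing `return_token_type_ids=False` when calling the tokenizer should have the same effect, although that was not tested in this thread.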

BangDaeng commented 2 years ago

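> If you have time, it would also be worth testing with EleutherAI's GPTJ instead of the Kakao Brain model. Before we can fix this, we first need to determine whether the problem depends on the model or lies in the Parallelformers GPTJ code itself.

When I tested with EleutherAI, I confirmed that the results are the same before and after applying Parallelformers.

With the Kakao Brain model, the results are also the same before and after applying Parallelformers: [screenshot, 2021-12-27 6:09:10 PM]

However, quality seems to depend on the value of the do_sample parameter (the decoding method). When decoding with do_sample enabled, the output after applying Parallelformers degrades to the point of being ungrammatical. My guess is that something in the sampling path is causing the degradation.

Before applying Parallelformers
After applying Parallelformers

When I ran inference with the oslo library through a distributed launcher, there was no such difference in grammatical quality even when generating with do_sample=True (it produced smooth sentences). Thank you.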
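The do_sample observation above contrasts two decoding modes. The sketch below is a toy version of both, assuming the usual definitions: greedy decoding takes the arg-max of the logits and needs no random numbers, while sampling draws from a temperature-scaled softmax and therefore depends on the random number generator. The function names and logits are hypothetical; only the temperature of 0.8 comes from the script in this thread.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Roughly do_sample=False: always the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Roughly do_sample=True: draw one index from the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]  # toy next-token scores

# Greedy decoding ignores the RNG, so repeated calls always agree.
print(greedy_pick(logits), greedy_pick(logits))  # 0 0

# Sampling depends on the RNG state, so different seeds may pick different tokens.
print([sample_pick(logits, 0.8, random.Random(seed)) for seed in range(5)])
```

This is consistent with the report that outputs matched with do_sample disabled and diverged only with it enabled, since only the sampling path consumes random numbers.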

hyunwoongko commented 2 years ago

Thanks for testing. Do you mean that EleutherAI GPTJ was fine with do_sample enabled? If the EleutherAI model also has the problem with do_sample, then parallelformers is probably doing something wrong.

BangDaeng commented 2 years ago

> Thanks for testing. Do you mean that EleutherAI GPTJ was fine with do_sample enabled? If the EleutherAI model also has the problem with do_sample, then parallelformers is probably doing something wrong.

Yes, the EleutherAI model also produces slightly broken grammar when do_sample is used, as shown below. Sharing the test results.

(1) EleutherAI model, do_sample disabled

Before using parallelformers
After using parallelformers

(2) EleutherAI model, do_sample enabled

Before using parallelformers
After using parallelformers

hyunwoongko commented 2 years ago

Thanks for testing. I will look into this behavior.

hyunwoongko commented 2 years ago

@BangDaeng Hello. This issue was caused by each process having a different seed value. To use a specific seed, pass it as parallelize(..., seed=seed). If no seed is set, all processes now share the same seed, derived from the current time. https://github.com/tunib-ai/parallelformers/blob/main/parallelformers/parallel/process.py#L152

Could you update with pip3 install parallelformers --upgrade? (Version 1.2.2 is the one to install.)
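The seed explanation above can be illustrated with a toy simulation. It assumes that every process runs the same sampling loop over the same distribution, so processes seeded differently may pick different tokens at some step and then continue from different prefixes. `sample_on_rank`, `ranks_agree`, the vocabulary, and the weights are all hypothetical; none of this is parallelformers code.

```python
import random
from typing import List, Sequence

VOCAB = ["a", "b", "c", "d", "e"]        # toy vocabulary
WEIGHTS = [0.4, 0.3, 0.15, 0.1, 0.05]    # same distribution on every rank


def sample_on_rank(seed: int, steps: int = 8) -> List[str]:
    """Simulate one process sampling `steps` tokens with its own RNG."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=WEIGHTS, k=steps)


def ranks_agree(seeds: Sequence[int]) -> bool:
    """True when every simulated rank sampled exactly the same sequence."""
    sequences = [sample_on_rank(seed) for seed in seeds]
    return all(seq == sequences[0] for seq in sequences)


# Same seed on both ranks: the sampled sequences are identical.
print(ranks_agree([1234, 1234]))  # True

# Different seeds: the ranks are free to disagree (output not shown here).
print(sample_on_rank(1))
print(sample_on_rank(2))
```

In the real library the corresponding setting is the seed argument described in the comment above, for example adding seed=42 to the parallelize call from the first comment.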

BangDaeng commented 2 years ago

Hello. I ran several models several times and confirmed that quality is back to normal. Thank you for working on this in the early hours~!! :)

hyunwoongko commented 2 years ago

OK, I will close the issue~

hyunwoongko commented 2 years ago

@BangDaeng https://github.com/tunib-ai/oslo/releases/tag/v1.1 Deployment support has now been added to OSLO. It improves on parallelformers in several ways, including embedding-layer parallelization and kernel fusion, so please give it a try ^^

tunib-ai / parallelformers

Quality issue when used with KoGPT3 #17