[Notice] hatespeech-1 dataset and baseline released

bluebrush commented 4 years ago
Added between 17:40 and 17:50.
baseline : https://github.com/AI-RUSH-Operation/NAVER-AI-RUSH/tree/master/hate_speech
thejungwon commented 4 years ago
nsml run -e main.py -m "A good message" -v -d hatespeech-1
It looks like this only works when the -v flag is removed!
bluebrush commented 4 years ago
The full process, from clone through submit.

ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1$ git clone https://github.com/AI-RUSH-Operation/NAVER-AI-RUSH.git
Cloning into 'NAVER-AI-RUSH'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 71 (delta 17), reused 59 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), done.
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1$ cd NAVER-AI-RUSH/
.git/        .github/     hate_speech/ spam/        
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1$ cd NAVER-AI-RUSH/hate_speech/
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1/NAVER-AI-RUSH/hate_speech$ cat README.md 
# Hate speech classification
This directory is for the AI Rush hate comment classification task.
The baseline model is a simple windowed RNN.
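
The README does not define "windowed". One common reading is that the syllable ID sequence is cut into fixed-size overlapping windows, each of which is then fed to the RNN. The sketch below shows only that windowing step under this assumption; `sliding_windows`, `size`, and `stride` are hypothetical names, not taken from `model.py`, and the padding ID 1 follows the special-token list in the Data section.

```python
def sliding_windows(ids, size, stride=1, pad_id=1):
    """Cut a list of token IDs into fixed-size windows.

    Assumption: "windowed" means overlapping fixed-size slices of the input.
    Sequences shorter than `size` are right-padded with `pad_id` (PAD = 1).
    """
    if len(ids) < size:
        ids = ids + [pad_id] * (size - len(ids))
    # One window per start position, stepping by `stride`.
    return [ids[i:i + size] for i in range(0, len(ids) - size + 1, stride)]


windows = sliding_windows([3, 32, 218, 12, 4], size=3)
short = sliding_windows([3, 4], size=3)
```

Check `hate_speech/model.py` for what the baseline actually does before building on this reading.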

## Repository format
`hate_speech/main.py` Defines the training procedure and the nsml.bind functions
`hate_speech/data.py` Defines how the data is loaded
`hate_speech/model.py` Defines the baseline model
`hate_speech/fields.json` Defines the vocab for the data (only for torchtext)

## Run experiment

To train the baseline model, change into the `hate_speech` folder and run
\```
nsml run -e main.py  -m "A good message" -d hatespeech-1
\```

## Metric
The evaluation metric is the [F1 Score](https://en.wikipedia.org/wiki/F1_score).
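
For checking a model locally, a minimal sketch of binary F1 follows. The README does not say which label counts as positive or how the leaderboard aggregates the score, so treating label 1 as the positive class is an assumption, and `f1_score` here is a hypothetical helper, not the evaluation code used by NSML.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not from the dataset).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
```

With these toy labels precision and recall are both 2/3, so the F1 score is also 2/3.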

## Data
Due to privacy concerns, the data is provided already tokenized and numericalized.
- tokenizer
   - Syllable-based tokenizer
      - Morpheme-based tokenizers, [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding), and [wordpiece tokenizers](https://arxiv.org/pdf/1609.08144.pdf) do not work properly on Korean comment data, which is full of deliberate misspellings and neologisms.
   - vocab 
      - The vocab is not released, because it could be used to map the IDs back to the original text.
      - special tokens  
      UNK: 0, PAD: 1, SPACE: 2, BEGIN: 3, EOF: 4

- e.g.   {"syllable_contents": [3, 32, 218, 12, 25, 2, 205, 337, 16, 2, 113, 9, 2, 558, 195, 16, 2, 113, 17, 68, 2, 288, 51, 39, 12, 25, 4], "eval_reply": 0}   
- We hope to see creative approaches despite these harsh constraints.
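
As a rough illustration of working with this format, the sketch below parses the example record above, splits the syllable IDs into words at the SPACE token, and pads to a fixed length. It assumes each sample is a single JSON object as shown; `split_words` and `pad_to` are hypothetical helpers and are not part of `data.py`.

```python
import json

# Special token IDs as listed in the README above.
UNK, PAD, SPACE, BEGIN, EOF = 0, 1, 2, 3, 4

# The example record from the README.
line = (
    '{"syllable_contents": [3, 32, 218, 12, 25, 2, 205, 337, 16, 2, 113, 9, 2, '
    '558, 195, 16, 2, 113, 17, 68, 2, 288, 51, 39, 12, 25, 4], "eval_reply": 0}'
)


def split_words(ids):
    """Drop BEGIN/EOF/PAD and split the remaining syllable IDs into words at SPACE."""
    words, current = [], []
    for i in ids:
        if i in (BEGIN, EOF, PAD):
            continue
        if i == SPACE:
            if current:
                words.append(current)
                current = []
        else:
            current.append(i)
    if current:
        words.append(current)
    return words


def pad_to(ids, length):
    """Truncate or right-pad with PAD so every sample has the same length."""
    return ids[:length] + [PAD] * max(0, length - len(ids))


record = json.loads(line)
ids = record["syllable_contents"]   # 27 syllable IDs, including BEGIN and EOF
label = record["eval_reply"]        # target label for this comment
words = split_words(ids)            # 6 words, each a list of syllable IDs
padded = pad_to(ids, 32)            # fixed-length input for batching
```

Since the vocab is withheld, word boundaries recovered from SPACE are about the only linguistic structure still available, which is why the split is shown here.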

### Format
See AI Rush dataset documentation.

ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1/NAVER-AI-RUSH/hate_speech$ nsml run -e main.py  -m "A good message" -d hatespeech-1
INFO[2020/07/15 18:03:59.459] .nsmlignore check - start                    
INFO[2020/07/15 18:03:59.459] .nsmlignore check - done                     
INFO[2020/07/15 18:03:59.492] file integrity check - start                 
INFO[2020/07/15 18:03:59.493] file integrity check - done                  
INFO[2020/07/15 18:03:59.493] .nsmlignore 20 B - start                     
INFO[2020/07/15 18:03:59.503] .nsmlignore 20 B - done (1/7 14.29%) (20 B/12 KiB 0.16%) 
INFO[2020/07/15 18:03:59.503] README.md 1.5 KiB - start                    
INFO[2020/07/15 18:03:59.503] README.md 1.5 KiB - done (2/7 28.57%) (1.6 KiB/12 KiB 12.56%) 
INFO[2020/07/15 18:03:59.503] data.py 2.6 KiB - start                      
INFO[2020/07/15 18:03:59.503] data.py 2.6 KiB - done (3/7 42.86%) (4.2 KiB/12 KiB 33.59%) 
INFO[2020/07/15 18:03:59.503] fields.json 526 B - start                    
INFO[2020/07/15 18:03:59.503] fields.json 526 B - done (4/7 57.14%) (4.7 KiB/12 KiB 37.74%) 
INFO[2020/07/15 18:03:59.503] main.py 5.7 KiB - start                      
INFO[2020/07/15 18:03:59.503] main.py 5.7 KiB - done (5/7 71.43%) (10 KiB/12 KiB 83.98%) 
INFO[2020/07/15 18:03:59.503] model.py 1.6 KiB - start                     
INFO[2020/07/15 18:03:59.503] model.py 1.6 KiB - done (6/7 85.71%) (12 KiB/12 KiB 96.75%) 
INFO[2020/07/15 18:03:59.503] setup.py 412 B - start                       
INFO[2020/07/15 18:03:59.503] setup.py 412 B - done (7/7 100.00%) (12 KiB/12 KiB 100.00%) 
......
Building docker image. It may take a while
.......
Session bluebrush/hatespeech-1/10 is started
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1/NAVER-AI-RUSH/hate_speech$ nsml ps
Name                       Created        Args    Status    Summary    Description     # of Models    Size       Type
-------------------------  -------------  ------  --------  ---------  --------------  -------------  ---------  ------
bluebrush/hatespeech-1/10  seconds ago            Running              A good message  0              0          normal
bluebrush/hatespeech-1/6   9 minutes ago          Running              A good message  15             201.82 MB  normal
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1/NAVER-AI-RUSH/hate_speech$ nsml model ls bluebrush/hatespeech-1/10
Checkpoint    Last Modified    Elapsed    Summary            Size
------------  ---------------  ---------  -----------------  --------
0             4 minutes ago    0.000      number_of_files=1  13.45 MB
1             4 minutes ago    38.104     number_of_files=1  13.45 MB
2             3 minutes ago    38.282     number_of_files=1  13.45 MB
3             3 minutes ago    38.408     number_of_files=1  13.45 MB
4             2 minutes ago    36.556     number_of_files=1  13.45 MB
5             2 minutes ago    36.732     number_of_files=1  13.45 MB
6             a minute ago     36.894     number_of_files=1  13.45 MB
7             seconds ago      36.654     number_of_files=1  13.45 MB
ubuntu16@ubuntu16-VirtualBox:~/airushdemo/src/NAVER-AI-RUSH/demo-hatespeech-1/NAVER-AI-RUSH/hate_speech$ nsml submit bluebrush/hatespeech-1/10 6
.......
Building docker image. It may take a while
...........load nsml model takes 2.0793983936309814 seconds
.Infer test set. The inference should be completed within 3600 seconds.
.Infer test set takes 11.140313148498535 seconds
...
Score: 0.9137414965986393
Done
naver-airush / NAVER-AI-RUSH

[Notice] hatespeech-1 dataset and baseline released #34