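Problem

Given a text as input, write a program that counts its words. "Character counting" and "word counting" are arguably the problems that show whether someone has made it through an introduction to programming. Many of you could probably write this with your toes by now, but beginners find it surprisingly difficult. Let's bring back some memories.

Input

Create a text file with the content below in advance; when the program runs, it reads the file's contents (source: Wikipedia).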

As the country became embroiled in a domestic crisis, the first government was dislodged and succeeded by several different administrations. Bolikango served as Deputy Prime Minister in one of the new governments before a partial state of stability was reestablished in 1961. He mediated between warring factions in the Congo and briefly served once again as Deputy Prime Minister in 1962 before returning to the parliamentary opposition. After Joseph-Desire Mobutu took power in 1965, Bolikango became a minister in his government. Mobutu soon dismissed him but appointed him to the political bureau of the Mouvement Populaire de la Revolution. Bolikango left the bureau in 1970. He left Parliament in 1975 and died seven years later. His grandson created the Jean Bolikango Foundation in his memory to promote social progress. The President of the Congo posthumously awarded Bolikango a medal in 2005 for his long career in public service.

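Output

The separators are the period '.', the comma ',', and the space ' '. Print the 10 most frequent words together with their frequencies, most frequent first. The order among words with the same frequency does not matter.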

in 12
the 10
Bolikango 5
a 4
of 4
and 3
to 3
his 3
became 2
government 2
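
One way to read the output rule can be sketched as follows. This is an illustration under stated assumptions, not the reference solution: the helper name `top_words` and the sample sentence are made up, and the split uses exactly the three separators listed above.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words, splitting on '.', ',' and ' '."""
    # A run of separators counts as one split point; empty strings are dropped.
    # Newlines are not in the separator list, so they are not split here.
    words = [w for w in re.split(r"[., ]+", text) if w]
    return Counter(words).most_common(n)

# Made-up sample text, not the Wikipedia paragraph.
sample = "the cat, the dog. the end, cat"
for word, freq in top_words(sample, 2):
    print(word, freq)
```

Because ties may come out in any order, only the two unambiguous top entries are printed here.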

Source: http://codingdojang.com/scode/634?orderby=&langby=python#answer-filter-area

My code

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt
import collections

# Read the file
with open('text_input.txt') as data:
    file = data.read()
# file = 'As the country became embroiled in a domestic crisis, the first government was dislodged and succeeded by several different administrations. Bolikango served as Deputy Prime Minister in one of the new governments before a partial state of stability was reestablished in 1961. He mediated between warring factions in the Congo and briefly served once again as Deputy Prime Minister in 1962 before returning to the parliamentary opposition. After Joseph-Desire Mobutu took power in 1965, Bolikango became a minister in his government. Mobutu soon dismissed him but appointed him to the political bureau of the Mouvement Populaire de la Revolution. Bolikango left the bureau in 1970. He left Parliament in 1975 and died seven years later. His grandson created the Jean Bolikango Foundation in his memory to promote social progress. The President of the Congo posthumously awarded Bolikango a medal in 2005 for his long career in public service.'

# Preprocessing: remove separators such as , and .
file = file.replace(",","")
file = file.replace(".","")

# Count words
text = wt(file)
text_count = collections.Counter(text)
final = text_count.most_common(10)  # most_common(10): the 10 most frequent words

for i,j in final:
    print(i,j)

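Explanation

`nltk`: a package for natural language processing.
`from nltk.tokenize import word_tokenize as wt`: the `nltk.tokenize` module contains the `word_tokenize` function, which splits text into word-level tokens.
- What is a token? A small unit that a long string is split into for analysis; the splitting process itself is called tokenization. A tokenizer function takes a string and returns a list of token strings.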
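
To make the idea of a token concrete, here is a minimal stand-in tokenizer built on the standard `re` module. It is not `word_tokenize` and does not claim to follow its rules; `simple_tokenize` is a hypothetical helper that only illustrates "string in, list of token strings out". It also shows why punctuation is worth removing before counting when a tokenizer keeps punctuation marks as tokens of their own.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split text into word tokens and punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("He left, then died.")
print(tokens)  # ['He', 'left', ',', 'then', 'died', '.']
```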
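`with expression as target: suite`: used to read and write files; the file is closed automatically when the block ends. The file can also be read as shown below.
```
f = open('text_input.txt', 'r')
print(f.readline(), end="")  # readline() reads a single line
f.close()  # without `with`, the file has to be closed explicitly
```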
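
The difference between the two ways of opening a file can be sketched like this. To stay self-contained, the example writes its own temporary file instead of assuming `text_input.txt` exists; the file content is made up.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

# Manual style: the file stays open until close() is called.
f = open(path, "r")
try:
    first_line = f.readline()  # reads up to and including the first newline
finally:
    f.close()

# with style: the file is closed automatically when the block ends.
with open(path) as data:
    whole_text = data.read()   # reads the entire file

os.remove(path)
print(repr(first_line), f.closed, data.closed)
```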
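`file.replace(",","")`: removes the `,` separator from the text in `file` by replacing it with an empty string (not with a space).
`text = wt(file)`: uses the `word_tokenize` function imported earlier to split the text into words.
`text_count = collections.Counter(text)`: the `Counter` class in the `collections` module counts how often each item occurs and stores the counts in a dictionary-like structure.
`final = text_count.most_common(10)`: returns only the 10 most frequent items as (word, count) pairs.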
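
As a small illustration of the last lines, the sketch below runs `str.replace`, `Counter`, and `most_common` on a made-up sentence. `str.split()` stands in for the tokenizer so that no NLTK download is needed.

```python
from collections import Counter

text = "to be, or not to be."  # made-up sample
# replace() returns a new string; the original is unchanged unless reassigned.
cleaned = text.replace(",", "").replace(".", "")
counts = Counter(cleaned.split())

print(counts["to"])           # look up a count like a dictionary: 2
print(counts["missing"])      # absent keys give 0 instead of raising KeyError
print(counts.most_common(2))  # (word, count) pairs, highest count first
```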

Other solutions (from the people I study with): I studied them using the blog linked below.

Reference link 1

Final code

### File Input
with open('text_input.txt') as data:
    file = data.read()

# Split
file = file.replace(",", "").replace(".", "").split()  # split(): split the string on whitespace
word_list_set = list(set(file))  # unique words, duplicates removed

# Count
word_cnt = []
for i in word_list_set:
    word_cnt.append([file.count(i), i])  # file.count(i): number of items in file equal to i

# Sort
# sorted(word_cnt, reverse=True)
word_cnt.sort(key=lambda x: x[0], reverse=True)

# Print
for i in range(10):
    print(word_cnt[i][0], word_cnt[i][1])

![image](https://user-images.githubusercontent.com/45617225/83722967-d6e77200-a678-11ea-8b6d-efaed2bde48c.png)
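
The final code calls `file.count(i)` once per distinct word, and every call scans the whole list again. For a short paragraph that is fine; for larger inputs a single pass with a dictionary does the same job. A minimal sketch of that alternative, using a made-up word list rather than the real input:

```python
words = ["in", "the", "in", "a", "the", "in"]  # made-up sample

word_freq = {}
for w in words:
    # get() returns 0 the first time a word is seen.
    word_freq[w] = word_freq.get(w, 0) + 1

# Sort (word, count) pairs by count, highest first.
ranked = sorted(word_freq.items(), key=lambda pair: pair[1], reverse=True)
for word, freq in ranked:
    print(word, freq)
```

This is a swapped-in technique, not what the code above does; `collections.Counter` wraps the same idea.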

## Details, step by step
### step1. replace, split
```python
with open('text_input.txt') as data:
    file = data.read()

file = file.replace(",", "").replace(".", "").split()  # split(): split on whitespace
file
```
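
The behaviour of `split()` with no argument is worth a quick check, since it is what lets this step cope with line breaks or doubled spaces. The string below is made up:

```python
raw = "one,  two.\nthree"  # made-up sample with a double space and a newline
cleaned = raw.replace(",", "").replace(".", "")  # calls chain left to right

no_arg = cleaned.split()       # any run of whitespace is one separator
one_space = cleaned.split(" ") # a single space only; empty strings can appear

print(no_arg)     # ['one', 'two', 'three']
print(one_space)  # ['one', '', 'two\nthree']
```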

### step2. set

# assumes file still holds the raw text from data.read(); a list has no replace()
file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = set(file)
word_list_set

✔️ `set` represents the data as a set, so the output shows the elements inside {}. It also drops duplicate elements, which is why the result is shorter than the original list.
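
A small check of both properties, on a made-up word list:

```python
words = ["the", "in", "the", "a", "in", "the"]  # made-up sample
unique_words = set(words)

print(len(words), len(unique_words))  # 6 3
print("the" in unique_words)          # membership test: True
# The iteration order of a set is not guaranteed, so sort it for a stable view.
print(sorted(unique_words))           # ['a', 'in', 'the']
```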

### step3. list(set(...))

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = list(set(file))
word_list_set

So `word_list_set` is converted from a set into a list!
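
One practical difference is that a set cannot be indexed while a list can. The sketch below shows this on made-up data. A `for` loop can iterate over a set directly, so the conversion matters mainly when positions or in-place sorting are needed.

```python
unique_words = {"the", "in", "a"}  # made-up sample

try:
    unique_words[0]            # sets have no positions
    set_is_indexable = True
except TypeError:
    set_is_indexable = False

word_list = list(unique_words)
print(set_is_indexable)                 # False
print(type(word_list).__name__)         # list
print(word_list[0] in unique_words)     # indexing works on the list: True
```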

### step4. list_to_search.count(string_to_match)

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = list(set(file))

word_cnt = []
for i in word_list_set:
    word_cnt.append([file.count(i), i]) # file.count(i): number of items in file equal to i

word_cnt
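
It matters that `count` is called on the list of words and not on the raw string: `list.count` counts whole elements, while `str.count` counts substrings. A made-up example:

```python
text = "in the minister"  # made-up sample

as_substring = text.count("in")        # also matches the "in" inside "minister"
as_word = text.split().count("in")     # only elements equal to "in"
missing = text.split().count("zzz")    # 0, no error for a missing value

print(as_substring, as_word, missing)  # 2 1 0
```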

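### step5. sort, sorted

There are two ways to sort: `variable.sort(key=..., reverse=...)`, which sorts the list in place, and `sorted(variable, reverse=True)`, which returns a new sorted list and leaves the original unchanged. Without a `key`, the inner lists are compared element by element, so the first column (column index 0) is the primary sort criterion and later columns only break ties.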
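
The two calls can give different orders for tied counts, which the made-up data below shows:

```python
word_cnt = [[2, "a"], [3, "in"], [2, "the"]]  # made-up [count, word] pairs

# sorted() returns a new list and compares the inner lists element by element:
# counts first, then the words to break ties.
by_whole_item = sorted(word_cnt, reverse=True)
print(by_whole_item)  # [[3, 'in'], [2, 'the'], [2, 'a']]
print(word_cnt)       # original order is unchanged

# list.sort() reorders in place and returns None. The key looks at counts only,
# so tied entries keep their original relative order.
word_cnt.sort(key=lambda x: x[0], reverse=True)
print(word_cnt)       # [[3, 'in'], [2, 'a'], [2, 'the']]
```

Since the problem ignores the order of tied words, either result is acceptable.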

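### step6. print

Print the result.
At first I printed each entry as a whole, but the output was still wrapped in a list and the word came out as a quoted string like 'in'. -> ❌
So the first and second elements have to be printed separately.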
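
The difference between the two attempts can be reproduced with a couple of made-up entries. The last part is an optional variation, not the author's code: unpacking each pair lets the word come first, which matches the "word frequency" layout in the problem statement.

```python
word_cnt = [[12, "in"], [10, "the"]]  # made-up [count, word] pairs

for item in word_cnt:
    print(item)              # whole list: [12, 'in']

for item in word_cnt:
    print(item[0], item[1])  # elements separately: 12 in

# Optional variation: unpack each pair and put the word first.
lines = [f"{word} {freq}" for freq, word in word_cnt]
print("\n".join(lines))
```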

jeeyeonLIM / coding_test

Level2. Word Counting #5
