jeeyeonLIM / coding_test

Let's practice the coding test!

Level2. Word Counting #5

Open jeeyeonLIM opened 4 years ago

jeeyeonLIM commented 4 years ago

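Problem

Given a text as input, write a program that counts its words. "Character counting" and "word counting" are the kind of problems that show whether someone has successfully gotten started with programming. Many people could now write them with their eyes closed, but they are surprisingly hard at first. Let's bring back the memories.

Input

Create a text file with the content below in advance; when the program runs, it reads the file's contents (source: Wikipedia).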

As the country became embroiled in a domestic crisis, the first government was dislodged and succeeded by several different administrations. Bolikango served as Deputy Prime Minister in one of the new governments before a partial state of stability was reestablished in 1961. He mediated between warring factions in the Congo and briefly served once again as Deputy Prime Minister in 1962 before returning to the parliamentary opposition. After Joseph-Desire Mobutu took power in 1965, Bolikango became a minister in his government. Mobutu soon dismissed him but appointed him to the political bureau of the Mouvement Populaire de la Revolution. Bolikango left the bureau in 1970. He left Parliament in 1975 and died seven years later. His grandson created the Jean Bolikango Foundation in his memory to promote social progress. The President of the Congo posthumously awarded Bolikango a medal in 2005 for his long career in public service.

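Output

The separators are the period '.', the comma ',', and the space ' '. Print the 10 most frequent words together with their frequencies, most frequent first. The order among words with the same frequency does not matter.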

in 12
the 10
Bolikango 5
a 4
of 4
and 3
to 3
his 3
became 2
government 2
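
The specification above can be sketched with the standard library alone. This is a minimal sketch, not the solution posted in this thread: the function name `top_words` and the sample sentence are hypothetical, and the pattern also treats newlines and tabs as separators, which is an assumption beyond the three separators listed.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words, splitting on '.', ',' and whitespace."""
    # A run of consecutive separators counts as a single split point;
    # empty strings left at the edges are filtered out.
    words = [w for w in re.split(r"[.,\s]+", text) if w]
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "the cat saw the dog, and the dog saw the cat."
    for word, count in top_words(sample, 3):
        print(word, count)
```

`Counter.most_common(n)` returns `(word, count)` tuples sorted by count, so no separate sort step is needed.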
jeeyeonLIM commented 4 years ago

My code

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt
import collections

# Read the file
with open('text_input.txt') as data:
    file = data.read()
# file = 'As the country became embroiled in a domestic crisis, the first government was dislodged and succeeded by several different administrations. Bolikango served as Deputy Prime Minister in one of the new governments before a partial state of stability was reestablished in 1961. He mediated between warring factions in the Congo and briefly served once again as Deputy Prime Minister in 1962 before returning to the parliamentary opposition. After Joseph-Desire Mobutu took power in 1965, Bolikango became a minister in his government. Mobutu soon dismissed him but appointed him to the political bureau of the Mouvement Populaire de la Revolution. Bolikango left the bureau in 1970. He left Parliament in 1975 and died seven years later. His grandson created the Jean Bolikango Foundation in his memory to promote social progress. The President of the Congo posthumously awarded Bolikango a medal in 2005 for his long career in public service.'

# Preprocessing: remove the separators ',' and '.'
file = file.replace(",","")
file = file.replace(".","")

# Count the words
text = wt(file)
text_count = collections.Counter(text)
final = text_count.most_common(10) # most_common(10): the 10 most frequent words

for i,j in final:
    print(i,j)
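
One thing to watch in the preprocessing step: replacing a separator with an empty string glues together words that had no space around the separator. In the sample text every '.' and ',' is followed by a space or ends the text, so the code above is not affected. The sketch below (the helper name `split_on_separators` is hypothetical) replaces each separator with a space instead and relies on `str.split()`, which also avoids the NLTK download.

```python
def split_on_separators(text, separators=".,"):
    """Split text on whitespace after turning each separator into a space."""
    for sep in separators:
        text = text.replace(sep, " ")
    # split() with no argument collapses runs of whitespace.
    return text.split()


if __name__ == "__main__":
    tricky = "one,two.three four"
    # Deleting the separators merges the first three words into one token.
    print(tricky.replace(",", "").replace(".", "").split())
    # Replacing them with spaces keeps all four words apart.
    print(split_on_separators(tricky))
```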

image

Explanation

jeeyeonLIM commented 4 years ago

Another solution (from the people in my study group). I studied it by working through the blog post below.

Split

file = file.replace(",","").replace(".","").split() # split(): split the string on whitespace
word_list_set = list(set(file)) # the set of distinct words, as a list

Count

word_cnt = []
for i in word_list_set:
    word_cnt.append([file.count(i), i]) # file.count(i): number of items in file equal to i

Sort

sorted(word_cnt, reverse=True)

word_cnt.sort(key= lambda x:x[0], reverse=True)
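
The two calls above differ in a way that is easy to miss. `sorted()` returns a new list and leaves `word_cnt` unchanged, so its result is lost unless it is assigned, while `list.sort()` reorders `word_cnt` in place and returns `None`. A small sketch with made-up data:

```python
word_cnt = [[2, "became"], [12, "in"], [2, "government"], [10, "the"]]

# sorted() builds a new list; the original keeps its old order.
# Without a key, [count, word] pairs compare by count first, then by word.
ranked = sorted(word_cnt, reverse=True)

# list.sort() sorts in place and returns None.
# With key=lambda x: x[0] only the count is compared; the sort is stable,
# so pairs with equal counts keep their original relative order.
result = word_cnt.sort(key=lambda x: x[0], reverse=True)

print(ranked)
print(word_cnt)
print(result)
```

Since the problem says the order among ties does not matter, either call is acceptable here as long as the sorted result is the one that gets printed.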

Print

for i in range(10): print(word_cnt[i][0], word_cnt[i][1])

![image](https://user-images.githubusercontent.com/45617225/83722967-d6e77200-a678-11ea-8b6d-efaed2bde48c.png)

## Details, step by step
### step1. replace, split
```python
with open('text_input.txt') as data:
    file = data.read()

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
file
```

image

### step2. set

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = set(file)
word_list_set

image

✔️ `set` represents the data as a set: the output shows the elements inside {}, and duplicate elements are removed. That is why, as shown below, the result is shorter once the duplicate values are excluded.

image
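
A small sketch of the point above, using a made-up word list: converting to a `set` drops the duplicates, so the length shrinks. A set has no defined element order and cannot be indexed, so wrapping it in `list()` gives a sequence that can be indexed and sorted.

```python
words = ["in", "the", "in", "a", "the", "in"]
unique_words = set(words)

print(len(words))         # number of words, duplicates included
print(len(unique_words))  # number of distinct words
print(unique_words)       # displayed inside {}; element order is arbitrary
```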

### step3. list(set())

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = list(set(file))
word_list_set

image

### step4. list_to_search.count(string_to_compare)

file = file.replace(",","").replace(".","").split() # split(): split on whitespace
word_list_set = list(set(file))

word_cnt = []
for i in word_list_set:
    word_cnt.append([file.count(i), i]) # file.count(i): number of items in file equal to i

word_cnt

image
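
Calling `file.count(i)` for every distinct word rescans the whole list each time. For a single paragraph that is fine; for longer input, one pass with a dictionary gives the same counts. A minimal sketch with made-up data (the name `count_words` is hypothetical):

```python
def count_words(words):
    """Count the occurrences of each word in one pass over the list."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Same [count, word] layout as word_cnt in the step above.
    return [[count, word] for word, count in counts.items()]


if __name__ == "__main__":
    words = ["in", "the", "in", "a", "the", "in"]
    print(count_words(words))
```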

### step5. sort, sorted

image

### step6. print
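
Steps 1 through 6 can be combined into one function. The following is a sketch of the study-group approach (`set`, `list.count`, `sort`), not a verbatim copy of it: the function name `top_words_by_count` and the sample sentence are hypothetical, and it slices with `[:n]` instead of looping over `range(10)` so that a text with fewer than ten distinct words does not raise `IndexError`.

```python
def top_words_by_count(text, n=10):
    """Return up to n [count, word] pairs, most frequent first."""
    # step1: drop the separators, then split on whitespace
    words = text.replace(",", "").replace(".", "").split()
    # step2-3: the distinct words, as a list
    distinct = list(set(words))
    # step4: count each distinct word
    word_cnt = [[words.count(w), w] for w in distinct]
    # step5: sort in place by count, highest first
    word_cnt.sort(key=lambda x: x[0], reverse=True)
    # step6: slicing never raises, even when fewer than n words exist
    return word_cnt[:n]


if __name__ == "__main__":
    sample = "He left the bureau in 1970. He left Parliament in 1975."
    for count, word in top_words_by_count(sample, 3):
        print(count, word)
```

Because a set has no defined order, words with equal counts may come out in a different order from run to run, which the problem statement allows.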