ML/DL - 5d (train/test, hash values, seed, stratified sampling, hist)

vmtmxmf5 commented 3 years ago

Explained step by step

from sklearn.model_selection import train_test_split

# random state = seed
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

train_test_split shuffles the data by default (shuffle=True)

Important

# Shuffle data that is stored in sequential order
import numpy as np

np.random.permutation(100) # the integers 0-99 in random order

def split_train_test(data, test_ratio=0.2):
    # Random permutation of row positions, one per row of data
    shuffled_indices = np.random.permutation(len(data))

    # Size of the test set
    test_set_size = int(len(data) * test_ratio)

    # Row positions for the test set
    test_indices = shuffled_indices[:test_set_size]

    # Row positions for the training set
    train_indices = shuffled_indices[test_set_size:]

    # Use iloc because these are positions, not index labels
    return data.iloc[train_indices], data.iloc[test_indices]

df_train, df_test = split_train_test(housing)

Important

What is the problem with this approach?

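Each run reshuffles the data, so every run produces a different train/test split. If you keep rerunning, the model (and you) will eventually have seen every row, including rows that were supposed to stay in the test set. The test set then no longer measures generalization, and the results look better than they really are (overfitting to the test data).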
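
A minimal sketch of this effect, using only the standard library instead of NumPy. The 100-row dataset and the 50 reruns are made-up numbers for illustration: every rerun without a fixed seed picks a different 20% test set, so the set of rows that have ever been in a test set keeps growing.

```python
import random

N_ROWS = 100          # hypothetical dataset size
TEST_RATIO = 0.2
N_RUNS = 50           # hypothetical number of reruns

rng = random.Random()  # deliberately unseeded, like a fresh session on every run
test_size = int(N_ROWS * TEST_RATIO)

ever_in_test = set()
for run in range(N_RUNS):
    # Each run draws a fresh random test set of row positions
    test_rows = rng.sample(range(N_ROWS), test_size)
    ever_in_test.update(test_rows)

print(f"rows that were in a test set at least once: {len(ever_in_test)} / {N_ROWS}")
```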

Solutions

Fix the random seed: np.random.seed(42)

-> This stops working once the number of rows changes: the permutation changes, so rows move between train and test.

np.random.seed(42)
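
A small sketch of why a fixed seed alone is fragile. The helper name seeded_test_rows and the row counts are hypothetical; the split logic mirrors split_train_test above. With the same seed and the same number of rows the test set is identical, but one appended row already produces a different permutation.

```python
import numpy as np

def seeded_test_rows(n_rows, test_ratio=0.2, seed=42):
    """Return the sorted row positions that a seeded shuffle puts in the test set."""
    np.random.seed(seed)
    shuffled = np.random.permutation(n_rows)
    return sorted(shuffled[: int(n_rows * test_ratio)].tolist())

before = seeded_test_rows(1000)   # original dataset
again = seeded_test_rows(1000)    # same size, same seed
after = seeded_test_rows(1001)    # one extra row appended

print("reproducible for the same size:", before == again)

# Row positions that are in one test set but not in the other
moved = set(before) ^ set(after)
print("positions whose test membership differs after adding one row:", len(moved))
```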

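Compute a hash of each sample's (row's) identifier and send the sample to the test set if the hash is below 20% of the maximum hash value.

Hash value = a fixed value derived from the data: (1) original index -> (2) hash function -> (3) hash value (always the same for the same index)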

from zlib import crc32

def test_set_check(identifier, test_ratio=0.2):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * (2**32)
# The 20:80 ratio is only approximate. Hash values are spread roughly evenly over 0..2**32, so with a small dataset the test share can drift from test_ratio by chance.

test_set_check(2)

True means the sample goes to the test set, because its hash is below 20% of 2**32.

False means the sample goes to the training set.

identifier = the id of a single row (here, its index value)
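
The property that matters is stability: a row's assignment depends only on its own id, not on how many rows exist. The sketch below checks that with the standard library only. The name in_test_set is hypothetical, and int.to_bytes is an assumption standing in for np.int64 (it should produce the same 8 bytes on little-endian machines).

```python
from zlib import crc32

def in_test_set(identifier, test_ratio=0.2):
    # Hash the id as 8 little-endian bytes, then compare the hash
    # against test_ratio of the 32-bit hash range
    data = int(identifier).to_bytes(8, "little", signed=True)
    return (crc32(data) & 0xFFFFFFFF) < test_ratio * 2**32

old_ids = range(1000)   # ids before new data arrives
new_ids = range(1200)   # same ids plus 200 appended rows

old_test = {i for i in old_ids if in_test_set(i)}
new_test = {i for i in new_ids if in_test_set(i)}

# Every old id keeps its assignment after the append
unchanged = old_test == {i for i in new_test if i < 1000}
print("old assignments unchanged:", unchanged)
print("test fraction:", len(new_test) / len(new_ids))
```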

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]

    # Boolean mask: True for rows that belong in the test set
    in_test_set = ids.apply(lambda _id : test_set_check(_id, test_ratio))

    return data.loc[~in_test_set], data.loc[in_test_set]

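In the housing data, the only thing that can serve as an identifier (id) is the row index.

But this comes with constraints

New data may only be appended at the end, and no row may ever be deleted.

==> Using latitude and longitude as the unique identifier is better than the row index, because they do not change.

==> If there are NaN values, remove them before splitting into train and test.

Unique identifier: row index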

housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index')

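Unique identifier: latitude and longitude

# Scale longitude before adding: a plain sum gives many rows the same id, and test_set_check truncates the id to an integer
housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']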

train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')
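
When an id is built from coordinates, a plain sum lets different places share the same id, while scaling longitude first keeps them apart. The sketch compares the two; the coordinates and the helper names id_sum and id_scaled are made up for illustration. Separately, test_set_check converts the id with np.int64, which drops the fractional part, so ids that differ only after the decimal point still end up in the same bucket.

```python
# Two made-up locations whose longitude + latitude sums are identical
loc_a = (-122.0, 37.5)   # (longitude, latitude)
loc_b = (-121.5, 37.0)

def id_sum(lon, lat):
    # Plain sum: the two coordinates can cancel each other out
    return lon + lat

def id_scaled(lon, lat):
    # Scaling longitude moves it to different digits than latitude
    return lon * 1000 + lat

print(id_sum(*loc_a), id_sum(*loc_b))        # same id for two different places
print(id_scaled(*loc_a), id_scaled(*loc_b))  # two different ids
```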

vmtmxmf5 commented 3 years ago

Important

# Use histograms to get a feel for each feature, e.g. whether it needs scaling
import matplotlib.pyplot as plt

# 30, 40, or 50 bins are common choices
housing.hist(bins=50, figsize=(20, 15))
plt.show()
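
To see what bins=50 means numerically, here is a rough sketch with np.histogram on synthetic values (a stand-in, not the housing data). It assumes the plotted histogram bins the same way: 50 equal-width intervals between the minimum and maximum, with every value counted exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one numeric column
values = rng.normal(loc=3.9, scale=1.9, size=1000)

counts, edges = np.histogram(values, bins=50)
print(len(counts), "bins,", len(edges), "edges")
print("bin width:", edges[1] - edges[0])
print("total count:", counts.sum())
```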

vmtmxmf5 commented 3 years ago

## Download the file
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"

# Set the directory: <working directory>/datasets/housing
HOUSING_PATH = os.path.join("datasets", "housing")

# URL of the file to download
HOUSING_URL  = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url= HOUSING_URL, housing_path= HOUSING_PATH):
  # Create the directory (no error if it already exists)
  os.makedirs(housing_path, exist_ok= True)

  # Local path of the archive, joined in an OS-independent way
  tgz_path = os.path.join(housing_path, "housing.tgz")

  # Download the file at the given URL
  urllib.request.urlretrieve(housing_url, tgz_path)

  # Open the downloaded archive
  housing_tgz = tarfile.open(tgz_path)

  # Extract the archive
  housing_tgz.extractall(path= housing_path)
  housing_tgz.close()
fetch_housing_data()

import pandas as pd

def load_housing_data(housing=HOUSING_PATH, filename='housing.csv'):
    csv_path = os.path.join(housing, filename)
    return pd.read_csv(csv_path)

housing = load_housing_data()
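
The extract-then-load steps can be tried without a network connection. This sketch builds a tiny stand-in archive in a temporary directory (sample.tgz and sample.csv are made-up names, and the csv module stands in for pandas), extracts it, and reads the CSV back. Using with-blocks closes the archive even if extraction fails.

```python
import csv
import io
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny stand-in archive (the real one comes from HOUSING_URL)
    csv_bytes = b"longitude,latitude\n-122.23,37.88\n"
    tgz_path = os.path.join(workdir, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as tgz:
        info = tarfile.TarInfo(name="sample.csv")
        info.size = len(csv_bytes)
        tgz.addfile(info, io.BytesIO(csv_bytes))

    # Extract next to the archive; the with-block closes the file
    with tarfile.open(tgz_path) as tgz:
        tgz.extractall(path=workdir)

    # Load the extracted CSV
    with open(os.path.join(workdir, "sample.csv"), newline="") as f:
        rows = list(csv.DictReader(f))

print(rows)
```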

vmtmxmf5 commented 3 years ago

Important

Stratified sampling

housing['income_cat'] = pd.cut(
    housing['median_income'], # column to stratify on
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf], # bin edges of the strata
    labels=[1, 2, 3, 4, 5]
    )

housing['income_cat'].value_counts()
housing['income_cat'].hist()
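
A quick check of how pd.cut assigns values to these bins, on a few made-up incomes. Each bin includes its right edge by default, so 1.5 falls in category 1 and 4.5 in category 3. The left edge of the first bin is excluded, so a value of exactly 0.0 would come out as NaN.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0.5, 1.5, 3.2, 4.5, 8.0])  # made-up median_income values
cats = pd.cut(
    incomes,
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],  # intervals (0, 1.5], (1.5, 3.0], ...
    labels=[1, 2, 3, 4, 5],
)
print(cats.tolist())
```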

from sklearn.model_selection import StratifiedShuffleSplit

strat_split = StratifiedShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=42
    )

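Splitting with strat_split yields the indices of the train and test data

strat_split.split(X, y) is a generator. It yields one (train_idx, test_idx) pair of integer position arrays per split, so with n_splits=1 the loop body runs once.

The split below is a method of StratifiedShuffleSplit, not Python's built-in str.split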

for train_idx, test_idx in strat_split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_idx]
    strat_test_set = housing.loc[test_idx]

# Check the income category proportions in the test set
strat_test_set['income_cat'].value_counts() / len(strat_test_set)
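
The same comparison can be tried on synthetic labels, where the true proportions are known in advance. This is a sketch under made-up assumptions (four strata at 10/20/30/40%, dummy features); it also shows that split() hands back one pair of position arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 4 strata with proportions 10/20/30/40 %
y = np.repeat([1, 2, 3, 4], [100, 200, 300, 400])
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() is a generator; with n_splits=1 it yields a single pair of position arrays
train_idx, test_idx = next(splitter.split(X, y))

for stratum in (1, 2, 3, 4):
    overall = np.mean(y == stratum)
    in_test = np.mean(y[test_idx] == stratum)
    print(stratum, round(overall, 3), round(in_test, 3))
```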

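Once the split is done, income_cat can be dropped or kept (multicollinearity is not treated as a concern here)

housing['income_cat'] = pd.cut(
    housing['median_income'], # column to stratify on
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf], # bin edges of the strata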
    labels=[1, 2, 3, 4, 5]
    )

for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)

vmtmxmf5 / Python-ML-DNN

ML/DL - 5d (train/test, hash values, seed, stratified sampling, hist) #32
