Check test set distribution

rhycha commented 3 months ago

We forgot to run EDA on the test set. Look at the difference in data ranges between train and test, and take it into account when benchmarking.
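A minimal sketch of that range check, using only the standard library. The helper names (`column_ranges`, `compare_ranges`) and the toy rows are made up for illustration; the real data would be loaded from the competition files instead.

```python
def column_ranges(rows, columns):
    """Return {column: (min, max)} for a list of dict rows."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


def compare_ranges(train_rows, test_rows, columns):
    """Flag columns whose test range extends beyond the train range."""
    train_rng = column_ranges(train_rows, columns)
    test_rng = column_ranges(test_rows, columns)
    report = {}
    for c in columns:
        (tr_lo, tr_hi), (te_lo, te_hi) = train_rng[c], test_rng[c]
        report[c] = {
            "train": (tr_lo, tr_hi),
            "test": (te_lo, te_hi),
            "test_outside_train": te_lo < tr_lo or te_hi > tr_hi,
        }
    return report


# Toy rows standing in for the real train/test files (values are made up).
train_rows = [{"x3": 0.0, "x9": -1.0}, {"x3": 0.5, "x9": 0.0}, {"x3": 1.0, "x9": 1.0}]
test_rows = [{"x3": 0.2, "x9": -2.0}, {"x3": 0.8, "x9": 0.5}]

report = compare_ranges(train_rows, test_rows, ["x3", "x9"])
for col, info in report.items():
    print(col, info)
```

Any column flagged `test_outside_train` is one where a benchmark built only from train would be extrapolating.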

rhycha commented 3 months ago

They deliberately made the distributions extremely different. Let's check the distribution of the train-set samples that have higher y values.
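One way to run this check, as a rough sketch: keep only the train rows whose y is at or above a given percentile and summarize a feature on that subset. The function names and the toy data are assumptions for illustration; the rank-based cutoff is a simple choice, not necessarily the one used for the figures in this thread.

```python
import statistics


def y_cutoff(ys, pct):
    """Empirical pct-th percentile of y (0 < pct < 100), by sorted-rank lookup."""
    ordered = sorted(ys)
    idx = min(int(pct * len(ordered) // 100), len(ordered) - 1)
    return ordered[idx]


def top_y_summary(rows, feature, pct):
    """Summarize `feature` over the rows whose y is at or above the pct-th percentile."""
    cut = y_cutoff([r["y"] for r in rows], pct)
    values = sorted(r[feature] for r in rows if r["y"] >= cut)
    return {
        "n": len(values),
        "min": values[0],
        "median": statistics.median(values),
        "max": values[-1],
    }


# Toy train set (made up): y grows with x3, so high-y rows sit at high x3.
rows = [{"x3": i / 100, "y": float(i)} for i in range(100)]

for pct in (85, 90, 95, 98, 99):
    print(pct, top_y_summary(rows, "x3", pct))
```

Comparing the `min` and `max` of each subset against the full train range shows where a feature's domain gets cut as the threshold rises.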

rhycha commented 3 months ago

Before that, check this box plot comparison. The left one is the test dataset.

rhycha commented 3 months ago

The right figure is the first quantile of the train data. As you can see, the distribution becomes monotonic.

rhycha commented 3 months ago

85 percent

rhycha commented 3 months ago

90 percent. From here on, the left figure is the original train set.

kyungheee commented 3 months ago

@rhycha I don't really follow what this means, but a round of applause!

rhycha commented 3 months ago

95 percent

rhycha commented 3 months ago

98 percent

rhycha commented 3 months ago

99 percent

rhycha commented 3 months ago

99.5 percent

rhycha commented 3 months ago

99.9 percent

kyungheee commented 3 months ago

@rhycha Looking forward to it! 😀

rhycha commented 3 months ago

Caution when using this reasoning: using evaluation dataset information in model training or inference (data leakage) means exclusion from the awards.

Conclusion: the high y values show up by about as much as the test distribution minus the train distribution.

The x-axis domain ranges of train and test are almost exactly the same.

Implications - strategy

Undersample where the data is dense or y is low, and oversample where the data is sparse or y is high; that way we can reproduce the test data distribution accurately.
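A rough sketch of the y-based half of this idea: draw train rows with replacement, weighting each row by the rank of its y. The weighting rule, the `power` knob, and the function names are hypothetical, and the density-based half (dense vs. sparse regions) is not covered here. The weights depend only on train y, in line with the data leakage rule above.

```python
import random


def rank_weights(ys, power=2.0):
    """Weight each row by (rank of its y / n) ** power.

    `power` is a hypothetical knob: 0 gives uniform sampling, larger values
    lean harder toward the top of the y range.
    """
    n = len(ys)
    order = sorted(range(n), key=lambda i: ys[i])
    weights = [0.0] * n
    for rank, i in enumerate(order, start=1):
        weights[i] = (rank / n) ** power
    return weights


def resample(rows, k, power=2.0, seed=0):
    """Draw k rows with replacement: low-y rows are undersampled, high-y rows oversampled."""
    rng = random.Random(seed)
    weights = rank_weights([r["y"] for r in rows], power)
    return rng.choices(rows, weights=weights, k=k)


# Toy train set (made up) to show the shift toward high y.
rows = [{"x3": i / 100, "y": float(i)} for i in range(100)]
sample = resample(rows, k=1000)

mean_before = sum(r["y"] for r in rows) / len(rows)
mean_after = sum(r["y"] for r in sample) / len(sample)
print(mean_before, mean_after)
```

Keeping `power` moderate leaves the low-y rows in the sample, just less often, rather than dropping them outright.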

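Caveat: since this is a 90% prediction, we must not cherry-pick too aggressively. (If it were a top-1% prediction, we could ignore the low y values in training altogether.)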

Details: the best y values sit on the opposite side of the distribution.

Where the domain range gets cut at 90%: x3 at 0.80; x7 at -0.20 and -0.06; x8 is sharp on the left and decreases exponentially; x9 is cut at 0.25.

kyungheee / 2024-Samsung-AI-Challenge-Black-box-Optimization