motefly / DeepGBM

SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks
Which one is the Zillow test set? #25

Closed hanfu closed 4 years ago

hanfu commented 4 years ago

Which one is the Zillow test set? Which one is y_test?

motefly commented 4 years ago

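Please see https://www.kaggle.com/c/zillow-prize-1/data and https://github.com/motefly/DeepGBM/blob/master/preprocess/encoding_cate.py#L135. I found some code that was used earlier to process the Zillow data; I hope it helps. The test set should have been split off from this data by ratio on our side.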

import pandas as pd

# Load the Kaggle training labels and the per-parcel property table.
train = pd.read_csv('train_2016_v2.csv')
properties = pd.read_csv('properties_2016.csv')
print("Shape Of Train: ", train.shape)
print("Shape Of Properties: ", properties.shape)
# Attach the property features to each labeled row, joining on parcelid.
merged = pd.merge(train, properties, on="parcelid", how="left")

drop_cols = ["parcelid", "transactiondate", "assessmentyear"]
merged = merged.drop(drop_cols, axis=1)
# Numerical feature columns (not used further in this snippet).
num_cols = [
    'bathroomcnt', 'bedroomcnt', 'calculatedbathnbr', 'threequarterbathnbr',
    'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet', 'finishedsquarefeet6',
    'finishedsquarefeet12', 'finishedsquarefeet13', 'finishedsquarefeet15',
    'finishedsquarefeet50', 'fireplacecnt', 'fullbathcnt', 'garagecarcnt', 'garagetotalsqft',
    'latitude', 'longitude', 'lotsizesquarefeet', 'numberofstories', 'poolcnt', 'poolsizesum',
    'roomcnt', 'unitcnt', 'yardbuildingsqft17', 'taxvaluedollarcnt',
    'structuretaxvaluedollarcnt', 'landtaxvaluedollarcnt', 'taxamount', 'taxdelinquencyyear', 'yearbuilt',
]
# Duplicate yearbuilt into a second column named yearbuilt_cate.
new_cate_cols = ['yearbuilt']
for col in new_cate_cols:
    merged[col + '_cate'] = merged[col]
merged.to_csv('zillow_all.csv', index=False)

hanfu commented 4 years ago

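Thanks for the reply. Kaggle only provides the training set. The encoding code is the code that specifies the label. The snippet you gave merges the properties table with the training table. So my understanding is that, for Zillow, only the training set has both samples and labels, and the test set has to be split off from the training set ourselves. Is that how it is done? Thanks again!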

motefly commented 4 years ago

Yes, that should be right.

hanfu commented 4 years ago

Thanks! Which of the datasets come with complete labels for both training and test?

motefly commented 4 years ago

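That doesn't matter. In every case the test set can be cut from whatever complete data was released (many competitions probably release only a complete training set).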
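
For readers who want to reproduce such a cut, here is a minimal sketch of a ratio-based split. It is not the DeepGBM authors' actual split: the helper name `split_indices`, the 0.2 test ratio, the seed, and the toy rows are all assumptions made for illustration. The only detail taken from the Kaggle data is that `logerror` is the label column in train_2016_v2.csv.

```python
import random


def split_indices(n_rows, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx): a shuffled, disjoint split of range(n_rows).

    Hypothetical helper for illustration; the ratio and seed are assumptions.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_rows * test_ratio))
    return indices[n_test:], indices[:n_test]


# Toy stand-in for the rows of zillow_all.csv (values are made up).
rows = [{"logerror": 0.01 * i, "bedroomcnt": i % 4} for i in range(10)]
train_idx, test_idx = split_indices(len(rows), test_ratio=0.2, seed=0)

# The label column of the held-out rows plays the role of y_test.
y_train = [rows[i]["logerror"] for i in train_idx]
y_test = [rows[i]["logerror"] for i in test_idx]
print(len(y_train), len(y_test))
```

With the merged DataFrame from the snippet earlier in the thread, the same index lists could be passed to `merged.iloc[train_idx]` and `merged.iloc[test_idx]`, and `merged.iloc[test_idx]['logerror']` would then serve as y_test.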