Closed riow1983 closed 3 years ago
Split the data using cleaned_label as the target label and pub_category (cleaned_label sorted into categories) as the group.
import pandas as pd
from sklearn.model_selection import GroupKFold


def get_cv(dataset, num_splits=None, col_target=None, col_group=None):
    """
    Args:
        dataset: pd.DataFrame
        num_splits: int
        col_target: str
        col_group: str
    Returns:
        folds: pd.DataFrame
    """
    X = dataset.index.values
    y = dataset[col_target].values
    groups = dataset[col_group].values
    group_kfold = GroupKFold(n_splits=num_splits)
    folds = []
    for i, (_, test_index) in enumerate(group_kfold.split(X, y, groups)):
        # Rows held out in this split; copy so adding a column does not warn
        X_test = dataset[dataset.index.isin(X[test_index])].copy()
        X_test["fold"] = i + 1  # folds are numbered from 1
        folds.append(X_test)
    # Concat all folds at once
    return pd.concat(folds, ignore_index=True)


folds = get_cv(df, num_splits=5, col_target="cleaned_label", col_group="pub_category")
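@qllolollp @Toru-Ito1 The CV data is ready, at least as a first pass. https://www.kaggle.com/riow1983/nb009-cv?select=folds_pubcat.pkl
The creation logic is as discussed at the last regular meeting:
1) Categorise the labels by visual inspection
2) Compute cosine similarity between the manually assigned categories
3) Connect pairs with cosine similarity of 0.95 or higher (the threshold) by an edge to form a graph
4) Adopt each set of categories connected by edges as a single category (the plan is to join multiple manual categories with "+", but because the threshold is as high as 0.95, there appear to be no differences from the manual categories)
5) Run GroupKFold with the category as the group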
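Steps 2) to 4) of the logic above can be sketched roughly as follows. This is only an illustration under assumptions: the thread does not say how the category names were vectorised, so the toy vectors, the helper name `merge_by_similarity`, and the use of a small union-find instead of a graph library are all hypothetical; the actual implementation is in the linked notebook.

```python
import numpy as np


def merge_by_similarity(names, vectors, threshold=0.95):
    """Join names whose cosine similarity is >= threshold into one category."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity

    # Union-find over the graph whose edges are the pairs above the threshold.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    # Each connected component becomes one category, members joined with " + ".
    components = {}
    for i, name in enumerate(names):
        components.setdefault(find(i), []).append(name)
    merged = {root: " + ".join(sorted(m)) for root, m in components.items()}
    return {name: merged[find(i)] for i, name in enumerate(names)}


# Toy vectors (hypothetical): the first two names point in almost the same direction.
names = ["adni", "alzheimers disease neuroimaging initiative", "tide station"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
categories = merge_by_similarity(names, vectors)
```

The resulting `categories` mapping could then be used as the group column for GroupKFold in step 5).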
For the details of the creation logic, see the notebook below: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb
If there are no problems I will close this issue, so I look forward to your feedback.
Excerpt of the results:
CV: 1 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database' 'agricultural resources management survey'
'arms farm financial and crop production practices'
'baccalaureate and beyond longitudinal study'
'baltimore longitudinal study of aging'
'beginning postsecondary students' 'breeding bird survey'
'cas covid 19 antiviral candidate compounds dataset'
'census of agriculture'
'characterizing health associated risks and your baseline disease in sars cov 2'
'coastal change analysis program' 'common core of data'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 genome sequences'
'covid 19 image data collection' 'covid 19 open research dataset'
'covid 19 our world in data' 'early childhood longitudinal study'
'education longitudinal study' 'ffrdc research and development survey'
'high school longitudinal study'
'higher education research and development survey'
'international best track archive for climate stewardship'
'jh crown registry' 'national assessment of education progress'
'optimum interpolation sea surface temperature'
'program for the international assessment of adult competencies'
'rsna international covid 19 open radiology database'
'rural urban continuum codes' 'school survey on crime and safety'
'seismic system comprehensive catalog' 'storm surge risk'
'survey of doctorate recipients' 'survey of earned doctorates'
'survey of graduate students and postdoctorates in science and engineering'
'survey of industrial research and development'
'survey of science and engineering research facilities'
'survey of state government research and development'
'teacher and principal survey'
'the national institute on aging genetics of alzheimer s disease data storage site'
'tide station' 'trends in international mathematics and science study'
'water level observation network' 'world ocean database']
#### dev pub_category:
['alzheimers disease neuroimaging initiative']
CV: 2 -------------------------------------------------------------------
#### train pub_category:
['agricultural resources management survey'
'alzheimers disease neuroimaging initiative'
'arms farm financial and crop production practices'
'baccalaureate and beyond longitudinal study'
'beginning postsecondary students'
'cas covid 19 antiviral candidate compounds dataset'
'census of agriculture'
'characterizing health associated risks and your baseline disease in sars cov 2'
'coastal change analysis program' 'common core of data'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 genome sequences'
'covid 19 image data collection' 'covid 19 open research dataset'
'covid 19 our world in data' 'early childhood longitudinal study'
'education longitudinal study' 'ffrdc research and development survey'
'higher education research and development survey' 'jh crown registry'
'national assessment of education progress' 'rural urban continuum codes'
'school survey on crime and safety' 'storm surge risk'
'survey of doctorate recipients' 'survey of earned doctorates'
'survey of graduate students and postdoctorates in science and engineering'
'survey of industrial research and development'
'survey of state government research and development'
'teacher and principal survey'
'the national institute on aging genetics of alzheimer s disease data storage site'
'tide station' 'trends in international mathematics and science study'
'water level observation network']
#### dev pub_category:
['aging integrated database' 'baltimore longitudinal study of aging'
'breeding bird survey' 'high school longitudinal study'
'international best track archive for climate stewardship'
'optimum interpolation sea surface temperature'
'program for the international assessment of adult competencies'
'rsna international covid 19 open radiology database'
'seismic system comprehensive catalog'
'survey of science and engineering research facilities'
'world ocean database']
CV: 3 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database' 'alzheimers disease neuroimaging initiative'
'arms farm financial and crop production practices'
'baltimore longitudinal study of aging'
'beginning postsecondary students' 'breeding bird survey'
'census of agriculture'
'characterizing health associated risks and your baseline disease in sars cov 2'
'coastal change analysis program'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 genome sequences'
'covid 19 image data collection' 'covid 19 open research dataset'
'covid 19 our world in data' 'early childhood longitudinal study'
'ffrdc research and development survey' 'high school longitudinal study'
'international best track archive for climate stewardship'
'jh crown registry' 'optimum interpolation sea surface temperature'
'program for the international assessment of adult competencies'
'rsna international covid 19 open radiology database'
'seismic system comprehensive catalog' 'storm surge risk'
'survey of doctorate recipients' 'survey of earned doctorates'
'survey of graduate students and postdoctorates in science and engineering'
'survey of science and engineering research facilities'
'teacher and principal survey' 'tide station'
'trends in international mathematics and science study'
'water level observation network' 'world ocean database']
#### dev pub_category:
['agricultural resources management survey'
'baccalaureate and beyond longitudinal study'
'cas covid 19 antiviral candidate compounds dataset'
'common core of data' 'education longitudinal study'
'higher education research and development survey'
'national assessment of education progress' 'rural urban continuum codes'
'school survey on crime and safety'
'survey of industrial research and development'
'survey of state government research and development'
'the national institute on aging genetics of alzheimer s disease data storage site']
CV: 4 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database' 'agricultural resources management survey'
'alzheimers disease neuroimaging initiative'
'arms farm financial and crop production practices'
'baccalaureate and beyond longitudinal study'
'baltimore longitudinal study of aging' 'breeding bird survey'
'cas covid 19 antiviral candidate compounds dataset'
'characterizing health associated risks and your baseline disease in sars cov 2'
'coastal change analysis program' 'common core of data'
'covid 19 genome sequences' 'covid 19 our world in data'
'early childhood longitudinal study' 'education longitudinal study'
'ffrdc research and development survey' 'high school longitudinal study'
'higher education research and development survey'
'international best track archive for climate stewardship'
'national assessment of education progress'
'optimum interpolation sea surface temperature'
'program for the international assessment of adult competencies'
'rsna international covid 19 open radiology database'
'rural urban continuum codes' 'school survey on crime and safety'
'seismic system comprehensive catalog' 'storm surge risk'
'survey of earned doctorates'
'survey of industrial research and development'
'survey of science and engineering research facilities'
'survey of state government research and development'
'teacher and principal survey'
'the national institute on aging genetics of alzheimer s disease data storage site'
'water level observation network' 'world ocean database']
#### dev pub_category:
['beginning postsecondary students' 'census of agriculture'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 image data collection'
'covid 19 open research dataset' 'jh crown registry'
'survey of doctorate recipients'
'survey of graduate students and postdoctorates in science and engineering'
'tide station' 'trends in international mathematics and science study']
CV: 5 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database' 'agricultural resources management survey'
'alzheimers disease neuroimaging initiative'
'baccalaureate and beyond longitudinal study'
'baltimore longitudinal study of aging'
'beginning postsecondary students' 'breeding bird survey'
'cas covid 19 antiviral candidate compounds dataset'
'census of agriculture' 'common core of data'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 image data collection'
'covid 19 open research dataset' 'education longitudinal study'
'high school longitudinal study'
'higher education research and development survey'
'international best track archive for climate stewardship'
'jh crown registry' 'national assessment of education progress'
'optimum interpolation sea surface temperature'
'program for the international assessment of adult competencies'
'rsna international covid 19 open radiology database'
'rural urban continuum codes' 'school survey on crime and safety'
'seismic system comprehensive catalog' 'survey of doctorate recipients'
'survey of graduate students and postdoctorates in science and engineering'
'survey of industrial research and development'
'survey of science and engineering research facilities'
'survey of state government research and development'
'the national institute on aging genetics of alzheimer s disease data storage site'
'tide station' 'trends in international mathematics and science study'
'world ocean database']
#### dev pub_category:
['arms farm financial and crop production practices'
'characterizing health associated risks and your baseline disease in sars cov 2'
'coastal change analysis program' 'covid 19 genome sequences'
'covid 19 our world in data' 'early childhood longitudinal study'
'ffrdc research and development survey' 'storm surge risk'
'survey of earned doctorates' 'teacher and principal survey'
'water level observation network']
@qllolollp Regarding the concern that the contents of pub_category might be wrong: I checked in the notebook below and could not find anything wrong. What do you think? https://www.kaggle.com/riow1983/kagglenb011-check-cv-data
It looks like a paper with the same Id that has multiple labels can end up in multiple folds.
You can check this with:
cv_folds[cv_folds.duplicated(subset="Id") & ~(cv_folds.duplicated(subset=["Id", "fold"]))]
(for example, "7dd31a80-2389-4c66-a041-29367e109f87")
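To make the check above concrete, here is a minimal sketch on a toy table. The rows and Ids are hypothetical; only the filtering expression is taken from the comment. The `groupby` view is an alternative I am adding for illustration, and it should yield the same set of Ids.

```python
import pandas as pd

# Toy fold table (hypothetical Ids): paper "a" has two labels that landed in different folds.
cv_folds = pd.DataFrame({
    "Id": ["a", "a", "b", "b", "c"],
    "cleaned_label": ["adni", "noaa", "adni", "adni2", "slosh"],
    "fold": [1, 2, 3, 3, 4],
})

# Same expression as in the comment: a repeated Id that is new for its (Id, fold) pair.
crossing = cv_folds[cv_folds.duplicated(subset="Id") & ~(cv_folds.duplicated(subset=["Id", "fold"]))]

# Alternative view: count the distinct folds each Id appears in.
folds_per_id = cv_folds.groupby("Id")["fold"].nunique()
leaky_ids = folds_per_id[folds_per_id > 1].index.tolist()
```

Paper "b" also has two rows, but both sit in fold 3, so it is not reported.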
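Before splitting for CV, for papers in train.csv that have multiple rows under the same Id (= multiple cleaned_label values), I'll try adding a step that either aggregates them into a single row or maps them to an intermediate category.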
@qllolollp The processing I announced above is done. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 3: https://www.kaggle.com/riow1983/nb009-cv
df["pub_title"].nunique() < df["Id"].nunique()
Since this holds, making rows unique by pub_title looks better than making them unique by Id, so I'll change it.
The change above is done. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 4: https://www.kaggle.com/riow1983/nb009-cv
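One way to collapse the data to one row per pub_title might look like the sketch below. It is an assumption-laden illustration, not the notebook's code: the toy rows and the helper `join_unique` are hypothetical, and the separators ("|" for cleaned_label, " + " for pub_category) are simply the ones that appear elsewhere in this thread.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows).
df = pd.DataFrame({
    "Id": ["id1", "id2", "id3", "id4"],
    "pub_title": ["paper A", "paper A", "paper B", "paper C"],
    "cleaned_label": ["adni", "noaa tide station", "adni", "slosh"],
    "pub_category": ["adni", "noaa", "adni", "slosh"],
})


def join_unique(values, sep):
    # Keep each value once, in sorted order, so the joined string is deterministic.
    return sep.join(sorted(set(values)))


# One row per pub_title, with all labels and categories of that title joined.
per_title = df.groupby("pub_title", as_index=False).agg(
    cleaned_label=("cleaned_label", lambda s: join_unique(s, "|")),
    pub_category=("pub_category", lambda s: join_unique(s, " + ")),
)
```

After this step every pub_title has exactly one pub_category, so GroupKFold can no longer place the same title in two folds.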
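@qllolollp
A pub_title that has multiple pub_category values currently gets a category such as
"adni + noaa + slosh"
My idea is to convert all of its combinations
"adni",
"noaa",
"slosh",
"adni + noaa",
"adni + slosh",
"noaa + slosh",
"adni + noaa + slosh"
so that they fall into a single category (for example "adni + noaa + slosh") before cutting the CV folds. I think that would achieve the goal for now. What do you think?
If there were a pub_title with "adni + hoge", the category would become "adni + noaa + slosh + hoge", right?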
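I was worried that trying to take care of every paper with multiple labels would end up making one category far too large, but labels should differ across domains, so I'm starting to think it may split more cleanly than expected.
Also, for my own use, the stopgap of just removing the multi-label papers from the current test set hasn't caused much trouble.
@qllolollp
If there were a pub_title with "adni + hoge", the category would become "adni + noaa + slosh + hoge", right?
Yes, that's right. It's true that there is a side effect: a single category becomes broad, and at a glance it is hard to tell what topic it covers. But the goal for CV (preventing papers with the same label from appearing in both train and valid) should be achieved, so I'd like to try it.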
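The merging rule agreed on above (any categories that share a member are folded into one, transitively) can be sketched as below. The function name `merge_overlapping` is hypothetical and this is not the notebook's implementation; note also that this sketch orders members alphabetically, so the merged name reads "adni + hoge + noaa + slosh" rather than the order written in the comment.

```python
def merge_overlapping(categories, sep=" + "):
    """Map each combined category to the union of all categories it overlaps with."""
    groups = []  # disjoint sets of base categories
    for cat in categories:
        members = set(cat.split(sep))
        # Absorb every existing group that shares at least one base category.
        overlapping = [g for g in groups if g & members]
        for g in overlapping:
            members |= g
            groups.remove(g)
        groups.append(members)

    # Every base category points at the name of the group it ended up in.
    lookup = {}
    for g in groups:
        merged = sep.join(sorted(g))
        for base in g:
            lookup[base] = merged
    return {cat: lookup[cat.split(sep)[0]] for cat in categories}


mapping = merge_overlapping(
    ["adni + noaa + slosh", "adni", "noaa + slosh", "adni + hoge", "tide station"]
)
```

Because merging is transitive, a single bridging paper is enough to fuse two otherwise separate groups, which is why the number of groups can shrink sharply.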
@qllolollp The change above is done. The number of pub_category values (= groups) dropped sharply to 12. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 5: https://www.kaggle.com/riow1983/nb009-cv
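In some folds the number of valid observations fell to around 50, so version 5 cannot be called done. With version 4, a satisfactory number of observations can be secured even after post-processing that drops valid-side instances whose cleaned_label also appeared in train, so I will revert to it for now.
@qllolollp Sorry, I wanted to revert to version 4, add the post-processing (dropping instances from valid when their cleaned_label appeared in train), and share the result as a dataset. But when I ran this processing myself, valid was cut down heavily in some folds: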
fold 1
len(train): 14054
len(valid): 217
fold 2
len(train): 12998
len(valid): 1273
fold 3
len(train): 14252
len(valid): 19
fold 4
len(train): 13625
len(valid): 646
fold 5
len(train): 11653
len(valid): 2618
The processing I added is the following function:
import pandas as pd
from tqdm import tqdm


def post_process(df, drop=False):
    for i in range(5):
        train = df[df["fold"] != i+1]
        dev = df[df["fold"] == i+1].copy()  # copy so the assignment below does not warn

        # Collect every cleaned_label observed on the train side
        obs_labels_train = set()
        for _, row in tqdm(train.iterrows(), desc=f"processing for fold {i+1}..."):
            for cl in row["cleaned_label"].split("|"):
                obs_labels_train.add(cl)

        # Mark dev rows that share at least one label with train.
        # NOTE: this overwrites cleaned_label, so rows kept with drop=False carry "to_train" in later folds.
        dev["cleaned_label"] = dev["cleaned_label"].apply(lambda x: "to_train" if len(set(x.split("|")).intersection(obs_labels_train))>0 else x)
        real_dev = dev[dev["cleaned_label"]!="to_train"].reset_index(drop=True)
        dev2train = dev[dev["cleaned_label"]=="to_train"].reset_index(drop=True)
        dev2train["fold"] = dev2train["fold"]+100  # update fold number

        print(f"#### fold {i+1} ... {100*len(dev2train)/len(df):.1f} % of observations are rejected as dev.")
        if drop:
            print(f"len(train): {len(train)}, len(dev): {len(real_dev)}")
            df = pd.concat([train, real_dev], axis=0, ignore_index=True)
        else:
            print(f"len(train): {len(train)+len(dev2train)}, len(dev): {len(real_dev)}")
            df = pd.concat([train, dev2train, real_dev], axis=0, ignore_index=True)
    return df


folds = post_process(folds, drop=False)
With drop=False, instances rejected from the dev side are returned to the train side. Incidentally, to discard them instead of returning them to train, run with drop=True. The result is:
fold 1
len(train): 4556
len(valid): 217
fold 2
len(train): 3500
len(valid): 1273
fold 3
len(train): 4754
len(valid): 19
fold 4
len(train): 4127
len(valid): 646
fold 5
len(train): 2155
len(valid): 2618
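Either way, it seems wrong to me that valid gets cut this much. I'd appreciate any advice.
@qllolollp @Toru-Ito1 Because of the post-processing problem above, I have skipped it for now and reverted the Kaggle Dataset "nb009" to the equivalent of version 4. That is version 6.
Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 6: https://www.kaggle.com/riow1983/nb009-cv
Sorry, when I said I was removing them as a stopgap, I was confusing this with the problem below (which itself was solved quite a while ago). I had not dealt with the problem of the same label appearing in both train and valid either.
It looks like a paper with the same Id that has multiple labels can end up in multiple folds.
You can check this with:
cv_folds[cv_folds.duplicated(subset="Id") & ~(cv_folds.duplicated(subset=["Id", "fold"]))]
(for example, "7dd31a80-2389-4c66-a041-29367e109f87")
@qllolollp I see, understood. Until I think of something I'll leave it at version 6 (= version 4), but what to do next is an open question.
@qllolollp Sorry to bother you again. If any ideas come to mind, please let me know.
I've been trying a few things myself. The counts don't match the ones Horiuchi-san listed above, but I found the following.
fold=1 contains rows whose cleaned_label is "adni" (most of fold=1), and fold=3 contains "baltimore longitudinal study of aging blsa |adni", so I think it can't be helped that valid for fold=1 ends up nearly empty.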
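The effect described above can be reproduced on a toy table. The rows below are hypothetical and only mirror the situation in the comment; the helper `split_labels` strips spaces around each "|"-separated label, which is an assumption on my part and slightly different from the `post_process` function earlier in the thread.

```python
import pandas as pd

# Toy folds (hypothetical rows): "adni" fills fold 1 and also appears inside a fold 3 label.
folds = pd.DataFrame({
    "cleaned_label": [
        "adni", "adni", "adni",
        "baltimore longitudinal study of aging blsa |adni",
        "tide station",
    ],
    "fold": [1, 1, 1, 3, 2],
})


def split_labels(value):
    # Labels are "|"-separated; strip stray spaces around each one.
    return {label.strip() for label in value.split("|")}


fold = 1
train = folds[folds["fold"] != fold]
valid = folds[folds["fold"] == fold]

# Every label seen on the train side, then the valid rows that share one of them.
train_labels = set().union(*train["cleaned_label"].map(split_labels))
leaks = valid["cleaned_label"].map(lambda v: bool(split_labels(v) & train_labels))
clean_valid = valid[~leaks]
```

Here a single multi-label row in fold 3 is enough to reject every "adni" row in fold 1, which leaves `clean_valid` empty.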
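In that sense Version 5 does achieve complete separation. The counts are badly balanced (8,132/4,968/1,065/53/53), but we don't have time to run all 5 folds anyway. So how about, for example, a single run that uses the 1,065 rows of fold=3 as valid to measure accuracy on unseen data?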
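The single hold-out proposed above amounts to selecting one fold as valid and using the rest as train. A minimal sketch, with a hypothetical toy table standing in for the shared folds file and a hypothetical constant `VALID_FOLD`:

```python
import pandas as pd

# Toy fold assignment (hypothetical rows); the real table is the shared nb009-cv dataset.
folds = pd.DataFrame({
    "pub_title": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "fold": [1, 1, 2, 3, 3, 4],
})

VALID_FOLD = 3  # the fold proposed above as the single validation set

# Hold out one fold; everything else is used for training.
valid = folds[folds["fold"] == VALID_FOLD].reset_index(drop=True)
train = folds[folds["fold"] != VALID_FOLD].reset_index(drop=True)
```

A single split gives no variance estimate across folds, which is the trade-off being accepted here for lack of time.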
That may be the realistic option. @Toru-Ito1 If there are no objections, I'll revert to version 5.
No objections. Thank you.
@qllolollp @Toru-Ito1 Understood. I'll revert to version 5 and follow up later.
@qllolollp @Toru-Ito1 I have reverted the Kaggle Dataset "nb009" to the equivalent of version 5. That is version 7.
Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb
Dataset version 7: https://www.kaggle.com/riow1983/nb009-cv
The train/valid breakdown per fold for version 7 is as follows:
fold 1
len(train): 6139
len(valid): 8132
fold 2
len(train): 9303
len(valid): 4968
fold 3
len(train): 13206
len(valid): 1065
fold 4
len(train): 14218
len(valid): 53
fold 5
len(train): 14218
len(valid): 53
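As a quick consistency check on the listing above, the numbers can be copied into a small script. Only the sizes come from the thread; the variable names are illustrative.

```python
# (train, valid) sizes per fold, copied from the listing above.
sizes = {
    1: (6139, 8132),
    2: (9303, 4968),
    3: (13206, 1065),
    4: (14218, 53),
    5: (14218, 53),
}

# train + valid for each fold, and the sum of all valid sizes.
totals = {fold: train + valid for fold, (train, valid) in sizes.items()}
valid_total = sum(valid for _, valid in sizes.values())

# Share of each fold's valid set in percent, to show how unbalanced the folds are.
valid_share = {fold: round(100 * valid / totals[fold], 1) for fold, (_, valid) in sizes.items()}
```

Every fold sums to 14271 rows, and the five valid sets together also add up to 14271, so each row is used as valid exactly once.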
For reference, here is the category breakdown for folds 2 and 3. The dev side of fold 3 contains only COVID-19 related categories.
CV: 2 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database'
'agricultural resources management survey + arms farm financial and crop production practices + baccalaureate and beyond longitudinal study + beginning postsecondary students + breeding bird survey + census of agriculture + coastal change analysis program + common core of data + early childhood longitudinal study + education longitudinal study + ffrdc research and development survey + high school longitudinal study + higher education research and development survey + international best track archive for climate stewardship + national assessment of education progress + optimum interpolation sea surface temperature + program for the international assessment of adult competencies + rural urban continuum codes + school survey on crime and safety + storm surge risk + survey of doctorate recipients + survey of earned doctorates + survey of graduate students and postdoctorates in science and engineering + survey of industrial research and development + survey of science and engineering research facilities + survey of state government research and development + teacher and principal survey + tide station + trends in international mathematics and science study + water level observation network + world ocean database'
'cas covid 19 antiviral candidate compounds dataset'
'characterizing health associated risks and your baseline disease in sars cov 2'
'complexity science hub covid 19 control strategies list'
'covid 19 death data'
'covid 19 genome sequences + covid 19 open research dataset + covid 19 our world in data'
'covid 19 image data collection' 'jh crown registry'
'rsna international covid 19 open radiology database'
'seismic system comprehensive catalog']
#### dev pub_category:
['alzheimers disease neuroimaging initiative + baltimore longitudinal study of aging + the national institute on aging genetics of alzheimer s disease data storage site']
CV: 3 -------------------------------------------------------------------
#### train pub_category:
['aging integrated database'
'agricultural resources management survey + arms farm financial and crop production practices + baccalaureate and beyond longitudinal study + beginning postsecondary students + breeding bird survey + census of agriculture + coastal change analysis program + common core of data + early childhood longitudinal study + education longitudinal study + ffrdc research and development survey + high school longitudinal study + higher education research and development survey + international best track archive for climate stewardship + national assessment of education progress + optimum interpolation sea surface temperature + program for the international assessment of adult competencies + rural urban continuum codes + school survey on crime and safety + storm surge risk + survey of doctorate recipients + survey of earned doctorates + survey of graduate students and postdoctorates in science and engineering + survey of industrial research and development + survey of science and engineering research facilities + survey of state government research and development + teacher and principal survey + tide station + trends in international mathematics and science study + water level observation network + world ocean database'
'alzheimers disease neuroimaging initiative + baltimore longitudinal study of aging + the national institute on aging genetics of alzheimer s disease data storage site'
'cas covid 19 antiviral candidate compounds dataset'
'characterizing health associated risks and your baseline disease in sars cov 2'
'complexity science hub covid 19 control strategies list'
'covid 19 death data' 'covid 19 image data collection'
'jh crown registry' 'rsna international covid 19 open radiology database'
'seismic system comprehensive catalog']
#### dev pub_category:
['covid 19 genome sequences + covid 19 open research dataset + covid 19 our world in data']
[Reference: all 130 cleaned_label values from train.csv]
reference