riow1983 / Kaggle-Coleridge-Initiative

Create a GroupKFold CV split on train.csv, using the categorized cleaned_label as the group #9

Closed riow1983 closed 3 years ago

riow1983 commented 3 years ago

[Reference: all 130 cleaned_label values from train.csv]

0 national education longitudinal study
1 noaa tidal station
2 slosh model
3 noaa c cap
4 aging integrated database agid 
5 alzheimers disease neuroimaging initiative
6 aging integrated database
7 noaa national water level observation network
8 noaa water level station
9 baltimore longitudinal study of aging blsa 
10 national water level observation network
11 arms farm financial and crop production practices
12 beginning postsecondary student
13 noaa sea lake and overland surges from hurricanes
14 noaa tide gauge
15 the national institute on aging genetics of alzheimer s disease data storage site
16 national center for education statistics common core of data
17 national science foundation survey of industrial research and development
18 baccalaureate and beyond
19 noaa international best track archive for climate stewardship
20 agricultural resource management survey
21 national teacher and principal survey
22 international best track archive for climate stewardship
23 nsf higher education research and development survey
24 national science foundation survey of earned doctorates
25 school survey on crime and safety
26 the national institute on aging genetics of alzheimer s disease data storage site niagads 
27 national oceanic and atmospheric administration world ocean database
28 beginning postsecondary students longitudinal study
29 nces common core of data
30 program for the international assessment of adult competencies
31 survey of earned doctorates
32 baltimore longitudinal study of aging
33 early childhood longitudinal study
34 adni
35 national science foundation survey of graduate students and postdoctorates in science and engineering
36 trends in international mathematics and science study
37 national oceanic and atmospheric administration c cap
38 nsf survey of earned doctorates
39 noaa tide station
40 education longitudinal study
41 optimum interpolation sea surface temperature
42 national oceanic and atmospheric administration optimum interpolation sea surface temperature
43 alzheimer s disease neuroimaging initiative adni 
44 baccalaureate and beyond longitudinal study
45 agricultural resources management survey
46 beginning postsecondary students
47 ibtracs
48 coastal change analysis program
49 survey of graduate students and postdoctorates in science and engineering
50 national assessment of education progress
51 sea surface temperature optimum interpolation
52 high school longitudinal study
53 nsf survey of graduate students and postdoctorates in science and engineering
54 national science foundation survey of doctorate recipients
55 survey of doctorate recipients
56 coastal change analysis program land cover
57 survey of industrial research and development
58 world ocean database
59 rural urban continuum codes
60 noaa optimum interpolation sea surface temperature
61 noaa world ocean database
62 common core of data
63 higher education research and development survey
64 noaa storm surge inundation
65 national weather service nws storm surge risk
66 survey of science and engineering research facilities
67 nsf survey of industrial research and development
68 national science foundation survey of science and engineering research facilities
69 national science foundation higher education research and development survey
70 national center for science and engineering statistics survey of earned doctorates
71 national center for science and engineering statistics survey of science and engineering research facilities
72 national center for science and engineering statistics survey of graduate students and postdoctorates in science and engineering
73 national center for science and engineering statistics survey of doctorate recipients
74 national center for science and engineering statistics survey of industrial research and development
75 national center for science and engineering statistics higher education research and development survey
76 nsf survey of science and engineering research facilities
77 ffrdc research and development survey
78 nsf ffrdc research and development survey
79 survey of state government research and development
80 ncses survey of doctorate recipients
81 ncses survey of graduate students and postdoctorates in science and engineering
82 anss comprehensive earthquake catalog
83 anss comprehensive catalog
84 advanced national seismic system anss comprehensive catalog comcat 
85 advanced national seismic system comprehensive catalog
86 census of agriculture
87 usda census of agriculture
88 nass census of agriculture
89 north american breeding bird survey
90 north american breeding bird survey bbs 
91 usgs north american breeding bird survey
92 covid 19 open research dataset cord 19 
93 covid 19 open research dataset
94 covid open research dataset
95 covid 19 open research data
96 complexity science hub covid 19 control strategies list cccsl 
97 complexity science hub covid 19 control strategies list
98 cccsl
99 our world in data covid 19 dataset
100 our world in data covid 19
101 our world in data
102 jh crown registry
103 characterizing health associated risks and your baseline disease in sars cov 2 charybdis 
104 characterizing health associated risks and your baseline disease in sars cov 2
105 covid 19 death data
106 sars cov 2 genome sequence
107 sars cov 2 genome sequences
108 covid 19 genome sequence
109 covid 19 genome sequences
110 2019 ncov genome sequence
111 2019 ncov genome sequences
112 sars cov 2 full genome sequence
113 sars cov 2 full genome sequences
114 sars cov 2 complete genome sequence
115 sars cov 2 complete genome sequences
116 2019 ncov complete genome sequences
117 genome sequences of sars cov 2
118 genome sequence of sars cov 2
119 genome sequence of covid 19
120 genome sequences of covid 19
121 genome sequence of 2019 ncov
122 genome sequences of 2019 ncov
123 covid 19 image data collection
124 rsna international covid 19 open radiology database ricord 
125 rsna international covid 19 open radiology database
126 rsna international covid open radiology database
127 cas covid 19 antiviral candidate compounds dataset
128 cas covid 19 antiviral candidate compounds data set
129 cas covid 19 antiviral candidate compounds data

reference

riow1983 commented 3 years ago

Split with cleaned_label as the target label and pub_category (cleaned_label grouped into categories) as the group.

import pandas as pd
from sklearn.model_selection import GroupKFold

def get_cv(dataset, num_splits=None, col_target=None, col_group=None):
    """
    Args:
        dataset: pd.DataFrame
        num_splits: int
        col_target: str
        col_group: str
    Returns:
        folds: pd.DataFrame
    """
    X = dataset.index.values
    y = dataset[col_target].values
    groups = dataset[col_group].values

    group_kfold = GroupKFold(n_splits=num_splits)

    parts = []
    for i, (_, test_index) in enumerate(group_kfold.split(X, y, groups)):
        X_test = X[test_index]
        # Copy the held-out rows so that adding "fold" does not write into a slice of dataset
        X_test = dataset[dataset.index.isin(X_test)].copy()
        X_test["fold"] = i+1  # folds are numbered from 1
        parts.append(X_test)

    # Concat all and save at once
    folds = pd.concat(parts, ignore_index=True)
    return folds

folds = get_cv(df, num_splits=5, col_target="cleaned_label", col_group="pub_category")

riow1983 commented 3 years ago

@qllolollp @Toru-Ito1 The CV data is ready, at least as a first pass. https://www.kaggle.com/riow1983/nb009-cv?select=folds_pubcat.pkl

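The construction logic is as discussed at the last regular meeting:
1) Categorize the labels by visual inspection.
2) Compute cosine similarity between the manually assigned categories.
3) Connect categories whose cosine similarity is at or above the 0.95 threshold with an edge to form a graph.
4) Adopt each set of edge-connected categories as a single category. (The plan is to join multiple manual categories with "+", but because the threshold is as high as 0.95, the result does not appear to differ from the manual categories.)
5) Run GroupKFold with the category as the group.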
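
Steps 2 to 4 above could be sketched roughly as follows. This is only an illustration, not the code in the notebook: the word-count vectorization, the helper names `cosine` and `merge_categories`, and the toy category list are all assumptions. Only the 0.95 threshold and the " + " joining come from the description.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two strings using word counts (an assumed vectorization)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def merge_categories(categories, threshold=0.95):
    """Link categories with similarity >= threshold; return one merged name per component."""
    n = len(categories)
    # Adjacency list of the similarity graph
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(categories[i], categories[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)

    mapping, seen = {}, set()
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        merged = " + ".join(sorted(categories[k] for k in component))
        for k in component:
            mapping[categories[k]] = merged
    return mapping


# Toy input (hypothetical manual categories)
cats = ["census of agriculture", "agriculture census of", "world ocean database"]
mapping = merge_categories(cats)
```

With a threshold this high, only near-identical categories are linked, which is consistent with the observation that the merged categories barely differ from the manual ones.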

For details of the construction logic, see the following notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb

I will close this if there are no problems, so I look forward to your feedback.

Excerpt of the results:

CV: 1 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database' 'agricultural resources management survey'
 'arms farm financial and crop production practices'
 'baccalaureate and beyond longitudinal study'
 'baltimore longitudinal study of aging'
 'beginning postsecondary students' 'breeding bird survey'
 'cas covid 19 antiviral candidate compounds dataset'
 'census of agriculture'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'coastal change analysis program' 'common core of data'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 genome sequences'
 'covid 19 image data collection' 'covid 19 open research dataset'
 'covid 19 our world in data' 'early childhood longitudinal study'
 'education longitudinal study' 'ffrdc research and development survey'
 'high school longitudinal study'
 'higher education research and development survey'
 'international best track archive for climate stewardship'
 'jh crown registry' 'national assessment of education progress'
 'optimum interpolation sea surface temperature'
 'program for the international assessment of adult competencies'
 'rsna international covid 19 open radiology database'
 'rural urban continuum codes' 'school survey on crime and safety'
 'seismic system comprehensive catalog' 'storm surge risk'
 'survey of doctorate recipients' 'survey of earned doctorates'
 'survey of graduate students and postdoctorates in science and engineering'
 'survey of industrial research and development'
 'survey of science and engineering research facilities'
 'survey of state government research and development'
 'teacher and principal survey'
 'the national institute on aging genetics of alzheimer s disease data storage site'
 'tide station' 'trends in international mathematics and science study'
 'water level observation network' 'world ocean database']

#### dev pub_category:

 ['alzheimers disease neuroimaging initiative']

CV: 2 -------------------------------------------------------------------

#### train pub_category:

 ['agricultural resources management survey'
 'alzheimers disease neuroimaging initiative'
 'arms farm financial and crop production practices'
 'baccalaureate and beyond longitudinal study'
 'beginning postsecondary students'
 'cas covid 19 antiviral candidate compounds dataset'
 'census of agriculture'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'coastal change analysis program' 'common core of data'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 genome sequences'
 'covid 19 image data collection' 'covid 19 open research dataset'
 'covid 19 our world in data' 'early childhood longitudinal study'
 'education longitudinal study' 'ffrdc research and development survey'
 'higher education research and development survey' 'jh crown registry'
 'national assessment of education progress' 'rural urban continuum codes'
 'school survey on crime and safety' 'storm surge risk'
 'survey of doctorate recipients' 'survey of earned doctorates'
 'survey of graduate students and postdoctorates in science and engineering'
 'survey of industrial research and development'
 'survey of state government research and development'
 'teacher and principal survey'
 'the national institute on aging genetics of alzheimer s disease data storage site'
 'tide station' 'trends in international mathematics and science study'
 'water level observation network']

#### dev pub_category:

 ['aging integrated database' 'baltimore longitudinal study of aging'
 'breeding bird survey' 'high school longitudinal study'
 'international best track archive for climate stewardship'
 'optimum interpolation sea surface temperature'
 'program for the international assessment of adult competencies'
 'rsna international covid 19 open radiology database'
 'seismic system comprehensive catalog'
 'survey of science and engineering research facilities'
 'world ocean database']

CV: 3 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database' 'alzheimers disease neuroimaging initiative'
 'arms farm financial and crop production practices'
 'baltimore longitudinal study of aging'
 'beginning postsecondary students' 'breeding bird survey'
 'census of agriculture'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'coastal change analysis program'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 genome sequences'
 'covid 19 image data collection' 'covid 19 open research dataset'
 'covid 19 our world in data' 'early childhood longitudinal study'
 'ffrdc research and development survey' 'high school longitudinal study'
 'international best track archive for climate stewardship'
 'jh crown registry' 'optimum interpolation sea surface temperature'
 'program for the international assessment of adult competencies'
 'rsna international covid 19 open radiology database'
 'seismic system comprehensive catalog' 'storm surge risk'
 'survey of doctorate recipients' 'survey of earned doctorates'
 'survey of graduate students and postdoctorates in science and engineering'
 'survey of science and engineering research facilities'
 'teacher and principal survey' 'tide station'
 'trends in international mathematics and science study'
 'water level observation network' 'world ocean database']

#### dev pub_category:

 ['agricultural resources management survey'
 'baccalaureate and beyond longitudinal study'
 'cas covid 19 antiviral candidate compounds dataset'
 'common core of data' 'education longitudinal study'
 'higher education research and development survey'
 'national assessment of education progress' 'rural urban continuum codes'
 'school survey on crime and safety'
 'survey of industrial research and development'
 'survey of state government research and development'
 'the national institute on aging genetics of alzheimer s disease data storage site']

CV: 4 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database' 'agricultural resources management survey'
 'alzheimers disease neuroimaging initiative'
 'arms farm financial and crop production practices'
 'baccalaureate and beyond longitudinal study'
 'baltimore longitudinal study of aging' 'breeding bird survey'
 'cas covid 19 antiviral candidate compounds dataset'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'coastal change analysis program' 'common core of data'
 'covid 19 genome sequences' 'covid 19 our world in data'
 'early childhood longitudinal study' 'education longitudinal study'
 'ffrdc research and development survey' 'high school longitudinal study'
 'higher education research and development survey'
 'international best track archive for climate stewardship'
 'national assessment of education progress'
 'optimum interpolation sea surface temperature'
 'program for the international assessment of adult competencies'
 'rsna international covid 19 open radiology database'
 'rural urban continuum codes' 'school survey on crime and safety'
 'seismic system comprehensive catalog' 'storm surge risk'
 'survey of earned doctorates'
 'survey of industrial research and development'
 'survey of science and engineering research facilities'
 'survey of state government research and development'
 'teacher and principal survey'
 'the national institute on aging genetics of alzheimer s disease data storage site'
 'water level observation network' 'world ocean database']

#### dev pub_category:

 ['beginning postsecondary students' 'census of agriculture'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 image data collection'
 'covid 19 open research dataset' 'jh crown registry'
 'survey of doctorate recipients'
 'survey of graduate students and postdoctorates in science and engineering'
 'tide station' 'trends in international mathematics and science study']

CV: 5 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database' 'agricultural resources management survey'
 'alzheimers disease neuroimaging initiative'
 'baccalaureate and beyond longitudinal study'
 'baltimore longitudinal study of aging'
 'beginning postsecondary students' 'breeding bird survey'
 'cas covid 19 antiviral candidate compounds dataset'
 'census of agriculture' 'common core of data'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 image data collection'
 'covid 19 open research dataset' 'education longitudinal study'
 'high school longitudinal study'
 'higher education research and development survey'
 'international best track archive for climate stewardship'
 'jh crown registry' 'national assessment of education progress'
 'optimum interpolation sea surface temperature'
 'program for the international assessment of adult competencies'
 'rsna international covid 19 open radiology database'
 'rural urban continuum codes' 'school survey on crime and safety'
 'seismic system comprehensive catalog' 'survey of doctorate recipients'
 'survey of graduate students and postdoctorates in science and engineering'
 'survey of industrial research and development'
 'survey of science and engineering research facilities'
 'survey of state government research and development'
 'the national institute on aging genetics of alzheimer s disease data storage site'
 'tide station' 'trends in international mathematics and science study'
 'world ocean database']

#### dev pub_category:

 ['arms farm financial and crop production practices'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'coastal change analysis program' 'covid 19 genome sequences'
 'covid 19 our world in data' 'early childhood longitudinal study'
 'ffrdc research and development survey' 'storm surge risk'
 'survey of earned doctorates' 'teacher and principal survey'
 'water level observation network']
riow1983 commented 3 years ago

@qllolollp Regarding the concern that the contents of pub_category might be wrong: I checked in the notebook below and could not find anything wrong. What do you think? https://www.kaggle.com/riow1983/kagglenb011-check-cv-data

qllolollp commented 3 years ago

When a paper with the same Id has multiple labels, it can end up in more than one fold.
You can check this with cv_folds[cv_folds.duplicated(subset="Id") & ~(cv_folds.duplicated(subset=["Id", "fold"]))].
(For example, "7dd31a80-2389-4c66-a041-29367e109f87" is one such Id.)
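
Another way to list the affected Ids, which should yield the same set as the expression above, is to count the distinct folds per Id. The table below is a toy stand-in with hypothetical Ids, not the real cv_folds data.

```python
import pandas as pd

# Toy folds table (hypothetical Ids): "a" appears in folds 1 and 2, "b" twice in fold 1
cv_folds = pd.DataFrame({
    "Id": ["a", "a", "b", "b", "c"],
    "cleaned_label": ["adni", "slosh model", "ibtracs", "adni", "cccsl"],
    "fold": [1, 2, 1, 1, 3],
})

# Number of distinct folds each Id falls into
folds_per_id = cv_folds.groupby("Id")["fold"].nunique()

# Ids that are spread over more than one fold
leaking_ids = folds_per_id[folds_per_id > 1].index.tolist()
print(leaking_ids)
```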

riow1983 commented 3 years ago

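Before doing the CV split, for Ids that have multiple rows in train.csv (that is, multiple cleaned_label values), I will try adding a step that either aggregates them into a single row or maps them to an intermediate category.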
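
The "aggregate into a single row" option could look something like this. It is a sketch on hypothetical rows, not the notebook's code; the "|" separator is assumed here because the post-processing function later in this thread splits cleaned_label on "|".

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows)
df = pd.DataFrame({
    "Id": ["a", "a", "b"],
    "pub_title": ["Paper A", "Paper A", "Paper B"],
    "cleaned_label": ["adni", "slosh model", "ibtracs"],
})

# One row per Id; labels are de-duplicated, sorted, and joined with "|"
agg = (
    df.groupby("Id", as_index=False)
      .agg(pub_title=("pub_title", "first"),
           cleaned_label=("cleaned_label", lambda s: "|".join(sorted(set(s)))))
)
print(agg)
```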

riow1983 commented 3 years ago

@qllolollp The processing announced above is done. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 3: https://www.kaggle.com/riow1983/nb009-cv

riow1983 commented 3 years ago

Since df["pub_title"].nunique() < df["Id"].nunique(), it looks better to deduplicate by pub_title rather than by Id, so I will change that.

riow1983 commented 3 years ago

The fix above is done. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 4: https://www.kaggle.com/riow1983/nb009-cv

riow1983 commented 3 years ago

@qllolollp A pub_title with multiple pub_category values currently gets a category such as "adni + noaa + slosh". Suppose all of these combinations

"adni", 
"noaa", 
"slosh", 
"adni + noaa", 
"adni + slosh", 
"noaa + slosh", 
"adni + noaa + slosh"

are mapped to a single category (for example "adni + noaa + slosh") before cutting the CV. I think that would achieve the goal for now. What do you think?
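
One way to picture this merging is as connected components over categories that co-occur in the same pub_title. The union-find sketch below is an illustration under that assumption; the function name `merge_cooccurring` and the toy input are hypothetical and not taken from the notebook.

```python
def merge_cooccurring(category_sets):
    """Map each category to the ' + '-joined name of its connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for cats in category_sets:
        cats = list(cats)
        if not cats:
            continue
        find(cats[0])  # register single-category titles too
        for other in cats[1:]:
            union(cats[0], other)

    # Group categories by their root, then name each group
    components = {}
    for cat in parent:
        components.setdefault(find(cat), []).append(cat)
    return {cat: " + ".join(sorted(members))
            for members in components.values() for cat in members}


# One entry per pub_title (toy data; "hoge" is just a placeholder name)
titles = [["adni", "noaa", "slosh"], ["adni", "hoge"], ["ibtracs"]]
merged = merge_cooccurring(titles)
print(merged["adni"])
```

Because merging is transitive, a single pub_title that bridges two otherwise separate groups joins them into one category, so the number of groups can shrink quickly.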

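qllolollp commented 3 years ago

If there were a pub_title with "adni + hoge", the category would become "adni + noaa + slosh + hoge", right?
I was worried that trying to handle every multi-label paper would end up making one category far too large, but since different fields should have different labels, I am starting to think it may split more cleanly than expected.

Also, for my own use, the stopgap of just removing multi-label papers from the current test set has not caused much trouble.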

riow1983 commented 3 years ago

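@qllolollp

If there were a pub_title with "adni + hoge", the category would become "adni + noaa + slosh + hoge", right?

Yes, that is what happens. It does have the side effect that one category becomes so broad that its topic is hard to tell at a glance. But the goal for CV (preventing papers with the same label from appearing in both train and valid) should be met, so I will give it a try.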

riow1983 commented 3 years ago

@qllolollp The fix above is done. The number of pub_category values (= groups) dropped sharply to 12. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 5: https://www.kaggle.com/riow1983/nb009-cv

riow1983 commented 3 years ago

@qllolollp The fix above is done. The number of pub_category values (= groups) dropped sharply to 12. Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 5: https://www.kaggle.com/riow1983/nb009-cv

In some folds the number of valid observations fell to around 50, so I cannot wrap this up with version 5. With version 4, I can apply post-processing that drops any valid instance whose cleaned_label also appears in train and still keep enough observations, so I will revert to it for now.

riow1983 commented 3 years ago

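@qllolollp Sorry, I want to revert to version 4, add post-processing (dropping instances from valid when their cleaned_label also appears in train), and share the result as a dataset. But when I ran this processing myself, valid was cut down considerably in some folds. The train/valid observation counts are: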

fold 1
len(train):  14054
len(valid):  217

fold 2
len(train):  12998
len(valid):  1273

fold 3
len(train):  14252
len(valid):  19

fold 4
len(train):  13625
len(valid):  646

fold 5
len(train):  11653
len(valid):  2618

The processing I added is the following function:

import pandas as pd
from tqdm import tqdm

def post_process(df, drop=False):
    for i in range(5):
        train = df[df["fold"] != i+1]
        # Copy so the column assignment below does not write into a slice of df
        dev = df[df["fold"] == i+1].copy()

        # Collect every individual label seen on the train side
        obs_labels_train = set()
        for _,row in tqdm(train.iterrows(), desc=f"processing for fold {i+1}..."):
            for cl in row["cleaned_label"].split("|"):
                obs_labels_train.add(cl)

        # Mark dev rows that share a label with train (note: this overwrites their cleaned_label)
        dev["cleaned_label"] = dev["cleaned_label"].apply(lambda x: "to_train" if len(set(x.split("|")).intersection(obs_labels_train))>0 else x)
        real_dev = dev[dev["cleaned_label"]!="to_train"].reset_index(drop=True)
        dev2train = dev[dev["cleaned_label"]=="to_train"].reset_index(drop=True)
        dev2train["fold"] = dev2train["fold"]+100 # update fold number
        # Format the ratio as a percentage
        print(f"#### fold {i+1} ... {len(dev2train)/len(df):.1%} of observations are rejected as dev.")
        if drop:
            print(f"len(train): {len(train)}, len(dev): {len(real_dev)}")
            df = pd.concat([train, real_dev], axis=0, ignore_index=True)
        else:
            print(f"len(train): {len(train)+len(dev2train)}, len(dev): {len(real_dev)}")
            df = pd.concat([train, dev2train, real_dev], axis=0, ignore_index=True)

    return df

folds = post_process(folds, drop=False)

With drop=False, instances that were rejected from the dev side are returned to the train side. To discard them instead of returning them to train, run with drop=True. The results are:

fold 1
len(train):  4556
len(valid):  217

fold 2
len(train):  3500
len(valid):  1273

fold 3
len(train):  4754
len(valid):  19

fold 4
len(train):  4127
len(valid):  646

fold 5
len(train):  2155
len(valid):  2618

Either way, it seems wrong to me that valid is cut down this much. I would appreciate any advice.

riow1983 commented 3 years ago

@qllolollp @Toru-Ito1 Because of the post-processing problem above, I am skipping it for now and have restored the Kaggle Dataset "nb009" to the equivalent of version 4. That is version 6.

Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 6: https://www.kaggle.com/riow1983/nb009-cv

qllolollp commented 3 years ago

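Sorry, when I said I was removing them as a stopgap, I was confusing this with the issue below (which was itself resolved quite a while ago). I have not been able to deal with the problem of the same label appearing in both train and valid either.

When a paper with the same Id has multiple labels, it can end up in more than one fold. You can check this with cv_folds[cv_folds.duplicated(subset="Id") & ~(cv_folds.duplicated(subset=["Id", "fold"]))]. (For example, "7dd31a80-2389-4c66-a041-29367e109f87" is one such Id.)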

riow1983 commented 3 years ago

@qllolollp I see, understood. I will stay on version 6 (= version 4) until something comes to mind, but what to do next is an open question.

riow1983 commented 3 years ago

@qllolollp Sorry to bother you again. Please let me know if any ideas come to mind.

qllolollp commented 3 years ago

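I have been trying a few things myself. My counts do not match the ones Horiuchi-san posted above, but I found the following.
fold=1 contains rows whose cleaned_label is "adni" (most of fold=1), and fold=3 contains "baltimore longitudinal study of aging blsa |adni", so it is probably unavoidable that valid for fold=1 ends up nearly empty.
In that sense, Version 5 does achieve complete separation. The counts are poorly balanced (8,132/4,968/1,065/53/53), but we do not have time to run all 5 folds anyway, so how about evaluating accuracy on unseen data with a single run, for example using the 1,065 rows of fold=3 as valid?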
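
A single-holdout check along these lines might look like the sketch below. The folds table is a toy example with hypothetical rows, and `split_labels` is a helper introduced only for illustration; the idea is simply to confirm that the chosen valid fold shares no label with the rest.

```python
import pandas as pd

# Toy folds table (hypothetical rows); cleaned_label may hold several labels joined by "|"
folds = pd.DataFrame({
    "cleaned_label": ["adni", "baltimore longitudinal study of aging|adni", "cccsl", "ibtracs"],
    "fold": [1, 1, 3, 2],
})


def split_labels(series):
    """Set of individual labels in a '|'-separated label column."""
    return {label.strip() for joined in series for label in joined.split("|")}


valid_fold = 3
train = folds[folds["fold"] != valid_fold]
valid = folds[folds["fold"] == valid_fold]

# Labels present on both sides; an empty set means valid only contains unseen labels
shared = split_labels(train["cleaned_label"]) & split_labels(valid["cleaned_label"])
print(len(train), len(valid), shared)
```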

riow1983 commented 3 years ago

That may be the realistic option. @Toru-Ito1 If there are no objections, I will revert to version 5.

Toru-Ito1 commented 3 years ago

No objections. Thank you.

riow1983 commented 3 years ago

@qllolollp @Toru-Ito1 Understood. I will revert to version 5 and share the details later.

riow1983 commented 3 years ago

@qllolollp @Toru-Ito1 I have restored the Kaggle Dataset "nb009" to the equivalent of version 5. That is version 7.

Notebook: https://github.com/riow1983/Kaggle-Coleridge-Initiative/blob/main/notebooks/nb009-cv.ipynb Dataset version 7: https://www.kaggle.com/riow1983/nb009-cv

The train/valid breakdown for each fold in version 7 is as follows:

fold 1
len(train):  6139
len(valid):  8132

fold 2
len(train):  9303
len(valid):  4968

fold 3
len(train):  13206
len(valid):  1065

fold 4
len(train):  14218
len(valid):  53

fold 5
len(train):  14218
len(valid):  53

For reference, here is the category breakdown for folds 2 and 3. The dev side of fold 3 contains only COVID-19-related categories.

CV: 2 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database'
 'agricultural resources management survey + arms farm financial and crop production practices + baccalaureate and beyond longitudinal study + beginning postsecondary students + breeding bird survey + census of agriculture + coastal change analysis program + common core of data + early childhood longitudinal study + education longitudinal study + ffrdc research and development survey + high school longitudinal study + higher education research and development survey + international best track archive for climate stewardship + national assessment of education progress + optimum interpolation sea surface temperature + program for the international assessment of adult competencies + rural urban continuum codes + school survey on crime and safety + storm surge risk + survey of doctorate recipients + survey of earned doctorates + survey of graduate students and postdoctorates in science and engineering + survey of industrial research and development + survey of science and engineering research facilities + survey of state government research and development + teacher and principal survey + tide station + trends in international mathematics and science study + water level observation network + world ocean database'
 'cas covid 19 antiviral candidate compounds dataset'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data'
 'covid 19 genome sequences + covid 19 open research dataset + covid 19 our world in data'
 'covid 19 image data collection' 'jh crown registry'
 'rsna international covid 19 open radiology database'
 'seismic system comprehensive catalog']

#### dev pub_category:

 ['alzheimers disease neuroimaging initiative + baltimore longitudinal study of aging + the national institute on aging genetics of alzheimer s disease data storage site']

CV: 3 -------------------------------------------------------------------

#### train pub_category:

 ['aging integrated database'
 'agricultural resources management survey + arms farm financial and crop production practices + baccalaureate and beyond longitudinal study + beginning postsecondary students + breeding bird survey + census of agriculture + coastal change analysis program + common core of data + early childhood longitudinal study + education longitudinal study + ffrdc research and development survey + high school longitudinal study + higher education research and development survey + international best track archive for climate stewardship + national assessment of education progress + optimum interpolation sea surface temperature + program for the international assessment of adult competencies + rural urban continuum codes + school survey on crime and safety + storm surge risk + survey of doctorate recipients + survey of earned doctorates + survey of graduate students and postdoctorates in science and engineering + survey of industrial research and development + survey of science and engineering research facilities + survey of state government research and development + teacher and principal survey + tide station + trends in international mathematics and science study + water level observation network + world ocean database'
 'alzheimers disease neuroimaging initiative + baltimore longitudinal study of aging + the national institute on aging genetics of alzheimer s disease data storage site'
 'cas covid 19 antiviral candidate compounds dataset'
 'characterizing health associated risks and your baseline disease in sars cov 2'
 'complexity science hub covid 19 control strategies list'
 'covid 19 death data' 'covid 19 image data collection'
 'jh crown registry' 'rsna international covid 19 open radiology database'
 'seismic system comprehensive catalog']

#### dev pub_category:

 ['covid 19 genome sequences + covid 19 open research dataset + covid 19 our world in data']