Open idekazuki opened 4 years ago
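Training worked under the conditions shown above, so the next step is to adapt the model for fine-tuning.
To begin with, I will adjust it so that only the verb classes are trained.
The final layer has the following structure.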
(head): ResNetBasicHead(
(pathway0_avgpool): AvgPool3d(kernel_size=[8, 7, 7], stride=1, padding=0)
(pathway1_avgpool): AvgPool3d(kernel_size=[32, 7, 7], stride=1, padding=0)
(dropout): Dropout(p=0.5, inplace=False)
(projection): Linear(in_features=2304, out_features=400, bias=True)
(act): Softmax(dim=4)
model.module.head.projection = nn.Linear(in_features=2304, out_features=125, bias=True)
self.projection = nn.Linear(sum(dim_in), num_classes, bias=True)
(verb: 125 classes, noun: 352 classes)
Inserting this somewhere should make it possible to train on verbs.
num_feature = model.module.head.projection.in_features
logger.info("#########################{}".format(num_feature))
model.module.head.projection = nn.Linear(num_feature, 125, bias=True)
logger.info("model{}".format(model))
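This is what kept it from working: I had forgotten to remove code I added earlier.
The checkpoint is probably loaded around here, so the model can be modified anywhere after this point.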
https://github.com/facebookresearch/SlowFast/blob/25a0f633b7/tools/train_net.py#L261
To start, I inserted the fine-tuning code at the following line: https://github.com/facebookresearch/SlowFast/blob/25a0f633b7/tools/train_net.py#L274
if cfg.TRAIN.DATASET == 'epic':
    logger.info("### model_module {}".format(model.module.head))
    num_feature = model.module.head.projection.in_features
    model.module.head.projection = nn.Linear(num_feature, 125, bias=True)
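The layer swap can be tried in isolation before touching train_net.py. The following is a minimal sketch, not the SlowFast code: ToyHead and replace_projection are hypothetical names, and ToyHead only imitates the dropout and projection parts of ResNetBasicHead. It also keeps the new layer on the same device as the old one and builds the optimizer after the swap, since an optimizer created earlier would not know about the new parameters.

```python
import torch
import torch.nn as nn


class ToyHead(nn.Module):
    """Hypothetical stand-in for ResNetBasicHead: dropout followed by a linear projection."""

    def __init__(self, dim_in=2304, num_classes=400):
        super().__init__()
        self.dropout = nn.Dropout(p=0.5)
        self.projection = nn.Linear(dim_in, num_classes, bias=True)

    def forward(self, x):
        return self.projection(self.dropout(x))


def replace_projection(head, num_classes):
    """Swap head.projection for a freshly initialised layer with num_classes outputs."""
    old = head.projection
    new = nn.Linear(old.in_features, num_classes, bias=old.bias is not None)
    # Keep the new layer on the same device and dtype as the layer it replaces.
    head.projection = new.to(device=old.weight.device, dtype=old.weight.dtype)
    return head


head = ToyHead()
replace_projection(head, num_classes=125)  # 125 verb classes
# Build the optimizer only after the swap so that it tracks the new parameters.
optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
logits = head(torch.randn(4, 2304))
print(logits.shape)  # torch.Size([4, 125])
```

Whether the optimizer in train_net.py is constructed before or after the insertion point is worth checking; if it is constructed before, the replaced layer would need to be registered with it.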
python tools/run_net.py --cfg configs/Kinetics/c2/SLOWFAST_8x8_R50.yaml DATA.PATH_TO_DATA_DIR /home/yanai-lab/ide-k/ide-k/epic/data/processed/gulp TRAIN.CHECKPOINT_FILE_PATH ./checkpoints/SLOWFAST_8x8_R50.pkl TRAIN.ENABLE True NUM_GPUS 10 TRAIN.CHECKPOINT_TYPE caffe2 TEST.CHECKPOINT_TYPE caffe2 TRAIN.CHECKPOINT_INFLATE True TRAIN.DATASET epic
After modifying the code for fine-tuning, I ran the command above and got the following result.
from slowfast.datasets.epic
[INFO: epic.py: 48]: Constructing Epic train...
[INFO: train_net.py: 278]: ##########model_moduleResNetBasicHead(
(pathway0_avgpool): AvgPool3d(kernel_size=[8, 7, 7], stride=1, padding=0)
(pathway1_avgpool): AvgPool3d(kernel_size=[32, 7, 7], stride=1, padding=0)
(dropout): Dropout(p=0.5, inplace=False)
(projection): Linear(in_features=2304, out_features=400, bias=True)
(act): Softmax(dim=4)
)
[INFO: train_net.py: 281]: #####model_moduleResNetBasicHead(
(pathway0_avgpool): AvgPool3d(kernel_size=[8, 7, 7], stride=1, padding=0)
(pathway1_avgpool): AvgPool3d(kernel_size=[32, 7, 7], stride=1, padding=0)
(dropout): Dropout(p=0.5, inplace=False)
(projection): Linear(in_features=2304, out_features=125, bias=True)
(act): Softmax(dim=4)
)
from slowfast.datasets.epic
[INFO: epic.py: 48]: Constructing Epic train...
[INFO: epic.py: 77]: Constructing epic dataloader (size: 28472) from /home/yanai-lab/ide-k/ide-k/epic/data/processed/gulp/rgb_train
[INFO: epic.py: 77]: Constructing epic dataloader (size: 28472) from /home/yanai-lab/ide-k/ide-k/epic/data/processed/gulp/rgb_train
Traceback (most recent call last):
File "tools/run_net.py", line 152, in <module>
main()
File "tools/run_net.py", line 124, in main
daemon=False,
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/utils/multiprocessing.py", line 50, in run
func(cfg)
File "/host/space0/ide-k/out_git/SlowFast-master/tools/train_net.py", line 285, in train
train_loader = loader.construct_loader(cfg, "train")
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/datasets/loader.py", line 82, in construct_loader
sampler = DistributedSampler(dataset) if cfg.NUM_GPUS > 1 else None
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 39, in __init__
self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/datasets/epic.py", line 161, in __len__
return len(self._path_to_videos)
AttributeError: 'Epic' object has no attribute '_path_to_videos'
Judging from the error, the cause is that I had simply copied __len__ from the Kinetics dataset class. I changed it to:
def __len__(self):
    return len(self.gdict)
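For reference, here is a minimal sketch of a dataset whose length comes from a dict-like store. DictBackedDataset and the sample keys are hypothetical, and it assumes gdict behaves like a mapping from segment id to metadata; the real Epic class may differ.

```python
from torch.utils.data import Dataset


class DictBackedDataset(Dataset):
    """Hypothetical dataset whose samples live in a dict, similar in spirit to self.gdict."""

    def __init__(self, gdict):
        self.gdict = gdict
        # Fix the key order once so an integer index always maps to the same sample.
        self._keys = sorted(gdict)

    def __len__(self):
        # Samplers such as DistributedSampler call len(dataset) to size their shards.
        return len(self.gdict)

    def __getitem__(self, index):
        key = self._keys[index]
        return key, self.gdict[key]


dataset = DictBackedDataset({"clip_a": {"verb": 3}, "clip_b": {"verb": 7}})
print(len(dataset))  # 2
```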
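After the change I ran it again and training appeared to start, but then another error came up. It is reported as a memory error, but the train loader is still returning unprocessed string data, so I will fix that first and then try again.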
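Problems:
- Changing how the EPIC train and validation data are split. The EPIC dataset only provides a train set and a test set, so the user has to split the train set into train and val. For reproducibility, instead of k-fold validation I fix the random seed and split train/val 8:2, so that every run trains on the same split.
- Where to modify the model for training. The modification is currently made at model.module.head.projection. Replacing projection with nn.Linear() and setting the output size to 125 makes it possible to train on verbs. Ultimately, however, verbs and nouns have to be trained together, and I think that becomes possible by changing the softmax in act rather than projection. What I have not figured out yet is how to change not only a layer's output size but also the forward pass. My working hypothesis is to replace the entire head with a module of my own.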
-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/utils/multiprocessing.py", line 50, in run
func(cfg)
File "/host/space0/ide-k/out_git/SlowFast-master/tools/train_net.py", line 303, in train
train_epoch(train_loader, model, optimizer, train_meter, cur_epoch, cfg)
File "/host/space0/ide-k/out_git/SlowFast-master/tools/train_net.py", line 44, in train_epoch
for cur_iter, (inputs, labels, _, meta) in enumerate(train_loader):
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
w.start()
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
MemoryError
/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 29 leaked semaphores to clean up at shutdown
Solution: for the train/val split, do not split on the dataset side. Instead, load train and val from the same dataset and split on the dataloader side. Example: train_test_split (from sklearn).
Insert the code at this line: https://github.com/facebookresearch/SlowFast/blob/25a0f633b7/tools/train_net.py#L275
import numpy as np
import torch
from torch.utils.data.sampler import SubsetRandomSampler

if cfg.TRAIN.DATASET == "epic":
    # Reuse the dataset object built by the standard train loader.
    dataset = loader.construct_loader(cfg, "train").dataset
    batch_size = 16
    validation_split = 0.2
    shuffle_dataset = True
    random_seed = 42
    # Creating data indices for training and validation splits:
    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    split = int(np.floor(validation_split * dataset_size))
    if shuffle_dataset:
        # A fixed seed gives the same split on every run.
        np.random.seed(random_seed)
        np.random.shuffle(indices)
    train_indices, val_indices = indices[split:], indices[:split]
    # Creating PT data samplers and loaders:
    train_sampler = SubsetRandomSampler(train_indices)
    valid_sampler = SubsetRandomSampler(val_indices)
    train_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, sampler=train_sampler
    )
    val_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, sampler=valid_sampler
    )
else:
    train_loader = loader.construct_loader(cfg, "train")
    val_loader = loader.construct_loader(cfg, "val")
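The seeded 8:2 split can also be checked on its own, outside SlowFast. This is a minimal sketch under assumptions: split_train_val is a hypothetical helper, a TensorDataset stands in for the EPIC training set, and Subset is used instead of SubsetRandomSampler so that each half is an ordinary dataset that any sampler can wrap.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


def split_train_val(dataset, val_fraction=0.2, seed=42):
    """Return (train_subset, val_subset) using a seeded, repeatable permutation."""
    indices = np.random.RandomState(seed).permutation(len(dataset))
    n_val = int(np.floor(val_fraction * len(dataset)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return Subset(dataset, train_idx.tolist()), Subset(dataset, val_idx.tolist())


# Toy dataset standing in for the EPIC training set.
toy = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.arange(100))
train_set, val_set = split_train_val(toy)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False)
print(len(train_set), len(val_set))  # 80 20
```

Because the permutation comes from a private RandomState, the split does not depend on, or disturb, the global NumPy random state.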
In the end I modified the loader, and the following error came up.
Traceback (most recent call last):
File "tools/run_net.py", line 152, in <module>
main()
File "tools/run_net.py", line 124, in main
daemon=False,
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 7 terminated with the following error:
Traceback (most recent call last):
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/utils/multiprocessing.py", line 50, in run
func(cfg)
File "/host/space0/ide-k/out_git/SlowFast-master/tools/train_net.py", line 303, in train
train_epoch(train_loader, model, optimizer, train_meter, cur_epoch, cfg)
File "/host/space0/ide-k/out_git/SlowFast-master/tools/train_net.py", line 69, in train_epoch
preds = model(inputs)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/models/video_model_builder.py", line 362, in forward
x = self.s1(x)
File "/home/yanai-lab/ide-k/ide-k/pyenv/slowfast/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/yanai-lab/ide-k/ide-k/out_git/SlowFast-master/slowfast/models/stem_helper.py", line 92, in forward
), "Input tensor does not contain {} pathway".format(self.num_pathways)
AssertionError: Input tensor does not contain 2 pathway
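The part that looks relevant: https://github.com/facebookresearch/SlowFast/blob/25a0f633b7/slowfast/datasets/kinetics.py#L209 The cause was that I had commented out the function that prepares the two inputs, the slow pathway and the fast pathway. With that restored, I think it will work.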
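To illustrate what the model expects, here is a minimal sketch of packing one clip into a two-pathway input list. It is written from memory of how the SlowFast dataset code does it and is not copied from the repository; alpha=4 is an assumption chosen to match the 8-frame and 32-frame pooling kernels shown in the head above.

```python
import torch


def pack_pathway_output(frames, alpha=4):
    """Build [slow, fast] inputs from a clip shaped (C, T, H, W).

    alpha is the assumed frame-rate ratio between the fast and slow pathways.
    """
    fast_pathway = frames
    # Keep T // alpha evenly spaced frames for the slow pathway.
    num_slow = frames.shape[1] // alpha
    index = torch.linspace(0, frames.shape[1] - 1, num_slow).long()
    slow_pathway = torch.index_select(frames, 1, index)
    return [slow_pathway, fast_pathway]


clip = torch.randn(3, 32, 224, 224)
slow, fast = pack_pathway_output(clip)
print(slow.shape, fast.shape)
```

If the dataset returns a single tensor instead of this two-element list, the stem raises the "Input tensor does not contain 2 pathway" assertion seen in the traceback.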
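EPIC has two output FC layers, one for verbs and one for nouns, so the model's final layer and the loss both need to be modified. I will use the official EPIC-Kitchens action recognition implementation as a reference.
Model structure: https://github.com/epic-kitchens/action-models/blob/master/tsn.py#L49
When the model class is instantiated, a class_count like the following is passed as the num_class argument.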
verb_class_count, noun_class_count = 125, 352
class_count = (verb_class_count, noun_class_count)
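One way to use such a class_count tuple is a head with one linear layer per task, replacing the whole head with a custom module. The sketch below is an illustration only: TwoHeadClassifier is a hypothetical name, global average pooling is substituted for the fixed AvgPool3d kernels, and the 2048 + 256 channel split is an assumption that matches in_features=2304.

```python
import torch
import torch.nn as nn


class TwoHeadClassifier(nn.Module):
    """Hypothetical head: pools each pathway, then applies separate verb and noun layers."""

    def __init__(self, dim_in=(2048, 256), class_count=(125, 352), dropout=0.5):
        super().__init__()
        num_verbs, num_nouns = class_count
        # Global average pooling per pathway (an assumption, see the note above).
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.dropout = nn.Dropout(dropout)
        self.verb_projection = nn.Linear(sum(dim_in), num_verbs)
        self.noun_projection = nn.Linear(sum(dim_in), num_nouns)

    def forward(self, inputs):
        # inputs: list of pathway features, each shaped (N, C_i, T_i, H, W).
        pooled = [self.pool(x).flatten(1) for x in inputs]
        features = self.dropout(torch.cat(pooled, dim=1))
        # Return raw logits for both tasks; softmax is left to the loss or to inference code.
        return self.verb_projection(features), self.noun_projection(features)


head = TwoHeadClassifier()
slow = torch.randn(2, 2048, 8, 7, 7)
fast = torch.randn(2, 256, 32, 7, 7)
verb_logits, noun_logits = head([slow, fast])
print(verb_logits.shape, noun_logits.shape)
```

Returning a tuple changes the model's output type, so the training loop, the loss, and the meters would all need to handle two outputs.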
Loss structure: https://github.com/epic-kitchens/action-models/blob/master/tsn.py#L403
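A common way to train two classification heads together is to sum a cross-entropy loss per task. The sketch below shows that idea only; verb_noun_loss is a hypothetical helper, the equal weighting is an assumption, and it is not claimed to match the linked implementation.

```python
import torch
import torch.nn.functional as F


def verb_noun_loss(verb_logits, noun_logits, verb_labels, noun_labels):
    """Sum of per-task cross-entropy losses (equal weighting is an assumption)."""
    verb_loss = F.cross_entropy(verb_logits, verb_labels)
    noun_loss = F.cross_entropy(noun_logits, noun_labels)
    return verb_loss + noun_loss


# Random logits and labels standing in for one batch of model outputs.
verb_logits = torch.randn(4, 125, requires_grad=True)
noun_logits = torch.randn(4, 352, requires_grad=True)
verb_labels = torch.randint(0, 125, (4,))
noun_labels = torch.randint(0, 352, (4,))
loss = verb_noun_loss(verb_logits, noun_logits, verb_labels, noun_labels)
loss.backward()  # gradients flow to both sets of logits
```

F.cross_entropy expects raw logits, so the heads should not apply softmax before this loss.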
I will write the test implementation in the following Colaboratory notebook: https://colab.research.google.com/drive/1Yt0wNPd-TfDb3f9vbXMbUmSBl1ZaGAmL
First, the command for training on the regular Kinetics dataset.
--cfg configs/Kinetics/c2/SLOWFAST_8x8_R50.yaml DATA.PATH_TO_DATA_DIR dataset/ TRAIN.CHECKPOINT_FILE_PATH ./checkpoints/SLOWFAST_8x8_R50.pkl TRAIN.ENABLE True NUM_GPUS 10 TRAIN.CHECKPOINT_TYPE caffe2 TEST.CHECKPOINT_TYPE caffe2 TRAIN.CHECKPOINT_INFLATE True TRAIN.DATASET epic
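This time I used gp38, which has 10 GPUs. On a machine with only a few GPUs, for example two, the run fails with an out-of-memory complaint. I set TRAIN.CHECKPOINT_INFLATE to True on the understanding that it is the fine-tuning mode, in which the weights loaded from the checkpoint are then updated by training. (The comment in SlowFast's config defaults describes this flag as performing inflation when loading a checkpoint, so this understanding may need checking.)
However, perhaps because loading into memory is very slow, it takes a very long time.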