microsoft / ptgnn

A PyTorch Graph Neural Network Library
MIT License

Training the model for varmisuse task #1

Closed egor-bogomolov closed 4 years ago

egor-bogomolov commented 4 years ago

Hey! I tried to train the varmisuse model to explore how it works on data from unseen projects. I have a few questions about it:

  1. It seems the dataset format has changed compared to the published version of the data. I found a related issue in another repository, but unfortunately I had already reorganized the data before finding it: I converted the json files into jsonlines and changed the structure from project/{train|test|valid}/files to {train|test|valid}/files. It would be nice to either duplicate the reorganizing script in this repo or add a link to that issue in the README.
  2. After reorganizing the data, I tried to run training with the default settings (minibatch size = 300) on an instance with 94 GB of RAM and 48 CPUs. The instance has no GPU because I wanted to measure memory usage first, so that I could allocate a properly sized GPU instance afterward. Unfortunately, training fails with an OOM error: it quickly uses all 94 GB and asks for more. Moreover, I tried creating a smaller version of the dataset by picking only one project for each of train/validation/test, and that didn't really help: with a minibatch size of 100 and a single project in the train part I still got an OOM. Is this expected behavior?
  3. Which instance do you recommend for training the model? In particular, how much RAM do I need, and how long does training take on, say, a V100?
  4. Do you have a pre-trained model that you could share? Maybe I could skip training altogether and just run the already trained model on different data.

Thanks a lot in advance, and thanks for the great projects and papers!

mallamanis commented 4 years ago

Hi Egor,

Something must be going wrong. I can train this with about 3 GB of RAM and about 60% of a K80.

Including the scripts makes a lot of sense, but first let's debug the issue you're seeing...

#!/bin/bash

# Run this from the directory that contains graph-dataset/

# Flatten the per-project splits into top-level train/valid/test directories
mkdir -p train valid test
mv ./graph-dataset/*/graphs-train/* train
mv ./graph-dataset/*/graphs-valid/* valid
mv ./graph-dataset/*/graphs-test/* test

# Convert each json.gz file into jsonl.gz
mkdir -p train-out valid-out test-out
python convert.py ./train/ ./train-out
python convert.py ./valid/ ./valid-out
python convert.py ./test/ ./test-out

convert.py

import sys
import os
from glob import iglob

from dpu_utils.utils import load_json_gz, save_jsonl_gz

# Usage: python convert.py <input-dir> <output-dir>
# Re-saves every .gz file in <input-dir> as a .jsonl.gz file in <output-dir>.
for file in iglob(os.path.join(sys.argv[1], "*.gz")):
    filename = os.path.basename(file)[:-len(".gz")] + ".jsonl.gz"
    target_path = os.path.join(sys.argv[2], filename)
    print(f"Converting {file} to {target_path}.")

    save_jsonl_gz(load_json_gz(file), target_path)

Is this similar to what you're using?

mallamanis commented 4 years ago

I just ran this on the CPU and I can replicate the issue... I assume the problem is that PyTorch fuses some operations for the character CNN on the GPU but not on the CPU. If you change the model to use subtokens (change "char" to "subtoken" here), the problem goes away.
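A minimal sketch of the kind of change, assuming the node-label embedder is configured through a token-splitting hyperparameter (the class and argument names below are illustrative and may not match the current code exactly):

# Illustrative only: the class/argument names here are assumptions, not the exact ptgnn API.
# The point is switching the node-label embedding from "char" to "subtoken".
from ptgnn.neuralmodels.embeddings.strelementrepresentationmodel import StrElementRepresentationModel

# Character-CNN embedding of node labels (the variant that blows up on CPU here):
# node_embedder = StrElementRepresentationModel(token_splitting="char", embedding_size=128)

# Subtoken embedding of node labels (works fine on CPU):
node_embedder = StrElementRepresentationModel(token_splitting="subtoken", embedding_size=128)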

The performance of the subtoken and char models is fairly similar, so this might be good enough for now. I'll try to investigate why the char CNN has such terrible performance on CPU, hopefully next week...

egor-bogomolov commented 4 years ago

@mallamanis thanks a lot for the lightning-fast reply!

The convert script is very similar to mine. I'll try the subtoken model and report the results.

egor-bogomolov commented 4 years ago

Sorry for the late reply. Everything indeed works well once I changed char to subtoken.

Do you have any insight into how many epochs the model needs to converge with the default settings?

mallamanis commented 4 years ago

I don't remember, and I don't have a recent run around. The code uses early stopping, so you don't need to wait for convergence explicitly: training stops automatically once the model has converged.
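In case it helps, the early stopping follows the usual patience-based pattern; a minimal sketch of the idea (illustrative, not the actual ptgnn trainer code):

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=300, patience=5):
    """Train until the validation metric stops improving for `patience` epochs."""
    best_valid_metric = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        valid_metric = evaluate()  # e.g. validation loss; lower is better
        if valid_metric < best_valid_metric:
            best_valid_metric = valid_metric
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_valid_metric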

mallamanis commented 4 years ago

@egor-bogomolov Closing this issue for now. Happy to re-open this, if needed.