Closed: egor-bogomolov closed this issue 4 years ago
Hi Egor,
Something must be going wrong. I can train this with about 3 GB of RAM and about 60% of a K80.
Including the scripts makes a lot of sense, but first let's debug the issue you're seeing...
#!/bin/bash
# Script assumes graph-dataset is in the current directory
mkdir -p train valid test
mv ./graph-dataset/*/graphs-train/* train
mv ./graph-dataset/*/graphs-valid/* valid
mv ./graph-dataset/*/graphs-test/* test
mkdir -p train-out valid-out test-out
python convert.py ./train/ ./train-out
python convert.py ./valid/ ./valid-out
python convert.py ./test/ ./test-out
convert.py
import sys
import os
from glob import iglob
from dpu_utils.utils import load_json_gz, save_jsonl_gz

# Convert every .gz file in the input directory (argv[1]) to .jsonl.gz
# in the output directory (argv[2]).
for file in iglob(os.path.join(sys.argv[1], "*.gz")):
    filename = os.path.basename(file)[:-len(".gz")] + ".jsonl.gz"
    target_path = os.path.join(sys.argv[2], filename)
    print(f"Converting {file} to {target_path}.")
    save_jsonl_gz(load_json_gz(file), target_path)
is this similar to what you're using?
I just ran this on the CPU and I can replicate the issue... I assume the problem is that PyTorch fuses some operations of the character CNN on the GPU but not on the CPU. If you change the model to use subtokens (change "char" to "subtoken" here), the problem goes away.
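The exact setting lives in the model code linked above; as a purely illustrative sketch of the kind of change meant (the key name token_embedder_type below is a hypothetical placeholder, not the repo's actual hyperparameter name):

```python
# Hypothetical hyperparameter dict; the real key and its location are in the
# model code linked above.
hyperparameters = {
    "token_embedder_type": "char",  # character-CNN token embeddings (slow on CPU)
}

# Switch to subtoken embeddings, which avoid the CPU slowdown.
hyperparameters["token_embedder_type"] = "subtoken"
```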
The performance of the subtoken and char models is fairly similar, so this might be good enough for now. I'll try to investigate why the charCNN performs so badly on CPU, hopefully next week...
@mallamanis thanks a lot for the lightning-fast reply!
The convert script is very similar to mine. I will try the subtoken model and report the results.
Sorry for the late reply. Everything indeed works well after I changed char to subtoken.
Do you have any insights into how many epochs the model with default settings needs to converge?
I don't remember, and I don't have a recent run around. The code uses early stopping, so you don't need to wait for convergence explicitly: once the model has converged, training will stop automatically.
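Early stopping here is the usual patience-based scheme: stop once the validation metric has not improved for a fixed number of epochs. A minimal sketch of the idea (the patience value and the loss-based metric are illustrative, not the repo's actual defaults):

```python
def train_with_early_stopping(train_epoch, validate, patience=5, max_epochs=1000):
    """Run train_epoch()/validate() until the validation loss stops improving.

    train_epoch: callable that trains the model for one epoch.
    validate: callable returning the current validation loss (lower is better).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_loss
```

With this scheme the total number of epochs is data-dependent, which is why there is no fixed epoch count to quote.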
@egor-bogomolov Closing this issue for now. Happy to re-open this, if needed.
Hey! I tried to run training of the varmisuse model in order to explore how it works on data from unseen projects. I have a few questions regarding it: the dataset had to be reorganized from project/{train|test|valid}/files to {train|test|valid}/files. It would be nice to either duplicate the reorganizing script in this repo, or add a link to this issue in the README. Thanks a lot in advance, and thanks for the great projects and papers!