kelvinguu / neural-editor

Repository for "Generating Sentences by Editing Prototypes"

why training is always killed without any error information #13

Open jdxyw opened 6 years ago

jdxyw commented 6 years ago

Hi,

My training is always killed without any error message, as shown below.

uncomitted changes being stored as patches
New TrainingRun created at: /data/edit_runs/7
Optimized batches: reduced cost from 45709568 (naive) to 20758016 (0.545871533942% reduction).
Optimal (batch_size=1) would be 20741962.
Passed batching test
Streaming training examples:   6%|5         | 399/7032 [48:47<12:31:31,  6.80s/it]Killed
adempsey commented 6 years ago

I am encountering this issue as well. Running with the edit_logp config, the process is consistently killed at the same point with the following output:

[localhost] local: wc -l /data/yelp_dataset_large_split/train.tsv
Reading data file.:  20%|#############4
Reading data file.:  26%|#################3
Killed

The same issue is occurring with other configs as well.

yamsgithub commented 6 years ago

I have the same issue. Training is consistently killed.

[localhost] local: wc -l /data/onebillion_split/train.tsv
Reading data file.:  17%|##############1 | 582582/3506331 [02:43<19:10:00, 42.37it/s]
Reading data file.:  17%|##############5 | 594704/3506331 [02:44<39:10, 1238.52it/s]
Killed

yamsgithub commented 6 years ago

Looks like this is a memory issue. I ran it on my cluster and it ran fine.
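For anyone else debugging this: a bare `Killed` with no Python traceback usually means the Linux OOM killer terminated the process; running `dmesg | grep -i 'killed process'` right after the crash typically confirms it. The progress bar suggests the loader materializes the whole `train.tsv` in memory. As a rough illustration only (a hypothetical helper, not the repo's actual loader), a generator-based reader keeps memory bounded regardless of file size:

```python
def iter_examples(path):
    """Yield one tab-separated training example at a time instead of
    reading the entire file into a list. Memory use stays constant
    no matter how large the TSV is."""
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n').split('\t')

# usage (path is a placeholder):
#   for fields in iter_examples('/data/onebillion_split/train.tsv'):
#       process(fields)
```

This doesn't change the model's memory footprint, but it rules out the data-reading stage as the source of the OOM.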

Vonzpf commented 6 years ago

@yamsgithub Hello, did you set up this project by running "run_docker.py"? Because of some network issues I could not run it successfully, so I installed all the packages one by one and hit a git-related error like this:

Traceback (most recent call last):
  File "textmorph/edit_model/main.py", line 34, in <module>
    exp = experiments.new(config)  # new experiment from config
  File "/data/User/zpf/neural-editor/gtd/ml/training_run.py", line 145, in new
    run.record_commit(self._src_dir)
  File "/data/User/zpf/neural-editor/gtd/ml/training_run.py", line 66, in record_commit
    self.metadata['commit'] = repo.head.object.hexsha.encode('utf-8')
  File "/data/Development/anaconda/envs/docker/lib/python2.7/site-packages/git/refs/symbolic.py", line 193, in _get_object
    return Object.new_from_sha(self.repo, hex_to_bin(self.dereference_recursive(self.repo, self.path)))
  File "/data/Development/anaconda/envs/docker/lib/python2.7/site-packages/git/refs/symbolic.py", line 135, in dereference_recursive
    hexsha, ref_path = cls._get_ref_info(repo, ref_path)
  File "/data/Development/anaconda/envs/docker/lib/python2.7/site-packages/git/refs/symbolic.py", line 184, in _get_ref_info
    return cls._get_ref_info_helper(repo, ref_path)
  File "/data/Development/anaconda/envs/docker/lib/python2.7/site-packages/git/refs/symbolic.py", line 167, in _get_ref_info_helper
    raise ValueError("Reference at %r does not exist" % ref_path)
ValueError: Reference at 'refs/heads/master' does not exist

It seems like a path problem. However, the issue persists even after I create the master file under refs/heads/.

yamsgithub commented 6 years ago

@Vonzpf Yes. I followed the instructions in the README and didn't have any issues. However, without a GPU the training has been running for 3 days now and is only about 36% complete, so I would recommend using GPUs; hopefully that is faster. This is on the one-billion-word dataset.

luciay commented 6 years ago

@yamsgithub Did you load any other modules besides pytorch and python when you ran the code on the cluster?

yamsgithub commented 6 years ago

@luciay I just used the Docker image, which sets up all the dependencies. I didn't have to install anything other than Docker on my machine.

yamsgithub commented 6 years ago

@luciay If you are running on a cluster, I would recommend creating a virtual environment and letting Docker install all the packages in that env.

Vonzpf commented 6 years ago

@yamsgithub Thank you! I solved that problem. This project needs git to record the state of the code. I had initialized a repo in my "/neural-editor/" folder but forgot to add and commit the code, so running "git add ." and "git commit" inside "/neural-editor/" fixed the problem.
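For anyone hitting the same `ValueError: Reference at 'refs/heads/master' does not exist`: it just means the repo has no commits yet, so HEAD points at an unborn branch and GitPython's `repo.head.object` cannot resolve. The fix above can be sketched as a small helper (hypothetical, not part of this repo; note that on newer git versions the default branch may be `main` rather than `master`):

```python
import subprocess

def ensure_initial_commit(repo_dir):
    """If the git repo at repo_dir has no commits yet, stage everything
    and create an initial commit so HEAD resolves to a real object."""
    def git(*args):
        return subprocess.run(['git', '-C', repo_dir] + list(args),
                              capture_output=True, text=True)

    # rev-parse --verify HEAD fails (nonzero) on an unborn branch
    if git('rev-parse', '--verify', 'HEAD').returncode != 0:
        git('add', '.')
        # -c flags supply an identity so the commit works even
        # without a global git config
        git('-c', 'user.email=you@example.com', '-c', 'user.name=you',
            'commit', '-m', 'initial commit')
```

After this, `repo.head.object.hexsha` in `record_commit` should succeed.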

JackLangerman commented 6 years ago

@yamsgithub I spoke with @luciay and she shared her batch script, which runs on the Prince cluster with Singularity instead of Docker, on CPU. I then made some modifications so it runs with GPU on the Prince cluster. You can see my fork here -> https://github.com/JackLangerman/neural-editor

Hope this helps people!