Closed by jeswan
Comment by pyeres Tuesday Mar 03, 2020 at 14:51 GMT
Thanks @pruksmhc — two requests for information in the PR description (one of them concerning the `cuda` config field).
Comment by pyeres Tuesday Mar 03, 2020 at 23:45 GMT
PyTorch magic: looks like the weird exception you were seeing can be addressed by wrapping the lists of modules (under the `projs` and `span_extractors` fields) in an `nn.ModuleList()`, like this:
```python
if self.is_symmetric or self.single_sided:
    # Use None as dummy padding for readability,
    # so that we can index projs[1] and projs[2]
    self.projs = nn.ModuleList([None, self.proj1, self.proj1])
else:
    # Separate params for span2
    self.proj2 = self._make_cnn_layer(d_inp)
    self.projs = nn.ModuleList([None, self.proj1, self.proj2])

# Span extractor, shared for both span1 and span2.
self.span_extractor1 = self._make_span_extractor()
if self.is_symmetric or self.single_sided:
    self.span_extractors = nn.ModuleList([None, self.span_extractor1, self.span_extractor1])
else:
    self.span_extractor2 = self._make_span_extractor()
    self.span_extractors = nn.ModuleList([None, self.span_extractor1, self.span_extractor2])
```
After that, the `forward` needs to be updated, but it looks like it can run under multi-GPU.
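As background (not part of the PR), a minimal sketch of why the `nn.ModuleList()` wrapper matters: modules stored in a plain Python list are not registered as submodules, so `.parameters()`, `.to(device)`, and `nn.DataParallel` replication all miss them, which is a common source of the kind of exception described above. The class names below are hypothetical.

```python
import torch.nn as nn

class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the Linear layers are NOT registered as submodules.
        self.projs = [nn.Linear(8, 8), nn.Linear(8, 8)]

class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: the same layers ARE registered, so .to(device),
        # .parameters(), and DataParallel replication handle them correctly.
        self.projs = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8)])

print(len(list(PlainListModel().parameters())))   # 0 -- parameters are invisible
print(len(list(ModuleListModel().parameters())))  # 4 -- two weights + two biases
```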
Comment by pruksmhc Wednesday Mar 04, 2020 at 19:10 GMT
- This fix allows edge probing tasks to work in the single-GPU case. I only tested on a machine with one GPU. However, if there are multiple GPUs and only one GPU ID is specified, only that GPU should be used (and DataParallel is not used). If that is not the case, it calls for a deeper investigation of our DataParallel implementation.
- I ran this fix for edges-dpr. All the probing datasets have the same API (two sentences, a multi-classification task) and pass through the same code (which is what I changed in this PR), so I only tested one task. I can run more validation checks on other edge probing tasks.
Comment by pyeres Thursday Mar 05, 2020 at 15:56 GMT
@pruksmhc — thanks for the info!
I'm planning to open a (WIP) PR on this branch today to extend your fix to cover multi-GPU.
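For readers following the thread, a minimal sketch (hypothetical helper, not jiant's actual code) of the dispatch rule described above: wrap the model in `nn.DataParallel` only when more than one GPU ID is configured, otherwise place it on the single requested device.

```python
import torch
import torch.nn as nn

def place_model(model: nn.Module, device_ids):
    """Hypothetical helper mirroring the single- vs multi-GPU behavior described above."""
    if not torch.cuda.is_available() or not device_ids:
        return model  # CPU fallback
    model = model.cuda(device_ids[0])
    if len(device_ids) == 1:
        return model  # single GPU: no DataParallel wrapper
    return nn.DataParallel(model, device_ids=device_ids)

# place_model(my_model, [0])    -> plain model on cuda:0
# place_model(my_model, [0, 1]) -> nn.DataParallel over cuda:0 and cuda:1
```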
Comment by pyeres Thursday Mar 05, 2020 at 19:40 GMT
@pruksmhc — I extended your fix to cover multi-GPU mode (w/ PR #1029). If my changes look OK, please merge them into this branch and I'll plan to validate our changes together by comparing:
- Edge probing experiment run w/ single-GPU using jiant 1.2.1
- Edge probing experiment run w/ single-GPU using this branch
- Edge probing experiment run w/ multi-GPU using this branch
Comment by pyeres Friday Mar 06, 2020 at 22:19 GMT
fyi, here are results on edges-ner-ontonotes* showing consistent performance w/ these changes:
version | GPUs | macro avg. | mcc | acc. | precision | recall | f1 |
---|---|---|---|---|---|---|---|
1.2.1 | 1 | 0.972 | 0.970 | 0.970 | 0.973 | 0.971 | 0.972 |
1.3.0 @ 0c3d128 | 1 | 0.971 | 0.969 | 0.969 | 0.971 | 0.970 | 0.971 |
1.3.0 @ 0c3d128 | 2 | 0.971 | 0.969 | 0.969 | 0.972 | 0.970 | 0.971 |
Please let me know if there are any other experiments I can help run.
* using `base_edgeprobing.conf` w/ overrides `batch_size=4, reload_vocab=1, do_pretrain=0, do_target_task_training=1, do_full_eval=1, lr=5e-6, dropout=0.1, random_seed=42, input_module=roberta-large, pretrain_tasks=none, target_tasks=edges-ner-ontonotes`
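As an aside, a minimal sketch of the kind of consistency check behind the table above, assuming the metrics from each run are gathered into plain Python dicts (the values below are transcribed from the table; the 0.005 tolerance is an arbitrary illustration, not a project threshold):

```python
# Metrics for edges-ner-ontonotes, transcribed from the table above.
baseline = {"macro_avg": 0.972, "mcc": 0.970, "acc": 0.970, "f1": 0.972}   # jiant 1.2.1, 1 GPU
candidate = {"macro_avg": 0.971, "mcc": 0.969, "acc": 0.969, "f1": 0.971}  # this branch, 2 GPUs

def check_consistency(a: dict, b: dict, tol: float = 0.005) -> None:
    # Flag any shared metric whose absolute difference exceeds the tolerance.
    for key in sorted(a.keys() & b.keys()):
        diff = abs(a[key] - b[key])
        status = "OK" if diff <= tol else "REGRESSION?"
        print(f"{key:>10}: {a[key]:.3f} vs {b[key]:.3f} (diff={diff:.3f}) {status}")

check_consistency(baseline, candidate)
```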
Comment by pruksmhc Sunday Mar 08, 2020 at 12:49 GMT
Merged! Also, I did smoke tests on all edge probing and Senteval tasks on single-GPU @pyeres, so I think all that remains is smoke tests for multi-GPU. Given that we've already done extensive performance validation tests, for the sake of time I don't think we need to run them for all of the edge probing and Senteval tasks.
Comment by pyeres Monday Mar 09, 2020 at 17:20 GMT
Great, thanks @pruksmhc — I'll kick off runs for edge probing tasks w/ multi-GPU. Can you show me an example of the kind of run config you used for Senteval tasks? (I'll copy it for multi-GPU).
Comment by pruksmhc Monday Mar 09, 2020 at 17:38 GMT
```bash
export TM_PROBING_TASK_NAMES=(edges-spr1 edges-spr2 edges-dpr se-probing-bigram-shift)
for i in "${TM_PROBING_TASK_NAMES[@]}"
do
    python main.py --config_file "jiant/config/taskmaster/base_roberta.conf" \
        --overrides "max_vals=0, exp_name=edges-check, batch_size=2, reload_vocab=1, do_pretrain=0, do_target_task_training=1, do_full_eval=1, lr=5e-6, dropout=0.1, random_seed=42, input_module=roberta-large, pretrain_tasks=$i, target_tasks=$i, run_name=$i, val_interval=100, target_train_val_interval=50, max_vals=1, target_train_max_vals=1"
done
```
Comment by pyeres Tuesday Mar 10, 2020 at 12:48 GMT
Thanks, @pruksmhc — confirmed multi-GPU validation runs don't show regression.
Issue by pruksmhc Monday Mar 02, 2020 at 22:29 GMT
Originally opened as https://github.com/nyu-mll/jiant/pull/1025
Addresses https://github.com/nyu-mll/jiant/issues/1017. This modifies edge probing to fit the new `update_metrics` function signature. There's still a weird error going on with edge probing on multi-GPU, so for now I raise an error in that case.
pruksmhc included the following code: https://github.com/nyu-mll/jiant/pull/1025/commits
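Since the issue description above only says that an error is raised for the multi-GPU case, here is a hypothetical, self-contained sketch of that kind of guard (the error class, function name, and the `edges-` prefix check are assumptions for illustration; edge probing task names in this thread, such as `edges-dpr` and `edges-ner-ontonotes`, follow that prefix):

```python
from typing import Sequence

class MultiGpuNotSupportedError(NotImplementedError):
    """Hypothetical error type; the PR description only says an error is raised."""

def check_multi_gpu_support(task_names: Sequence[str], device_ids: Sequence[int]) -> None:
    # Hypothetical guard: refuse multi-GPU runs for edge probing tasks until supported.
    edge_tasks = [t for t in task_names if t.startswith("edges-")]
    if edge_tasks and len(device_ids) > 1:
        raise MultiGpuNotSupportedError(
            f"Edge probing tasks {edge_tasks} are not yet supported on multiple GPUs "
            f"(got device_ids={list(device_ids)})."
        )

# Example: this call would raise, matching the behavior described in the PR.
# check_multi_gpu_support(["edges-dpr"], device_ids=[0, 1])
```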