jeswan opened this issue 4 years ago
Comment by iftenney Monday Jul 22, 2019 at 20:57 GMT
I don't know how far we are from supporting this, but the last time I looked into it there were a lot of places where we assumed a single tokenization, and I had put in some asserts to make sure this was the case.
Comment by pruksmhc Tuesday Jul 23, 2019 at 03:14 GMT
I believe we are pretty close to supporting this (I can see it being as simple as changing the file names in the cache directories to include the tokenizer name). However, the question is whether we should support it, since the way we've been using exp_name is with each exp_dir having one type of tokenization.
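For concreteness, a minimal sketch of what tokenizer-keyed cache filenames could look like (the `cache_path` helper and its layout are hypothetical, not jiant's actual API):

```python
import os

def cache_path(exp_dir, task_name, split, tokenizer_name):
    """Hypothetical helper: build a preprocessing-cache filename that is
    unique per tokenizer, so runs with different tokenizers in the same
    exp_dir no longer collide on the same cache file."""
    safe_tok = tokenizer_name.replace("/", "_")  # e.g. "bert-base-uncased"
    return os.path.join(exp_dir, "preproc", f"{task_name}.{split}.{safe_tok}.cache")

# Two tokenizers now get distinct cache files instead of sharing one:
path_a = cache_path("/tmp/exp", "mnli", "train", "bert-base-cased")
path_b = cache_path("/tmp/exp", "mnli", "train", "bert-base-uncased")
assert path_a != path_b
```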
Comment by sleepinyourhat Tuesday Jul 23, 2019 at 15:47 GMT
Yep, this seems to be causing unexpectedly low performance in some runs. I'm guessing that this has to do with the fact that many of the big transformer models use the same vocab size, so you'll never hit index out of range errors. Instead, you'll just use the wrong vocabulary and feed nonsense indices into the model.
Does anyone know this code well enough to figure out how to add some asserts?
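To illustrate the failure mode with a toy example (made-up vocabularies, not the real BERT ones): when two tokenizers have the same vocab size, looking up one tokenizer's indices in the other's vocabulary never goes out of range, so the error is silent.

```python
# Toy illustration: two vocabularies of the same size. Indexing with one
# and decoding with the other never raises IndexError -- it just silently
# produces the wrong tokens.
vocab_cased = ["[PAD]", "The", "the", "Cat", "cat", "sat"]
vocab_uncased = ["[PAD]", "the", "cat", "sat", "on", "mat"]

def index(tokens, vocab):
    return [vocab.index(t) for t in tokens]

ids = index(["The", "cat", "sat"], vocab_cased)  # [1, 4, 5]
decoded = [vocab_uncased[i] for i in ids]        # ['the', 'on', 'mat']

assert decoded != ["The", "cat", "sat"]  # silently wrong, no IndexError
```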
Comment by pruksmhc Tuesday Jul 23, 2019 at 16:19 GMT
I'll take a crack at this. So, just to be clear, we're enforcing our paradigm of one tokenization per experiment directory?
Comment by sleepinyourhat Tuesday Jul 23, 2019 at 16:59 GMT
Up to you. If it's easy to allow multiple, go for it. Otherwise, just add some asserts.
Comment by iftenney Friday Jul 26, 2019 at 14:11 GMT
Posted this on #866, but this seems like a better place:
What's the rationale for wanting different tokenizations within the same exp_dir? I thought exp_dir existed to allow sharing of preprocessing data, but that all assumes a particular tokenization anyway so it seems like there's no benefit if we relax that constraint.
The scenario initially described (do a run with tokenizer A, then re-use exp dir with tokenizer B) seems like a case for throwing a hard error.
Comment by sleepinyourhat Friday Jul 26, 2019 at 15:39 GMT
The common-sense motivation for exp_dir, and the one that we give in defaults.conf, is to group runs belonging to a single experimental setup. That'll usually involve the same code version, the same evaluation data, and at least roughly comparable models. But it's pretty normal to run experiments where you swap out input layers.
So, we need to either support that—which would mean storing multiple copies of each index file when using multiple tokenizers—or make it really clear that we don't support that.
Comment by W4ngatang Friday Jul 26, 2019 at 21:14 GMT
My understanding of exp_dir is what Ian said, though it's probably not named or explained clearly. Maybe we should change exp_dir -> preproc/data_cache_dir and otherwise get rid of exp_dir.
That still seems pretty hairy, though, with a non-fixed vocabulary (e.g. using GloVe vectors rather than something like BERT). We would likely want to recompute the vocab for task B after experimenting on task A, but then if we want to go back to task A, we'd need to recompute it again...
The best option seems to be to just check the preproc/vocab that already exists in a (renamed) exp_dir and fail if it doesn't match the tokenization we're using.
Issue by sleepinyourhat Monday Jul 22, 2019 at 18:57 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/862
Currently, if you run a bert-base-cased run in some exp_dir, then a bert-base-uncased run, you'll get odd behavior where the model reads the raw data again but doesn't re-index it.
This looks pretty clearly wrong, since the two runs have different vocabularies, but I'm not sure what direction to go for a fix. Are we reasonably close to being able to support multiple tokenizers in the same exp_dir, such that we should fix this, or are we far enough away that we should simply add asserts to make sure that this fails?
I'll convert this to a normal bug issue once I have a better sense of the problem.