nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Using multiple tokenizers in the same `exp_dir` can produce broken results without an error message #862

Open · jeswan opened 4 years ago

jeswan commented 4 years ago

Issue by sleepinyourhat, Monday Jul 22, 2019 at 18:57 GMT. Originally opened as https://github.com/nyu-mll/jiant/issues/862


Currently, if you do a bert-base-cased run in some exp_dir and then follow it with a bert-base-uncased run, you get this odd behavior, where the pipeline re-reads the raw data but doesn't re-index it:

```
07/22 02:54:40 PM: Loading tasks...
07/22 02:54:40 PM: Writing pre-preprocessed tasks to /Users/srbowman/exp/jiant-demo/
07/22 02:54:40 PM:  Creating task commitbank from scratch.
07/22 02:54:40 PM:  Loading Tokenizer bert-base-uncased
07/22 02:54:41 PM: loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /Users/srbowman/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
07/22 02:54:41 PM:  Finished loading CommitmentBank data.
07/22 02:54:41 PM:  Task 'commitbank': |train|=250 |val|=56 |test|=250
07/22 02:54:41 PM:  Creating task mrpc from scratch.
07/22 02:54:46 PM:  Finished loading MRPC data.
07/22 02:54:46 PM:  Task 'mrpc': |train|=3668 |val|=408 |test|=1725
07/22 02:54:46 PM:  Creating task sst from scratch.
07/22 02:55:00 PM:  Finished loading SST data.
07/22 02:55:00 PM:  Task 'sst': |train|=67349 |val|=872 |test|=1821
07/22 02:55:00 PM:  Creating task sts-b from scratch.
07/22 02:55:03 PM:  Finished loading STS Benchmark data.
07/22 02:55:03 PM:  Task 'sts-b': |train|=5749 |val|=1500 |test|=1379
07/22 02:55:03 PM:  Finished loading tasks: commitbank mrpc sst sts-b.
07/22 02:55:03 PM: Loading token dictionary from /Users/srbowman/exp/jiant-demo/vocab.
07/22 02:55:03 PM:  Loaded vocab from /Users/srbowman/exp/jiant-demo/vocab
07/22 02:55:03 PM:  Vocab namespace chars: size 105
07/22 02:55:03 PM:  Vocab namespace tokens: size 1004
07/22 02:55:03 PM:  Vocab namespace xlnet-base-cased: size 32002
07/22 02:55:03 PM:  Finished building vocab.
07/22 02:55:03 PM:  Task 'commitbank', split 'train': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/commitbank__train_data
07/22 02:55:03 PM:  Task 'commitbank', split 'val': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/commitbank__val_data
07/22 02:55:03 PM:  Task 'commitbank', split 'test': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/commitbank__test_data
07/22 02:55:03 PM:  Task 'mrpc', split 'train': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/mrpc__train_data
07/22 02:55:03 PM:  Task 'mrpc', split 'val': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/mrpc__val_data
07/22 02:55:03 PM:  Task 'mrpc', split 'test': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/mrpc__test_data
07/22 02:55:03 PM:  Task 'sst', split 'train': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sst__train_data
07/22 02:55:03 PM:  Task 'sst', split 'val': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sst__val_data
07/22 02:55:03 PM:  Task 'sst', split 'test': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sst__test_data
07/22 02:55:03 PM:  Task 'sts-b', split 'train': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sts-b__train_data
07/22 02:55:03 PM:  Task 'sts-b', split 'val': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sts-b__val_data
07/22 02:55:03 PM:  Task 'sts-b', split 'test': Found preprocessed copy in /Users/srbowman/exp/jiant-demo/preproc/sts-b__test_data
07/22 02:55:03 PM:  Finished indexing tasks
07/22 02:55:03 PM:  Creating trimmed target-only version of commitbank train.
07/22 02:55:03 PM:  Creating trimmed pretraining-only version of mrpc train.
07/22 02:55:03 PM:  Creating trimmed pretraining-only version of sst train.
07/22 02:55:03 PM:  Creating trimmed target-only version of sts-b train.
```

This looks pretty clearly wrong, since the two runs have different vocabularies, but I'm not sure which direction to take for a fix. Are we reasonably close to being able to support multiple tokenizers in the same exp_dir, such that we should fix this, or are we far enough away that we should simply add asserts to make sure this fails loudly?
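
To make the failure mode concrete, here's a minimal sketch (using the modern transformers package as a stand-in for the pytorch_transformers version jiant pins; this is not jiant code):

```python
# Minimal illustration of the mismatch: the same sentence maps to different
# token ids under the cased and uncased vocabularies, so indices cached for
# one tokenizer are nonsense when consumed by a model built for the other.
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased")
uncased = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Loading the CommitmentBank data."
print(cased.encode(text))    # ids only meaningful against the cased vocab
print(uncased.encode(text))  # entirely different ids for the same text

# The two vocab sizes are close enough that reusing one run's cached indices
# with the other run's embeddings rarely triggers an out-of-range error;
# the model just silently consumes wrong indices.
```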

I'll convert this to a normal bug issue once I have a better sense of the problem.

jeswan commented 4 years ago

Comment by iftenney Monday Jul 22, 2019 at 20:57 GMT


I don't know how far we are from supporting this, but the last time I looked into it there were a lot of places where we assumed a single tokenization, and I had put in some asserts to make sure that was the case.

jeswan commented 4 years ago

Comment by pruksmhc Tuesday Jul 23, 2019 at 03:14 GMT


I believe we are pretty close to supporting this (I can see it being as simple as changing the file names in the cache directories to include the tokenizer name). However, the question is whether we should support it, since the way we've been using exp_name is with each exp_dir having one type of tokenization.
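
For reference, the file-name change could be as small as this (illustrative names only, not jiant's actual variables):

```python
import os

# Hypothetical sketch: key each preproc cache file by tokenizer name, so
# runs with different tokenizers never collide in the same exp_dir.
def preproc_path(preproc_dir, task_name, split, tokenizer_name):
    # e.g. .../preproc/commitbank__train_data.bert-base-uncased
    return os.path.join(preproc_dir, f"{task_name}__{split}_data.{tokenizer_name}")
```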

jeswan commented 4 years ago

Comment by sleepinyourhat Tuesday Jul 23, 2019 at 15:47 GMT


Yep, this seems to be causing unexpectedly low performance in some runs. I'm guessing this is because many of the big transformer models use the same vocab size, so you'll never hit index-out-of-range errors. Instead, you'll just use the wrong vocabulary and feed nonsense indices into the model.

Does anyone know this code well enough to figure out how to add some asserts?

jeswan commented 4 years ago

Comment by pruksmhc Tuesday Jul 23, 2019 at 16:19 GMT


I'll take a crack at this. So, just to be clear: we're enforcing our paradigm of one tokenization per experiment directory?

jeswan commented 4 years ago

Comment by sleepinyourhat Tuesday Jul 23, 2019 at 16:59 GMT


Up to you. If it's easy to allow multiple, go for it. Otherwise, just add some asserts.

jeswan commented 4 years ago

Comment by iftenney Friday Jul 26, 2019 at 14:11 GMT


Posted this on #866, but this seems like a better place for it:

What's the rationale for wanting different tokenizations within the same exp_dir? I thought exp_dir existed to allow sharing of preprocessed data, but that all assumes a particular tokenization anyway, so it seems like there's no benefit if we relax that constraint.

The scenario initially described (do a run with tokenizer A, then re-use the exp_dir with tokenizer B) seems like a case for throwing a hard error.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jul 26, 2019 at 15:39 GMT


The common-sense motivation for exp_dir, and the one that we give in defaults.conf, is to group runs belonging to a single experimental setup. That'll usually involve the same code version, the same evaluation data, and at least roughly comparable models. But it's pretty normal to run experiments where you swap out input layers.

So, we either need to support that, which would mean storing multiple copies of each index file when using multiple tokenizers, or make it really clear that we don't support it.

jeswan commented 4 years ago

Comment by W4ngatang Friday Jul 26, 2019 at 21:14 GMT


My understanding of exp_dir is what Ian said, though it's probably not named or explained clearly. Maybe we should rename exp_dir to something like preproc_dir or data_cache_dir and otherwise get rid of exp_dir.

That still seems pretty hairy, though, with a non-fixed vocabulary (e.g., using GloVe vectors rather than something like BERT). We would likely want to recompute the vocab for task B after experimenting on task A, but then if we want to go back to task A, we'd need to recompute it again...

The best option seems to be to just check the preproc/vocab that already exists in a (renamed) exp_dir and fail if it doesn't match the tokenization we're using.
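
A minimal sketch of that check (hypothetical helper and marker-file scheme, not jiant's actual API): record which tokenizer first populated the exp_dir, and fail fast when a later run specifies a different one.

```python
import os

def check_tokenizer_matches(exp_dir, tokenizer_name):
    """Fail fast if this exp_dir was preprocessed with a different tokenizer."""
    marker = os.path.join(exp_dir, "tokenizer_name.txt")
    if os.path.exists(marker):
        with open(marker) as f:
            existing = f.read().strip()
        assert existing == tokenizer_name, (
            f"exp_dir {exp_dir} was preprocessed with '{existing}', but this "
            f"run uses '{tokenizer_name}'. Use a fresh exp_dir or delete the "
            "cached preprocessing."
        )
    else:
        # First run in this exp_dir: record the tokenizer for later checks.
        with open(marker, "w") as f:
            f.write(tokenizer_name)
```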

jeswan commented 4 years ago

Comment by pruksmhc Wednesday Jul 31, 2019 at 13:43 GMT


I think it's fine to keep one tokenization per exp_dir without renaming anything, but we should make this clearer in the documentation as well as with assertions.