nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models

[CLOSED] Adding Sentence Order Prediction #1061

Closed jeswan closed 4 years ago

jeswan commented 4 years ago

Issue by pruksmhc Saturday Apr 11, 2020 at 03:17 GMT Originally opened as https://github.com/nyu-mll/jiant/pull/1061


Adding a Sentence Order Prediction (SOP) task for ALBERT. What this version supports: the ALBERT embedder. Below are the runs for MLM + SOP + intermediate task (which is how we intend to use SOP). The results (on ALBERT) are:

| Experiment description | Intermediate task performance | SOP performance |
| --- | --- | --- |
| SST + SOP | 0.85 | 0.95 |
| QQP + SOP | 0.95 | 0.94 |

pruksmhc included the following code: https://github.com/nyu-mll/jiant/pull/1061/commits

jeswan commented 4 years ago

Comment by pep8speaks Saturday Apr 11, 2020 at 03:17 GMT


Hello @pruksmhc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

• Line 177:101: E501 line too long (103 > 100 characters)
• Line 706:17: W291 trailing whitespace
• Line 1215:80: W291 trailing whitespace
• Line 1216:41: W291 trailing whitespace

• Line 201:101: E501 line too long (103 > 100 characters)

• Line 3779:101: E501 line too long (115 > 100 characters)
• Line 3783:101: E501 line too long (102 > 100 characters)
• Line 3819:101: E501 line too long (101 > 100 characters)
• Line 3820:101: E501 line too long (102 > 100 characters)
• Line 3821:80: W291 trailing whitespace
• Line 3883:21: E722 do not use bare 'except'

You can repair most issues by installing black and running `black -l 100 ./*`. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2020-04-23 18:23:36 UTC
jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 15:56 GMT


@pruksmhc and I just got off a call, and it seems like there's a pretty deep problem here; the full list of issues is quoted in @HaokunLiu's reply below.

jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 16:00 GMT


Follow-up about this PR: It may make sense to keep this PR more-or-less as is for now. Even if we don't add the pooler as a universal option, we can add it here, and that may help speed up how quickly the SOP classifier is learned. The one thing we shouldn't do is initialize a new pooling layer anywhere: the layer is ~4096x4096, and that's too many parameters to try to train from scratch during intermediate training.
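For a rough sense of scale, a back-of-the-envelope check (assuming the 4096 figure corresponds to the ALBERT-xxlarge hidden size, which is my assumption rather than something stated in this thread):

```python
# Hypothetical sizing check; hidden_size = 4096 is assumed (ALBERT-xxlarge).
hidden_size = 4096
pooler_params = hidden_size * hidden_size + hidden_size  # weight matrix + bias
print(f"pooler parameters: {pooler_params:,}")  # ~16.8M parameters
```

Reusing the pretrained pooler weights avoids training those ~16.8M parameters from scratch during intermediate training.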

jeswan commented 4 years ago

Comment by pruksmhc Saturday Apr 11, 2020 at 16:24 GMT


Given your update, does this mean keeping it separate from PairClassificationTask for now? Even though it is pair classification, the head is currently different enough from other pair classification tasks that, even if we have it inherit from PairClassificationTask, we would still have to add extra logic (as in the last PR I put up on SOP) that is specific to this task.

jeswan commented 4 years ago

Comment by HaokunLiu Saturday Apr 11, 2020 at 17:59 GMT


> @pruksmhc and I just got off a call, and it seems like there's a pretty deep problem here:
>
> • For whole-input classification tasks, ALBERT passes the first hidden state (CLS) of the last layer through a pretrained Tanh layer before passing it to any task classifier, including the final classifier for SOP. That Tanh layer is not used for other types of tasks, like MLM or SQuAD, where separate outputs are needed for each token position. AFAIK, we don't generally implement or import that extra pretrained Tanh 'pooled_output' layer, so our implementations for MNLI/SST/etc. are incorrect.
> • This should have been handled in #990 (@HaokunLiu), but it seems like nobody caught it. This is understandable—I couldn't find any mention of this setup in the ALBERT paper, but it does seem to be there if you follow the code, and it's referenced in some discussion on a Transformers issue (if you want to chase this down yourself, start here: https://github.com/google-research/albert/search?q=get_pooled_output&unscoped_q=get_pooled_output and huggingface/transformers#2671). @HaokunLiu: Could you look into this a bit more fully and make an issue for this pooling layer?
> • It'd be best to fix this, but it would probably mean a real, non-trivial refactor of our Huggingface interface.
> • In the meantime, we should figure out how much this hurts us. @pruksmhc @phu-pmh @HaokunLiu: Do any of you have a minute to do some comparisons now between our ALBERT and the ALBERT paper on some standard classification tasks? We'd have to make sure we're doing test vs. test or dev vs. dev. If it doesn't hurt us much, it may be okay to stick with our current setup for now, as long as we clearly warn users.
> • Turning back to this PR, it doesn't look like the HuggingFace version of ALBERT gives us access to the final classifier layer for SOP. This means that we can't use SOP out of the box—it's only usable after some training. This is different from MLM, where we automatically load pretrained weights, so the model can do MLM out of the box. Mind updating this PR to document this very clearly for potential users?
> • Also for this PR, the pretty clear takeaway is that SOP should be implemented the same way as other pair classification tasks. We need a new data loader, but we don't need custom model code. Please refactor.
  1. About the linear layer in pooling: yes, that seems to be the case. They have this pooling layer on top of the first hidden state out of the transformer, but we didn't use it in our implementation (see the sketch at the end of this comment).
  2. Having a single pretrained linear layer before the classifier seems like an odd but somewhat understandable design. Maybe they think even this layer can benefit from pretraining.
  3. Our results do not fall short of theirs, and are sometimes even better. Also, according to wider discussion in other papers, parameters near the top of the network do not transfer as easily from pretraining to the target task. My guess is that this does not hurt us in any significant way.
  4. If possible, I would be inclined to stay with our current setting.
  5. About this SOP PR itself: I agree SOP is not different from sentence pair classification. The pooling layer is not something specific to SOP; it comes from how the ALBERT authors think classification should be done.
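For reference, here is a minimal sketch of the pooling step being discussed, based on my reading of the HuggingFace `transformers` ALBERT implementation (the `pooler` / `pooler_activation` attribute names come from that library; treat the exact wiring, and the use of a recent `transformers` version, as assumptions rather than jiant code):

```python
import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")  # includes the pretrained pooler weights
model.eval()

inputs = tokenizer("First sentence.", "Second sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token hidden states: used for token-level tasks (MLM, SQuAD); no pooler involved.
sequence_output = outputs[0]                # (batch, seq_len, hidden_size)

# Pooled output: the first ([CLS]) hidden state passed through the pretrained
# Linear + Tanh pooler. This is what ALBERT feeds to whole-input classifiers,
# including the SOP head.
cls_hidden = sequence_output[:, 0]          # (batch, hidden_size)
pooled = model.pooler_activation(model.pooler(cls_hidden))

# The model also returns this pooled output directly as its second output.
print(torch.allclose(pooled, outputs[1]))   # True
```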
jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 18:45 GMT


@HaokunLiu Clarification Q: What are you referring to with "wider discussion from other papers"?

jeswan commented 4 years ago

Comment by pruksmhc Saturday Apr 11, 2020 at 21:33 GMT


I can revert this task back to a PairClassificationTask, but there will be some cons to that as well. We'll have to add yet another task-specific if statement inside the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat

jeswan commented 4 years ago

Comment by HaokunLiu Saturday Apr 11, 2020 at 21:46 GMT


> @HaokunLiu Clarification Q: What are you referring to with "wider discussion from other papers"?

"To tune or not to tune" found that the performance of diagnostic classifier at 8 ~ 9 has reach the performance at 12. Some mainly visualization paper I don't remember the title, found that parameter in the lower layers doesn't change much in the finetuning, while higher layers are where most of the changes happens.

They do not directly suggest adding another pretrained linear layer at the very end will not be beneficial. But it is fairly close to that.

jeswan commented 4 years ago

Comment by HaokunLiu Saturday Apr 11, 2020 at 21:47 GMT


> I can revert this task back to a PairClassificationTask, but there will be some cons to that as well. We'll have to add yet another task-specific if statement inside the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat

Wait, I don't understand. What will the task-specific if statement do?

jeswan commented 4 years ago

Comment by pruksmhc Saturday Apr 11, 2020 at 21:50 GMT


@HaokunLiu Well, right now all Transformers-based models have project_before_pooling=False, so we'd have to hard-code it to True for SOP only, and then load the SOP linear-layer weights from HuggingFace into the pooler, which is currently not done for the rest of the sentence classification tasks (and I don't think we have the bandwidth right now to run all the relevant test experiments for that). Secondly, the linear layer is followed by a Tanh activation at inference time, so we'd have to add that too, since it isn't standard in the PairClassificationTask forward function. At that point, I'd rather just keep a separate, well-documented and fairly simple SOPClassifier (as in this PR now), along with separate build_sop() and _sop_forward functions. A rough sketch of what I have in mind is below.
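For illustration only, a minimal sketch of the kind of separate SOP head I mean, reusing the pretrained pooler weights from a HuggingFace AlbertModel rather than initializing a new projection from scratch. The class and argument names here are hypothetical, not the actual jiant code:

```python
import torch.nn as nn


class SOPClassifier(nn.Module):
    """Sketch of an SOP head: pretrained pooler (Linear + Tanh) + fresh 2-way classifier.

    `pretrained_pooler` is expected to be the Linear layer taken from a pretrained
    AlbertModel (e.g. `albert_model.pooler`), so only the small output layer is
    trained from scratch.
    """

    def __init__(self, pretrained_pooler: nn.Linear, hidden_size: int):
        super().__init__()
        self.pooler = pretrained_pooler               # reuse pretrained weights, don't re-init
        self.activation = nn.Tanh()                   # ALBERT applies Tanh after the pooler
        self.classifier = nn.Linear(hidden_size, 2)   # in-order vs. swapped

    def forward(self, sequence_output):
        # sequence_output: (batch, seq_len, hidden_size); take the [CLS] position.
        pooled = self.activation(self.pooler(sequence_output[:, 0]))
        return self.classifier(pooled)                # (batch, 2) logits
```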

jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:27 GMT


> I can revert this task back to a PairClassificationTask, but there will be some cons to that as well. We'll have to add yet another task-specific if statement inside the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat

We should revert the task class to inherit from the normal pair classification task if and only if we use the same modeling code that we do for other pair classification tasks. I don't have strong feelings either way, but I think those two decisions should go together.

jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:27 GMT


There are a bunch of comments on the old PR that are still valid here, BTW. Please go through those again too before asking for review next.

jeswan commented 4 years ago

Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:29 GMT


"To tune or not to tune" found that the performance of diagnostic classifier at 8 ~ 9 has reach the performance at 12. That's useful precedent. @HaokunLiu Could you add a link to this in the issue for future reference?

jeswan commented 4 years ago

Comment by pruksmhc Sunday Apr 12, 2020 at 00:27 GMT


Alright, let's just stick with the current implementation, since there are differences between SOP and other PairClassificationTasks as of this moment.

jeswan commented 4 years ago

Comment by sleepinyourhat Sunday Apr 12, 2020 at 00:31 GMT


Roger. It is arguably a weird inconsistency, so we should only do it (use the current impl.) if we're pretty sure it'll help rather than hurt. Probably a good idea to check in with the group.

jeswan commented 4 years ago

Comment by pruksmhc Sunday Apr 12, 2020 at 02:27 GMT


@sleepinyourhat I went through the last PR and think I have addressed all the comments. A lot of them asked why we override the functions for the SOP task here; I think the reason is now pretty obvious: this SOP task inherits from Task, and a lot of the functions are not implemented in the Task class.

jeswan commented 4 years ago

Comment by pruksmhc Sunday Apr 12, 2020 at 02:28 GMT


We should have the validation runs tomorrow.

jeswan commented 4 years ago

Comment by pruksmhc Sunday Apr 12, 2020 at 19:39 GMT


Alright, I'll have it inherit from PairClassificationTask, since the main difference between it and other pair classification tasks is now in its classifier.

jeswan commented 4 years ago

Comment by zphang Sunday Apr 12, 2020 at 21:38 GMT


Following up on my note on multiple sentences, here's my read of what happens in BERT/ALBERT.

  1. BERT and ALBERT seem to share the same/similar preprocessing code (see the BERT and ALBERT code links).
  2. The original input files consist of one sentence per line, with an empty line delineating a document boundary. (This prevents taking the last sentence of one document and the first sentence of the next document as a valid pair.)
  3. However, in constructing the training instances, they actually take multiple consecutive sentences as "sentence A" or "sentence B". Let's call these SequenceA and SequenceB for clarity. The logic is surprisingly complex, but basically they do pack multiple sentences into both SequenceA and SequenceB; see the BERT and ALBERT code links in point 1. A simplified sketch of the resulting SOP instance construction follows below.
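To make point 3 concrete, here is a simplified, hypothetical sketch of how SOP training instances could be built from input preprocessed this way (one sentence per line, blank lines between documents). It illustrates the scheme described above, not the actual BERT/ALBERT preprocessing code, which additionally packs several sentences into each segment up to a target length and handles other details:

```python
import random
from typing import Iterable, List, Tuple


def read_documents(lines: Iterable[str]) -> List[List[str]]:
    """Group one-sentence-per-line input into documents, split on blank lines."""
    docs, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                docs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        docs.append(current)
    return docs


def make_sop_instances(doc: List[str], swap_prob: float = 0.5) -> List[Tuple[str, str, int]]:
    """Build (segment_a, segment_b, label) triples from consecutive sentences in one document.

    Label convention assumed here: 0 = original order, 1 = swapped.
    For simplicity each segment is a single sentence, unlike the real preprocessing.
    """
    instances = []
    for i in range(len(doc) - 1):
        seg_a, seg_b = doc[i], doc[i + 1]
        if random.random() < swap_prob:
            instances.append((seg_b, seg_a, 1))  # swapped order
        else:
            instances.append((seg_a, seg_b, 0))  # original order
    return instances
```

Because documents are separated by blank lines, no instance pairs the last sentence of one document with the first sentence of the next, consistent with point 2 above.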
jeswan commented 4 years ago

Comment by zphang Sunday Apr 12, 2020 at 21:56 GMT


Besides the NSP/SOP sentences issue above, and the in-memory list of lines, I don't see any areas of concern.

jeswan commented 4 years ago

Comment by pyeres Wednesday Apr 15, 2020 at 13:49 GMT


Hi @phu-pmh & @pruksmhc — I see this is a "[WIP]". What remains to be done or reviewed?

jeswan commented 4 years ago

Comment by phu-pmh Wednesday Apr 15, 2020 at 14:07 GMT


> Hi @phu-pmh & @pruksmhc — I see this is a "[WIP]". What remains to be done or reviewed?

I just pushed the instructions for data generation. Other than that, nothing left to be done or reviewed, I think.

jeswan commented 4 years ago

Comment by sleepinyourhat Wednesday Apr 15, 2020 at 17:01 GMT


Okay—my major concerns have been addressed, though I'd still prefer Phil do the final review.

jeswan commented 4 years ago

Comment by pruksmhc Thursday Apr 16, 2020 at 15:38 GMT


@pyeres if you could take a look that would be great.

jeswan commented 4 years ago

Comment by pyeres Thursday Apr 16, 2020 at 16:31 GMT


> Okay—my major concerns have been addressed, though I'd still prefer Phil do the final review.

Per conversation with @sleepinyourhat — my review will be focused on SentenceOrderTask.

jeswan commented 4 years ago

Comment by pyeres Friday Apr 17, 2020 at 19:58 GMT


Capturing some notes from conversation with @pruksmhc out-of-band:

@phu-pmh, @sleepinyourhat, @zphang for visibility as contributors/reviewers on this PR — I'll plan to resume review once the docstrings and comments (mentioned above) are added.

jeswan commented 4 years ago

Comment by sleepinyourhat Wednesday Apr 22, 2020 at 22:10 GMT


@pruksmhc @phu-pmh I talked through this PR with Phil earlier today, and I think it'll be ready to go with two more smallish fixes:

Try to get to these before you do anything else non-emergency on jiant, and we can get this merged in. Thanks!

jeswan commented 4 years ago

Comment by pyeres Friday Apr 24, 2020 at 13:49 GMT


> Below are the runs for MLM + SOP + intermediate task (which is how we intend to use SOP)

@pruksmhc — I'm not able to locate the run results you mentioned in the PR description (maybe they're buried somewhere else in the thread?). Please repost them in the PR description.