Closed: jeswan closed this PR 4 years ago
Comment by pep8speaks Saturday Apr 11, 2020 at 03:17 GMT
Hello @pruksmhc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
jiant/models.py:
- Line 177:101: E501 line too long (103 > 100 characters)
- Line 706:17: W291 trailing whitespace
- Line 1215:80: W291 trailing whitespace
- Line 1216:41: W291 trailing whitespace
jiant/modules/simple_modules.py:
- Line 201:101: E501 line too long (103 > 100 characters)
jiant/tasks/tasks.py:
- Line 3779:101: E501 line too long (115 > 100 characters)
- Line 3783:101: E501 line too long (102 > 100 characters)
- Line 3819:101: E501 line too long (101 > 100 characters)
- Line 3820:101: E501 line too long (102 > 100 characters)
- Line 3821:80: W291 trailing whitespace
- Line 3883:21: E722 do not use bare 'except'
You can repair most issues by installing black and running: `black -l 100 ./*`. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.
Comment by sleepinyourhat Saturday Apr 11, 2020 at 15:56 GMT
@pruksmhc and I just got off a call, and it seems like there's a pretty deep problem here:
Comment by sleepinyourhat Saturday Apr 11, 2020 at 16:00 GMT
Follow-up about this PR: It may make sense to keep this PR more-or-less as is for now. If we don't add the pooler as a universal option, but we can add it here, that may help speed up the rate at which the SOP classifier is learned. The one thing we shouldn't do is initialize a new pooling layer anywhere: The layer is ~4096x4096, and that's too many parameters to try to train from scratch during intermediate training.
Comment by pruksmhc Saturday Apr 11, 2020 at 16:24 GMT
On your update, does this mean keeping it separate for now as a PairClassificationTask? Even though it is pair classification, right now the head is different enough from other pair classification tasks that even if we have it inherit from PairClassificationTask, we would still have to add extra logic (as per the last PR I put up on SOP) that is specific to this task.
Comment by HaokunLiu Saturday Apr 11, 2020 at 17:59 GMT
> @pruksmhc and I just got off a call, and it seems like there's a pretty deep problem here:
- For whole-input classification tasks, ALBERT passes the first hidden state (CLS) of the last layer through a pretrained Tanh layer before passing it to any task classifier, including the final classifier for SOP. That Tanh layer is not used for other types of tasks, like MLM or SQuAD, where separate outputs are needed for each token position. AFAIK, we don't generally implement or import that extra pretrained Tanh 'pooled_output' layer, so our implementations for MNLI/SST/etc. are incorrect.
- This should have been handled in #990 (@HaokunLiu), but it seems like nobody caught it. This is understandable—I couldn't find any mention of this setup in the ALBERT paper, but it does seem to be there if you follow the code, and it's referenced in some discussion on a Transformers issue (If you want to chase this down yourself, start here: https://github.com/google-research/albert/search?q=get_pooled_output&unscoped_q=get_pooled_output huggingface/transformers#2671). @HaokunLiu: Could you look into this a bit more fully and make an issue for this pooling layer?
- It'd be best to fix this, but it would probably mean a real non-trivial refactor of our Huggingface interface.
- In the meantime, we should figure out how much this hurts us. @pruksmhc @phu-pmh @HaokunLiu: Do any of you have a minute to do some comparisons now between our ALBERT and the ALBERT paper on some standard classification tasks? We'd have to make sure we're comparing test vs. test or dev vs. dev. If it doesn't hurt us much, it may be okay to stick with our current setup for now, as long as we clearly warn users.
- Turning back to this PR, it doesn't look like the HuggingFace version of ALBERT gives us access to the final classifier layer for SOP. This means that we can't use SOP out of the box—it's only usable after some training. This is different from MLM, where we automatically load pretrained weights, so the model can do MLM out of the box. Mind updating this PR to document this very clearly for potential users?
- Also for this PR, the pretty clear takeaway is that SOP should be implemented the same way as other pair classification tasks. We need a new data loader, but we don't need custom model code. Please refactor.
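For reference, the pooling step described in the first bullet can be sketched in isolation. This is a minimal illustration, not jiant's actual code; the tensor shapes and variable names are assumptions:

```python
import torch
import torch.nn as nn

hidden_size = 16                                     # toy size; ALBERT-xxlarge uses 4096
last_hidden_state = torch.randn(2, 7, hidden_size)   # [batch, seq_len, hidden]

# ALBERT's 'pooled_output': the first hidden state ([CLS]) of the last
# layer passed through a dense layer + Tanh. In the real model these
# dense weights are pretrained, not randomly initialized as here.
pooler_dense = nn.Linear(hidden_size, hidden_size)
pooled_output = torch.tanh(pooler_dense(last_hidden_state[:, 0]))

print(pooled_output.shape)  # torch.Size([2, 16])
```

Whole-input classifiers then consume `pooled_output`, while per-token tasks (MLM, SQuAD) read directly from `last_hidden_state`.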
Comment by sleepinyourhat Saturday Apr 11, 2020 at 18:45 GMT
@HaokunLiu Clarification Q: What are you referring to with "wider discussion from other papers"?
Comment by pruksmhc Saturday Apr 11, 2020 at 21:33 GMT
I can revert this task back to Pair classification task, but there will be some cons to that as well. We'll have to add yet another task specific if statement inside of the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat
Comment by HaokunLiu Saturday Apr 11, 2020 at 21:46 GMT
> @HaokunLiu Clarification Q: What are you referring to with "wider discussion from other papers"?
"To Tune or Not to Tune" found that the performance of a diagnostic classifier at layers 8–9 already reaches the performance at layer 12. Another paper, mainly about visualization (I don't remember the title), found that parameters in the lower layers don't change much during fine-tuning, while the higher layers are where most of the change happens.
They don't directly show that adding another pretrained linear layer at the very end won't be beneficial, but they come fairly close to that.
Comment by HaokunLiu Saturday Apr 11, 2020 at 21:47 GMT
> I can revert this task back to Pair classification task, but there will be some cons to that as well. We'll have to add yet another task specific if statement inside of the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat
Wait, I don't understand. What will the task-specific if statement do?
Comment by pruksmhc Saturday Apr 11, 2020 at 21:50 GMT
@HaokunLiu Well, right now Transformers-based models have project_before_pooling=False across the board, so we'd have to hard-code it to True for SOP only, and then load the SOP linear layer weights from Huggingface into the pooler, which is currently not done for the rest of the sentence classification tasks (and I don't think we have the bandwidth right now to run all the relevant test experiments for that). Secondly, the linear layer is followed by a Tanh activation at inference time, so we'd have to add that too, since it's not standard in the PairClassification task forward function. At that point, I'd rather just create a separate SOPClassifier (as we have now in this PR) that is well documented and pretty simple/clear, with separate build_sop() and _sop_forward functions as now.
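A minimal sketch of the separate SOP head being described. This is hypothetical; the actual SOPClassifier in the PR may be wired differently:

```python
import torch
import torch.nn as nn

class SOPClassifier(nn.Module):
    """Sketch of the SOP head: the [CLS] hidden state goes through a
    linear 'pooler' followed by Tanh, then a binary classifier. In the
    PR, the pooler weights would be loaded from the pretrained ALBERT
    checkpoint rather than randomly initialized as here."""

    def __init__(self, hidden_size: int, n_classes: int = 2):
        super().__init__()
        self.pooler = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        cls_state = sequence_output[:, 0]                # [batch, hidden]
        pooled = self.activation(self.pooler(cls_state))
        return self.classifier(pooled)                   # [batch, n_classes]
```

Keeping this as its own module avoids threading a SOP-only code path through the shared pair classification forward function.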
Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:27 GMT
> I can revert this task back to Pair classification task, but there will be some cons to that as well. We'll have to add yet another task specific if statement inside of the pair classification forward and build_pair_classification_module functions. @HaokunLiu @sleepinyourhat
We should revert the task class to inherit from the normal pair classification task iff we use the same modeling code that we use for other pair classification tasks. I don't have strong feelings either way, but I think those two decisions should go together.
Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:27 GMT
There are a bunch of comments on the old PR that are still valid here, BTW. Please go through those again too before requesting the next review.
Comment by sleepinyourhat Saturday Apr 11, 2020 at 22:29 GMT
> "To tune or not to tune" found that the performance of diagnostic classifier at 8 ~ 9 has reach the performance at 12.

That's useful precedent. @HaokunLiu Could you add a link to this in the issue for future reference?
Comment by pruksmhc Sunday Apr 12, 2020 at 00:27 GMT
Alright, let's just stick with the current implementation, since there are differences between SOP and other PairClassificationTasks as of this moment.
Comment by sleepinyourhat Sunday Apr 12, 2020 at 00:31 GMT
Roger. It is arguably a weird inconsistency, so we should only do it (use the current impl.) if we're pretty sure it'll help rather than hurt. Probably a good idea to check in with the group.
Comment by pruksmhc Sunday Apr 12, 2020 at 02:27 GMT
@sleepinyourhat I went through the last PR and think I have addressed all the comments. A lot of them were about why we override the functions for the SOP task, which I think is now pretty obvious: this SOP task inherits from Task, and many of those functions are not implemented in the Task class.
Comment by pruksmhc Sunday Apr 12, 2020 at 19:39 GMT
Alright, I'll have it inherit from the PairClassificationTask, since the main difference between it and other pair classification tasks is now in its classifier.
Comment by zphang Sunday Apr 12, 2020 at 21:38 GMT
Following up on my note on multiple sentences, here's my read of what happens in BERT/ALBERT.
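For context, SOP examples (per the ALBERT recipe) are built from two consecutive text segments: positives keep the original order, negatives swap the two segments. A minimal sketch; the function name and label convention are assumptions, and jiant's actual loader may differ:

```python
import random

def make_sop_example(seg_a: str, seg_b: str, rng: random.Random):
    """Build one SOP example from two consecutive segments.
    Label 0 = segments in original order, 1 = segments swapped
    (label convention here is arbitrary)."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 0   # in order
    return (seg_b, seg_a), 1       # swapped

rng = random.Random(0)
pair, label = make_sop_example("First segment.", "Second segment.", rng)
```

Because both positives and negatives come from the same document, the task forces the model to learn inter-sentence coherence rather than topic cues.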
Comment by zphang Sunday Apr 12, 2020 at 21:56 GMT
Besides the NSP/SOP sentences issue above, and the in-memory list of lines, I don't see any areas of concern.
Comment by pyeres Wednesday Apr 15, 2020 at 13:49 GMT
Hi @phu-pmh & @pruksmhc — I see this is a "[WIP]". What remains to be done or reviewed?
Comment by phu-pmh Wednesday Apr 15, 2020 at 14:07 GMT
> Hi @phu-pmh & @pruksmhc — I see this is a "[WIP]". What remains to be done or reviewed?
I just pushed the instructions for data generation. Other than that, nothing left to be done or reviewed, I think.
Comment by sleepinyourhat Wednesday Apr 15, 2020 at 17:01 GMT
Okay—my major concerns have been addressed, though I'd still prefer Phil do the final review.
Comment by pruksmhc Thursday Apr 16, 2020 at 15:38 GMT
@pyeres if you could take a look that would be great.
Comment by pyeres Thursday Apr 16, 2020 at 16:31 GMT
> Okay—my major concerns have been addressed, though I'd still prefer Phil do the final review.
Per conversation with @sleepinyourhat — my review will be focused on SentenceOrderTask.
Comment by pyeres Friday Apr 17, 2020 at 19:58 GMT
Capturing some notes from conversation with @pruksmhc out-of-band:
@phu-pmh, @sleepinyourhat, @zphang for visibility as contributors/reviewers on this PR — I'll plan to resume review once the docstrings and comments (mentioned above) are added.
Comment by sleepinyourhat Wednesday Apr 22, 2020 at 22:10 GMT
@pruksmhc @phu-pmh I talked through this PR with Phil earlier today, and I think it'll be ready to go with two more smallish fixes:
Try to get to these before you do anything else non-emergency on jiant, and we can get this merged in. Thanks!
Comment by pyeres Friday Apr 24, 2020 at 13:49 GMT
> Below are the runs for MLM + SOP + intermediate task (which is how we intend to use SOP)
@pruksmhc — I'm not able to locate the run results you mentioned in the PR description (maybe they're buried somewhere else in the thread?). Please repost them in the PR description.
Issue by pruksmhc Saturday Apr 11, 2020 at 03:17 GMT Originally opened as https://github.com/nyu-mll/jiant/pull/1061
pruksmhc included the following code: https://github.com/nyu-mll/jiant/pull/1061/commits