Closed. PonteIneptique closed this issue 1 year ago.
Could you share your training arguments?
On Sun, Jan 29, 2023 at 10:08 AM Thibault Clérice wrote:
As a follow-up to #420 https://github.com/mittagessen/kraken/issues/420, I wanted to check what was going on.
Recap:
Between two different versions of kraken, but mostly with different versions of the deps, I lost 6-10 percentage points on dev.
I saw you committed a potential fix here 33c1875 https://github.com/mittagessen/kraken/commit/33c1875d6f438cbb38ec10ff100586bcc0cd1d81
I just launched a check experiment to see if we reach the same dev score with the fix, and thought a separate issue would be better (as ReduceLROnPlateau still crashes).
Yes, please see if that optimizer return format change fixes things. Or, more importantly, whether downgrading pytorch-lightning reverses the dropped scores.
please see if that optimizer return format change fixes things.
Lightning documentation says that ReduceLROnPlateau requires a monitor:
```python
# The ReduceLROnPlateau scheduler requires a monitor
def configure_optimizers(self):
    optimizer = Adam(...)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": ReduceLROnPlateau(optimizer, ...),
            "monitor": "metric_to_track",
            "frequency": "indicates how often the metric is updated",
            # If "monitor" references validation metrics, then "frequency" should be set to a
            # multiple of "trainer.check_val_every_n_epoch".
        },
    }
```
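For reference, a minimal self-contained sketch of that pattern in a LightningModule, assuming the validation metric is logged under the name val_accuracy (the module, metric name, and hyperparameters here are illustrative, not kraken's actual code):

```python
import torch
import pytorch_lightning as pl
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        acc = (self.layer(x).argmax(dim=1) == y).float().mean()
        # The name logged here has to match the scheduler's "monitor" key.
        self.log("val_accuracy", acc)

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                # mode="max" because accuracy should go up, not down.
                "scheduler": ReduceLROnPlateau(optimizer, mode="max", patience=5),
                "monitor": "val_accuracy",
            },
        }
```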
I opened the issue to keep track of the specific scoring issue.
Here is the command:
env/bin/ketos train -u NFD --device cuda:0 --augment -f binary -t ./manifest-A-train.txt -e manifest-A-dev.txt -B 16 --fixed-splits --lrate 1e-3 -o model-TEST --spec "[1,120,0,1 Cr4,2,32,4,2 Gn32 Cr4,2,64,1,1 Gn32 Mp4,2,4,2 Cr3,3,128,1,1 Gn32 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5]" --lag 15
This command does drop significantly (7 points, ~82 vs ~89 on dev) on master from today (so your fix, @mittagessen, does not change things).
I usually use --lrate 0.0001 with -B 1 for this type of model. If you have a batch size higher than 1, did you try to increase the lrate by sqrt(B), i.e. 0.0004 in your case?
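For what it's worth, the square-root scaling rule mentioned here is simply base_lrate * sqrt(batch_size); a quick illustrative calculation (not part of ketos):

```python
import math

def scaled_lrate(base_lrate: float, batch_size: int, base_batch_size: int = 1) -> float:
    """Square-root learning-rate scaling heuristic for larger batch sizes."""
    return base_lrate * math.sqrt(batch_size / base_batch_size)

# A base lrate of 1e-4 tuned for batch size 1, scaled to batch size 16:
print(scaled_lrate(1e-4, 16))  # 0.0004
```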
The same command, except for reduceonplateau, on a previous environment does reach 7 points higher accuracy. I doubt it's related to anything other than either reduceonplateau (which Ben doubts is the case) or a dependency gone rogue.
Please try adapting the lrate to 0.0001 or 0.0004. It helped in my cases.
@dstoekl The same command with the same data yields different results multiple times; I would bet this has nothing to do with LR (and if I remember well, I actually tested lower LRs in the first tries before I "debugged" it down to a deps/version issue).
@mittagessen Training is underway with lightning 1.7.7 and latest master. So I downgraded below 1.8; we'll see if that fixes it.
Does ReduceLROnPlateau work with PL 1.7.7?
I wanted to try without it first, to see whether the issue was indeed PL or the fact that I was not using --schedule reducelronplateau.
I can confirm that ROP does not work on 1.7.7 with latest master. But latest master does not work with 1.7.5 either, which worked with Kraken 4.1.2.
FYI, I went back to using 4.1.2 with Pytorch-lightning 1.8.3. ReduceOnPlateau works there, while neither 1.7.7 nor 1.8.3 worked with Kraken on master / 4.2.0.
I tested this to rule out PyTorch-Lightning as the issue.
My new hypothesis:
OK, I can confidently say that it is neither --schedule reduceonplateau nor pytorch-lightning, but something in Kraken between versions, or another dependency.
On 4.1.2, with PL 1.8.3, I reach .86 at epoch 6 (so ROP could not have been activated).
The only different thing I can see that could affect training / recognition is
I cannot see anything else in the code that would relate to this. I feel like maybe what's hurting the ROP might also hurt training, and fixing one will fix the other? (Fingers crossed!)
So, to recap:
Kraken | PL | ReduceOP | Score |
---|---|---|---|
Master | 1.8.4 | Crash | 82 |
Master | 1.7.7 | Crash | 82 |
4.1.2 | 1.8.4 | Works | 89 |
4.1.2 | 1.7.7 | Works | 89 |
So ROP does not seem to be tied to the issue: I "fixed" it and training still seems to behave weirdly (any training on 4.1.2 reaches 80+ acc at epoch 5; it went up and dived down at epoch 6 on master with the ROP fix, cf. https://github.com/mittagessen/kraken/issues/420#issuecomment-1407754951).
I guess the drop at epoch 6 is due to the scheduler kicking in. By default, rop_patience is set to 5, so if you only monitor the loss after each epoch (which IMO doesn't make sense for large datasets), the scheduler has to wait for 5 epochs before actually reducing the LR.
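To illustrate the patience behaviour described above, here is a plain-PyTorch sketch (not kraken code) with a made-up, stagnating metric; the LR is only reduced once the monitored value has failed to improve for more than patience epochs:

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# mode="max" because the monitored value is an accuracy-like metric.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.1, patience=5)

fake_accuracy = [0.80] * 8  # pretend the metric stops improving immediately
for epoch, acc in enumerate(fake_accuracy):
    scheduler.step(acc)
    print(epoch, optimizer.param_groups[0]["lr"])
# The LR stays at 1e-3 for the first epochs and only drops to 1e-4 once the
# metric has failed to improve for more than `patience` consecutive epochs.
```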
I guess the drop at epoch 6 is due to the scheduler kicking in.
But that would not make sense, would it? Or it would indicate that ROP itself is broken? Because acc was rising for the first five epochs, but dropped by 2 points (from what I remember, epoch 5 was 79% and epoch 6 was 77%).
Maybe you got stuck in a saddle point after the scheduler kicked in. I find it difficult to troubleshoot this issue without looking at the loss and LR graphs.
AFAIK, schedulers monitor accuracy and not loss in Kraken, which would mean it had no reason to kick in as it was not on a downward slope, no?
This is the accuracy plot; I am running another, longer test right now on the fixed ROP.
Top models are 4.1.2 with different settings.
(I have started monitoring the loss to compare between the two.)
I'll provide both arrows soon. I think I cannot really go much further right now, not least because of the electricity bill I am gonna get by testing this.
One thing I have not checked is whether the data are somehow treated differently despite using the same arrow file (empty lines? encoding?).
@mittagessen I am available for a call if you wanna talk about this ;)
Thanks for the work. I've got the French state paying for the electricity bill so I'll run some tests later. Can you send me the dataset? It would make it easier to build a test case without me having to find some training data as my collection got nuked.
The ROP only works on the validation metric (accuracy in this case); the loss is completely ignored. You could plot it, but it usually isn't terribly indicative of anything (that's why I haven't spent the time to get it printed on the progress bar; PTL makes that weirdly difficult).
It's currently being uploaded :) I am on a research trip in Poitiers, which makes the connection quite... slow :)
Okidokey. I'll fix the ROP issue in the meantime.
Thanks. I've got a theory on why stuff might be broken. We're using manual optimization (for the learning rate scheduling/warmup necessary for pretraining) from 4.2.0 but don't actually set the automatic_optimization flag in the trainer's __init__. It's possible this causes aberrant behavior like the optimizer being called twice or something.
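For context, a rough illustration of what that flag changes in PyTorch Lightning (an assumed toy module, not kraken's trainer): when a module steps its optimizer manually, automatic_optimization has to be set to False in __init__, otherwise Lightning will also run its own optimization loop on top of it.

```python
import torch
import pytorch_lightning as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)
        # Without this, Lightning keeps stepping the optimizer itself,
        # in addition to the manual calls in training_step below.
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.manual_backward(loss)
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```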
Keep us updated :)
OK, I can reproduce the drop somewhere between 4.1.2 and current master (even with automatic_optimization set to False). That's at least something.
Thank you for confirming this, I was starting to doubt myself :)
Unfortunately it means bisecting through the tree backwards to find the regression. And I suspect it is in the large refactoring commit for the pretraining.
Yup... The good thing is, from what I saw, it is detectable within only a few epochs (4 or 5).
Found the bad commit (ecb47081d64eb42fdb66ce344f26576ed54ab480). Unfortunately, it is the large pretraining merge one.
Found the source of the error. Empty lines in binary datasets don't get properly filtered in master but they are in 4.1.2. I'll push a fix tomorrow.
Woohoo! (For two reasons: one, you found it, and two, I started feeling like it might be data-related... :) )
Can I ask if you could publish a release after that? :D
Yes, it's one of the last blocking regressions for a new release. I'm writing slow training tests to catch stuff like this and then tag one.
BTW you triggered this bug because your binary dataset contains a lot of lines that are transcribed as a single whitespace. These do get included in the compilation process (because they are not completely empty) but get squashed to empty strings by the default text processors in the dataset (and should therefore be filtered out). Adding 180 lines without good labels just breaks the model training.
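For anyone hitting the same symptom before the fix lands, this is roughly the kind of filtering being described, sketched over a hypothetical list of line records with a text field (the record layout is an assumption, not kraken's actual dataset API):

```python
def filter_blank_lines(records):
    """Drop line records whose transcription is empty or whitespace-only.

    `records` is assumed to be an iterable of dicts with a "text" key;
    kraken's binary dataset compiler uses its own record format.
    """
    kept, dropped = [], 0
    for record in records:
        if record.get("text", "").strip():
            kept.append(record)
        else:
            dropped += 1
    print(f"dropped {dropped} blank or whitespace-only lines, kept {len(kept)}")
    return kept


# Example: the two middle records would be dropped before compilation.
sample = [{"text": "foo"}, {"text": " "}, {"text": ""}, {"text": "bar"}]
clean = filter_blank_lines(sample)
```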