mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0
748 stars 131 forks

Training score dropped #421

Closed · PonteIneptique closed 1 year ago

PonteIneptique commented 1 year ago

As a follow-up to #420, I wanted to check what was going on.

Recap:

Between two different versions of kraken, but mostly with different versions of the dependencies, I lost 6-10 percentage points on dev.

I saw you committed a potential fix here https://github.com/mittagessen/kraken/commit/33c1875d6f438cbb38ec10ff100586bcc0cd1d81

I just launched a check experiment to see if we reach the same dev score with the fix, and thought a separate issue would be better (as ReduceLROnPlateau still crashes).

colibrisson commented 1 year ago

Could you share your training arguments?

mittagessen commented 1 year ago

Yes, please see if that optimizer return format change fixes things. Or, more importantly, whether downgrading pytorch-lightning reverses the dropped scores.

colibrisson commented 1 year ago

please see if that optimizer return format change fixes things.

Lightning documentation says that ReduceLROnPlateau requires a monitor:

# The ReduceLROnPlateau scheduler requires a monitor
def configure_optimizers(self):
    optimizer = Adam(...)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": ReduceLROnPlateau(optimizer, ...),
            "monitor": "metric_to_track",
            "frequency": "indicates how often the metric is updated"
            # If "monitor" references validation metrics, then "frequency" should be set to a
            # multiple of "trainer.check_val_every_n_epoch".
        },
    }

PonteIneptique commented 1 year ago

I opened the issue to keep track of the specific scoring issue.

Here is the command:

env/bin/ketos train -u NFD --device cuda:0 --augment -f binary -t ./manifest-A-train.txt -e manifest-A-dev.txt -B 16 --fixed-splits --lrate 1e-3 -o model-TEST --spec "[1,120,0,1 Cr4,2,32,4,2 Gn32 Cr4,2,64,1,1 Gn32 Mp4,2,4,2 Cr3,3,128,1,1 Gn32 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5]" --lag 15

This command drops significantly (7 points, ~82 vs ~89 on dev) on today's master (so your fix @mittagessen does not change things).

dstoekl commented 1 year ago

I usually use --lrate 0.0001 if -B is 1 for this type of model. If you have a batch size higher than 1, did you try increasing the lrate by sqrt(B), i.e. 0.0004 in your instance?
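
For illustration, here is the arithmetic behind that suggestion as a short Python sketch (the square-root scaling rule is dstoekl's heuristic applied to the -B 16 command above, not an official kraken recommendation):

import math

base_lr = 1e-4                               # learning rate used with -B 1
batch_size = 16                              # -B 16 in the command above
scaled_lr = base_lr * math.sqrt(batch_size)  # scale LR with the square root of the batch size
print(scaled_lr)                             # -> 0.0004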

PonteIneptique commented 1 year ago

The same command, except for reduceonplateau, reaches 7 points higher accuracy on a previous environment. I doubt it's related to anything other than either reduceonplateau (which Ben doubts is the case) or a dependency gone rogue.

dstoekl commented 1 year ago

Please try adapting the lrate to 0.0001 or 0.0004. It helped in my cases.

PonteIneptique commented 1 year ago

@dstoekl The same command with the same data yields different results multiple times, so I would bet this has nothing to do with the LR (and if I remember well, I actually tested lower LRs in the first tries, before I "debugged" it down to a deps/version issue).

@mittagessen Training is underway with lightning 1.7.7 and latest master, i.e. I downgraded below 1.8; we'll see if that fixes it.

colibrisson commented 1 year ago

Training is underway with lightning 1.7.7 and latest master.

Does ReduceLROnPlateau work with PL 1.7.7?

PonteIneptique commented 1 year ago

Does ReduceLROnPlateau work with PL 1.7.7 ?

I wanted to try first without it, to see if the issue was indeed PL or the fact that I was not using --schedule reducelronplateau.

PonteIneptique commented 1 year ago

I can confirm that ROP does not work on 1.7.7 with latest master. But latest master does not work with 1.7.5 either, which worked with Kraken 4.1.2.

PonteIneptique commented 1 year ago

FYI, I went back to using 4.1.2 with pytorch-lightning 1.8.3. ReduceOnPlateau works there, while neither 1.7.7 nor 1.8.3 worked with Kraken on master / 4.2.0.

I am testing this to rule out PyTorch-Lightning being the issue.

My new hypothesis:

PonteIneptique commented 1 year ago

Ok, I can confidently say that it is neither --schedule reduceonplateau nor pytorch-lightning, but something in Kraken between versions, or another dependency.

On 4.1.2, with PL 1.8.3, I reach 0.86 at epoch 6 (so ROP could not have been activated yet).

PonteIneptique commented 1 year ago

The only difference I can see that could affect training / recognition is:

I cannot see anything else in the code that would relate to this. I feel like maybe whatever is hurting the ROP might also be hurting training, and fixing one will fix the other? (Fingers crossed?)

PonteIneptique commented 1 year ago

So, to recap:

Kraken   PL      ReduceOP   Score
Master   1.8.4   Crash      82
Master   1.7.7   Crash      82
4.1.2    1.8.4   Works      89
4.1.2    1.7.7   Works      89

PonteIneptique commented 1 year ago

So ROP does not seem to be tied to the issue: I "fixed" it and training still seems to behave weirdly (any training on 4.1.2 reaches 80+ acc at epoch 5; it went up and then dived at epoch 6 on master with the ROP fix, cf. https://github.com/mittagessen/kraken/issues/420#issuecomment-1407754951).

colibrisson commented 1 year ago

I guess the drop at epoch 6 is due to the scheduler kicking in. By default, rop_patience is set to 5, so if you only monitor the loss after each epoch (which IMO doesn't make sense for large datasets), the scheduler has to wait for 5 epochs before actually reducing the LR.
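
As a rough illustration of that patience behaviour, here is a minimal sketch using plain PyTorch's ReduceLROnPlateau (an assumption about the mechanism, not kraken's actual wiring): with patience=5 and one validation check per epoch, the learning rate is only reduced after five consecutive epochs without improvement in the monitored metric.

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# mode='max' because the monitored quantity is an accuracy, not a loss
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=5)

# illustrative per-epoch validation accuracies: improvement stalls after epoch 3
for val_accuracy in [0.70, 0.75, 0.79, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78]:
    scheduler.step(val_accuracy)
    print(optimizer.param_groups[0]['lr'])  # stays at 1e-3 until patience is exhausted, then drops to 1e-4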

PonteIneptique commented 1 year ago

I guess the drop at epoch 6 is due to the scheduler kicking in.

But that would not make sense, would it? Or it would indicate that ROP itself is broken? Because acc was rising for the first five epochs, but then dropped by 2 points (from what I remember, epoch 5 was 79% and epoch 6 was 77%).

colibrisson commented 1 year ago

Maybe you got stuck in a saddle point after the scheduler kicked in. I find it difficult to troubleshoot this issue without looking at the loss and LR graphs.


PonteIneptique commented 1 year ago

AFAIK, schedulers monitor accuracy and not loss in Kraken, which would mean it had no reason to kick in, as accuracy was not on a downward slope, no?

PonteIneptique commented 1 year ago

This is the accuracy plot; I am running another, longer test on the fixed ROP right now.

[accuracy plot]

Top models are 4.1.2 with different settings.

PonteIneptique commented 1 year ago

(I have started monitoring the loss to compare between the two.)

PonteIneptique commented 1 year ago

I'll provide both Arrow files soon. I don't think I can go much further right now, not least because of the electricity bill I am going to get from testing this.

One thing I have not checked is whether the data are somehow treated differently despite using the same Arrow file (empty lines? encoding?).

PonteIneptique commented 1 year ago

@mittagessen I am available for a call if you wanna talk about this ;)

mittagessen commented 1 year ago

Thanks for the work. I've got the French state paying the electricity bill, so I'll run some tests later. Can you send me the dataset? It would make it easier to build a test case without me having to find some training data, as my collection got nuked.

The ROP only works on the validation metric (accuracy in this case); the loss is completely ignored. You could plot it, but it usually isn't terribly indicative of anything (that's why I haven't spent the time to get it printed on the progress bar, because PTL makes that weirdly difficult).
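
For context, a minimal sketch of what that wiring might look like on the Lightning side (assumed names, in particular the 'val_accuracy' metric key; not kraken's actual code): the scheduler is configured with mode='max' and monitors the validation accuracy, so the training loss never enters the decision.

from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_optimizers(self):
    optimizer = Adam(self.parameters(), lr=1e-3)
    return {
        'optimizer': optimizer,
        'lr_scheduler': {
            # maximize the monitored metric, since it is an accuracy
            'scheduler': ReduceLROnPlateau(optimizer, mode='max', patience=5),
            'monitor': 'val_accuracy',  # hypothetical metric name
        },
    }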

PonteIneptique commented 1 year ago

It's currently being uploaded :) I am on a research trip in Poitiers, which makes the connection quite... slow :)

mittagessen commented 1 year ago

Okidokey. I'll fix the ROP issue in the meanwhile.

PonteIneptique commented 1 year ago

Have fun! https://drive.google.com/file/d/1w-OUa3W9uNCAYbsHhHoDff03vWXSoP8i/view?usp=sharing

mittagessen commented 1 year ago

Thanks. I've got a theory on why stuff might be broken. We're using manual optimization (for the learning rate scheduling/warmup necessary for pretraining) from 4.2.0 but don't actually set the automatic_optimization flag in the trainer's __init__. It's possible this causes aberrant behavior like the optimizer being called twice or something.
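
A minimal sketch of the flag being discussed, assuming a generic pytorch-lightning module (the class and helper names here are illustrative, not kraken's actual ones): when a module steps its optimizer manually, automatic_optimization has to be switched off in __init__, otherwise Lightning may also run its own optimization loop on top.

import pytorch_lightning as pl
import torch

class RecognitionModule(pl.LightningModule):     # hypothetical module name
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False      # required when stepping the optimizer manually
        self.net = torch.nn.Linear(8, 8)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.manual_backward(loss)               # instead of a plain loss.backward()
        opt.step()

    def configure_optimizers(self):
        return torch.optim.Adam(self.net.parameters(), lr=1e-3)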

PonteIneptique commented 1 year ago

Keep us updated :)

mittagessen commented 1 year ago

OK, I can reproduce the drop somewhere between 4.1.2 and current master (even with automatic_optimization set to false). That's at least something.

PonteIneptique commented 1 year ago

Thank you for confirming this, I was starting to doubt myself :)

mittagessen commented 1 year ago

Unfortunately it means bisecting through the tree backwards to find the regression. And I suspect it is in the large refactoring commit for the pretraining.

PonteIneptique commented 1 year ago

Yup... The good thing is that, from what I saw, it is detectable with only a few epochs (4 or 5).

mittagessen commented 1 year ago

Found the bad commit (ecb47081d64eb42fdb66ce344f26576ed54ab480). Unfortunately, it is the large pretraining merge one.

mittagessen commented 1 year ago

Found the source of the error. Empty lines in binary datasets don't get properly filtered in master but they are in 4.1.2. I'll push a fix tomorrow.

PonteIneptique commented 1 year ago

Woohoo! (For two reasons: one, you found it, and two, I had started feeling like it might be data-related... :) )

Can I ask if you could publish a release after that? :D

mittagessen commented 1 year ago

Yes, it's one of the last blocking regressions for a new release. I'm writing slow training tests to catch stuff like this and will then tag one.

mittagessen commented 1 year ago

BTW you triggered this bug because your binary dataset contains a lot of lines that are transcribed as a single whitespace. These do get included in the compilation process (because they are not completely empty) but get squashed to empty strings by the default text processors in the dataset (and should therefore be filtered out). Adding 180 lines without good labels just breaks the model training.
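
For readers hitting the same thing, a minimal sketch of the filtering described above (an illustration of the idea, not kraken's actual dataset code): a line whose transcription collapses to an empty string after text processing carries no usable label and should be dropped before training.

def filter_unlabelled_lines(lines, text_processor=str.strip):
    """Drop lines whose processed transcription is empty (e.g. a single whitespace)."""
    kept = []
    for line in lines:                           # hypothetical record layout: {'text': ...}
        if text_processor(line['text']):
            kept.append(line)
    return kept

samples = [{'text': 'in principio'}, {'text': ' '}, {'text': ''}]
print(filter_unlabelled_lines(samples))          # -> [{'text': 'in principio'}]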