mlfoundations / open_clip

An open source implementation of CLIP.

oscillations in the loss #822

Open nicolas-dufour opened 5 months ago

nicolas-dufour commented 5 months ago

Hi! I see oscillations in the loss for the cc3m model. Using the same webdataset framework, I get oscillations too. In my case it's more annoying because the oscillation amplitude is greater than the decrease of the loss per epoch.

Do you know what could cause this behaviour? From my experiments, it seems to be linked to webdataset, since a traditional dataloader doesn't suffer from this issue. In my case, the period of the oscillation equals the number of steps per epoch (on cc12m).

Thanks for the help!
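
One way to put numbers on the two quantities compared above (oscillation amplitude versus loss decrease per epoch) is to fold the logged loss curve by epoch. The following is a minimal sketch, assuming the losses are available as a plain list with one value per step; the function name `epoch_oscillation_report` and its arguments are hypothetical and not part of open_clip or webdataset.

```python
def epoch_oscillation_report(losses, steps_per_epoch):
    """Compare the within-epoch oscillation to the epoch-to-epoch decrease.

    Returns (amplitude, decrease_per_epoch), where amplitude is the
    peak-to-peak size of the average within-epoch loss profile and
    decrease_per_epoch is the average drop of the epoch mean loss.
    """
    n_epochs = len(losses) // steps_per_epoch
    if n_epochs < 2:
        raise ValueError("need at least two full epochs of losses")

    # Keep only whole epochs and split the curve into one list per epoch.
    losses = losses[: n_epochs * steps_per_epoch]
    epochs = [
        losses[e * steps_per_epoch : (e + 1) * steps_per_epoch]
        for e in range(n_epochs)
    ]
    epoch_means = [sum(ep) / steps_per_epoch for ep in epochs]

    # Subtract each epoch's mean so the slow downward trend does not mask
    # the shape of the loss within an epoch, then average across epochs.
    profile = [
        sum(ep[pos] - mean for ep, mean in zip(epochs, epoch_means)) / n_epochs
        for pos in range(steps_per_epoch)
    ]

    amplitude = max(profile) - min(profile)
    decrease_per_epoch = (epoch_means[0] - epoch_means[-1]) / (n_epochs - 1)
    return amplitude, decrease_per_epoch


if __name__ == "__main__":
    # Synthetic curve: slow decrease plus a pattern that repeats every epoch.
    pattern = [0.0, 0.5, 0.0, -0.5]
    curve = [base + p for base in (5.0, 4.9, 4.8) for p in pattern]
    print(epoch_oscillation_report(curve, steps_per_epoch=4))
```

If the reported amplitude is much larger than the per-epoch decrease when folding at exactly the number of steps per epoch, that is consistent with an oscillation whose period is one epoch.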

rom1504 commented 5 months ago

Did you shuffle the dataset?

nicolas-dufour commented 5 months ago

Hey @rom1504, yes, I'm shuffling both shards and samples. I use the following settings, which are the open_clip defaults:

        shard_shuffle_size=2000,
        shard_shuffle_initial=500,
        sample_shuffle_size=5000,
        sample_shuffle_initial=1000,

Thanks for the help!
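
For readers unfamiliar with these settings: a shuffle with a size and an initial fill is a bounded buffer, so it mixes a stream only locally. The following is a simplified sketch of that general idea, not the actual webdataset or open_clip code; the function `buffered_shuffle` and its parameters are hypothetical names chosen for illustration.

```python
import random

_MISSING = object()  # marks "no more items" when pulling from the stream


def _pop_random(buf, rng):
    """Remove and return a uniformly random element of buf in O(1)."""
    k = rng.randrange(len(buf))
    buf[k], buf[-1] = buf[-1], buf[k]
    return buf.pop()


def buffered_shuffle(items, bufsize, initial, seed=0):
    """Yield items in a locally shuffled order using a bounded buffer.

    Emission starts once the buffer holds `initial` items; while the
    buffer is below `bufsize`, one extra item is pulled per step so that
    the buffer keeps growing toward `bufsize`.
    """
    rng = random.Random(seed)
    stream = iter(items)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) < bufsize:
            extra = next(stream, _MISSING)
            if extra is not _MISSING:
                buf.append(extra)
        if len(buf) >= initial:
            yield _pop_random(buf, rng)
    # The input is exhausted: drain what is left in random order.
    while buf:
        yield _pop_random(buf, rng)


if __name__ == "__main__":
    out = list(buffered_shuffle(range(20), bufsize=5, initial=2))
    print(out)
```

In this sketch, the item emitted at output position `t` always comes from the first `t + bufsize` input items, so a sample can be delayed arbitrarily but can never jump far ahead of its original position. Whether that limited mixing explains the oscillation reported here is not established in this thread.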