microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License

RuntimeError! #219

Open SomnusQue opened 5 months ago

SomnusQue commented 5 months ago

I ran auto_100weight_inherit_100to75.sh and hit this problem. I believe I have set up everything the project needs, but I still get an error that I can't solve. Could somebody please help?

SomnusQue commented 5 months ago

[Image attachment]

wkcn commented 5 months ago

Hi @SomnusQue, thanks for your interest in our work!

Are you using the latest TinyCLIP code?

This is a bug that is triggered on PyTorch 2.x. We fixed it by adding this line: https://github.com/microsoft/Cream/blob/main/TinyCLIP/src/open_clip/model.py#L28

checkpoint = functools.partial(checkpoint, use_reentrant=False)
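
For readers unfamiliar with the pattern, the line above uses functools.partial to pin a keyword argument, so every later call that does not override it passes use_reentrant=False automatically. Below is a minimal sketch of that rebinding. It assumes a stand-in `checkpoint` function (hypothetical, defined here only so the snippet runs without PyTorch); in the repository the wrapped name is whatever model.py imports, which is likely PyTorch's own checkpoint function.

```python
import functools


def checkpoint(function, *args, use_reentrant=True, **kwargs):
    """Hypothetical stand-in for a checkpoint function (not the PyTorch one).

    It calls the wrapped function directly and also reports which
    use_reentrant value it received.
    """
    result = function(*args, **kwargs)
    return result, use_reentrant


def double(x):
    return 2 * x


# Before the fix: callers that do not pass use_reentrant get the default.
before = checkpoint(double, 21)

# The fix from the thread: rebind the name so use_reentrant=False is
# supplied on every call that does not override it.
checkpoint = functools.partial(checkpoint, use_reentrant=False)
after = checkpoint(double, 21)

print(before)  # (42, True)
print(after)   # (42, False)
```

Because the name is rebound once at module level, existing call sites need no changes.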

SomnusQue commented 5 months ago

OMG! The author answered my question! The code I have really doesn't contain these lines. Thanks for your patience! When was the code updated?

SomnusQue commented 5 months ago

Furthermore, is this loss normal?

[Image attachment]

wkcn commented 5 months ago

@SomnusQue I fixed the bug on Jan. 11, 2024 (https://github.com/microsoft/Cream/pull/218/files#diff-2c756c8b8b99609dee1b59ce4dcfaf773aa9afbc84e093e03e3e0de653fa0124R28).

You can visualize the loss curve in wandb. The loss is normal if it is decreasing :)

SomnusQue commented 5 months ago

Thanks for your patience! Because of our cluster setup, I can't use wandb (I think because it needs network access?), so I changed '--report-to wandb' to '--report-to tensorboard' in the .sh file. Does anything else in the code need to change?

wkcn commented 5 months ago

@SomnusQue No code change is required. Alternatively, you can set the environment variable WANDB_MODE=offline, and the wandb log will be saved locally as a file. Then run wandb sync <file path> to upload the log.
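
The offline workflow described above can be sketched roughly as follows. This is only an illustration: the helper names `offline_wandb_env` and `sync_command` are hypothetical, and the run path is a placeholder rather than a real wandb output location. Only the WANDB_MODE=offline variable and the wandb sync command come from the thread.

```python
import os
import shlex


def offline_wandb_env(base_env=None):
    """Return a copy of the environment with wandb switched to offline mode."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "offline"
    return env


def sync_command(run_path):
    """Build the `wandb sync <file path>` command used to upload a log later."""
    return ["wandb", "sync", run_path]


# Small base environment used here only to keep the example deterministic.
env = offline_wandb_env({"PATH": "/usr/bin"})
print(env["WANDB_MODE"])  # offline

# "path/to/offline-run" is a placeholder, not a real wandb output path.
print(shlex.join(sync_command("path/to/offline-run")))
```

On the cluster, the returned environment could be passed to the process that launches the training script (for example via the `env` argument of `subprocess.run`), and the sync command run later from a machine with network access.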

SomnusQue commented 5 months ago

Sorry to bother you again...

[Image attachment]

The result in TensorBoard looks like something went wrong...

[Image attachment]

This is the final epoch of my training run.

SomnusQue commented 5 months ago

[Image attachment] [Image attachment]

This is our bash file. Is there something wrong with it?

wkcn commented 5 months ago

Sorry, I have not tested TensorBoard yet.

The training data in the provided script is synthetic. It should be replaced with your real dataset by using the following arguments:

 --train-data <your yfcc_path or laion_path/> \
 --dataset-type webdataset \
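
As a rough illustration of the swap described above, the helper below rewrites a list of training arguments so that --train-data and --dataset-type take the values suggested here. The function names `set_flag` and `use_real_dataset`, the example argument list, and the shard path are all hypothetical. In particular, the assumption that the original script passes `--dataset-type synthetic` is inferred only from the statement that its data is synthetic.

```python
def set_flag(args, flag, value):
    """Return a copy of args with `flag` set to `value` (replaced or appended)."""
    out = list(args)
    if flag in out:
        idx = out.index(flag)
        # Replace the existing value, or insert one if the flag had none.
        if idx + 1 < len(out) and not out[idx + 1].startswith("--"):
            out[idx + 1] = value
        else:
            out.insert(idx + 1, value)
    else:
        out += [flag, value]
    return out


def use_real_dataset(args, train_data):
    """Point the arguments at a real webdataset instead of synthetic data."""
    args = set_flag(args, "--train-data", train_data)
    return set_flag(args, "--dataset-type", "webdataset")


# Assumed shape of the original arguments (illustrative only).
original = ["--dataset-type", "synthetic", "--report-to", "tensorboard"]

# "/path/to/your/shards" is a placeholder for your yfcc or laion path.
updated = use_real_dataset(original, "/path/to/your/shards")
print(updated)
```

The same change can of course be made by editing the two lines in the .sh file directly; the sketch only makes explicit which arguments change and which stay as they are.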

SomnusQue commented 5 months ago

I downloaded the LAION file and put it in the path '/.cache/clip/'. Is this the path I need to specify?

wkcn commented 5 months ago

@SomnusQue Please refer to the data section of the open_clip documentation: https://github.com/mlfoundations/open_clip?tab=readme-ov-file#data