Using TinyCLIP with other datasets.

GeorgiosSmyrnis commented 7 months ago

Hello,

Thanks a lot for your great work on the TinyCLIP paper!

I would like to use your code to apply cross-modal distillation to CLIP models with other datasets (e.g. CC3M / CC12M). The provided scripts currently seem to use synthetic data. When I replace it with webdataset versions of those datasets, as in OpenCLIP, the call to get_wds_dataset throws errors.

I would be grateful if you could provide some pointers on how to use your codebase with other datasets!

wkcn commented 7 months ago

Thanks for your attention to our work!

Could you please tell me what errors are raised?

The dataset is used in the same way as OpenCLIP.

GeorgiosSmyrnis commented 7 months ago

Thank you for the reply!

When the data is in webdataset format, get_wds_dataset is called with a tokenizer argument here: https://github.com/microsoft/Cream/blob/73afa00ae492928e836bfbe2f249ce08a655cae9/TinyCLIP/src/training/data.py#L523-L524. However, its signature, defined here, does not accept a tokenizer parameter: https://github.com/microsoft/Cream/blob/73afa00ae492928e836bfbe2f249ce08a655cae9/TinyCLIP/src/training/data.py#L314

This version of get_wds_dataset doesn't seem to use the tokenizer (tokenization appears to be hardcoded), so maybe the extra argument can simply be ignored. However, if you would suggest that I convert the data to a different format, e.g. CSV, I can do that as well.
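
For anyone hitting the same thing, here is a minimal sketch of the mismatch described above and one possible local workaround. It is not the TinyCLIP code: the get_wds_dataset below is a hypothetical stand-in with a made-up body, and call_with_supported_kwargs is a helper name I invented for illustration. The sketch only shows the general Python behaviour, namely that passing a keyword the signature does not declare raises a TypeError, and that the extra keyword can be filtered out before the call.

```python
import inspect


def get_wds_dataset(args, preprocess_img, is_train, epoch=0):
    """Hypothetical stand-in for the real function: note there is no `tokenizer` parameter."""
    return {"is_train": is_train, "epoch": epoch}


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call `fn`, dropping keyword arguments that its signature does not accept.

    Illustrative helper (an assumption, not part of TinyCLIP or OpenCLIP).
    """
    params = inspect.signature(fn).parameters
    # If the function declares **kwargs, every keyword is acceptable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(*args, **kwargs)
    accepted = {name: value for name, value in kwargs.items() if name in params}
    return fn(*args, **accepted)


# Reproduce the failure: the caller passes `tokenizer`, the signature lacks it.
error_message = None
try:
    get_wds_dataset(None, None, is_train=True, epoch=0, tokenizer=object())
except TypeError as exc:
    error_message = str(exc)

# Workaround: silently drop the unsupported keyword before calling.
result = call_with_supported_kwargs(
    get_wds_dataset, None, None, is_train=True, epoch=0, tokenizer=object()
)
```

Dropping the argument is only reasonable if the function really does not need it, as appears to be the case here since tokenization looks hardcoded. Otherwise the signature itself should be changed to accept the tokenizer.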

wkcn commented 7 months ago

Hi @GeorgiosSmyrnis, thank you for pointing it out!

I have fixed the bug in the latest code.

GeorgiosSmyrnis commented 7 months ago

Hi @wkcn,

Thank you very much for the fix! I'll use the newest version and post here again if I have any issues.

microsoft / Cream

Using TinyCLIP with other datasets. #206