Closed GeorgiosSmyrnis closed 7 months ago
Thanks for your attention to our work!
Could you please tell me what errors are raised?
The dataset is used in the same way as OpenCLIP.
Thank you for the reply!
What happens is that, if the data is in webdataset format, then `get_wds_dataset` is passed a `tokenizer` parameter here https://github.com/microsoft/Cream/blob/73afa00ae492928e836bfbe2f249ce08a655cae9/TinyCLIP/src/training/data.py#L523-L524 while its signature here does not accept such a parameter: https://github.com/microsoft/Cream/blob/73afa00ae492928e836bfbe2f249ce08a655cae9/TinyCLIP/src/training/data.py#L314
It doesn't seem that this version of `get_wds_dataset` actually uses the tokenizer (tokenization appears to be hardcoded), so maybe this can be ignored. However, if you would suggest converting the data to a different format, e.g. CSV, I can do that as well.
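For reference, the mismatch boils down to a keyword argument the callee never declares. A minimal, hypothetical sketch (function bodies and names simplified; not the actual `data.py` code):

```python
# Simplified stand-in for the buggy signature: no `tokenizer` parameter.
def get_wds_dataset(args, preprocess_img, is_train, epoch=0):
    return f"wds dataset (train={is_train})"

DATASET_FNS = {"webdataset": get_wds_dataset}

def get_dataset_fn(dataset_type):
    return DATASET_FNS[dataset_type]

# The call site passes tokenizer=..., which raises a TypeError:
try:
    get_dataset_fn("webdataset")(None, None, True, epoch=0, tokenizer=object())
except TypeError as e:
    print("caller/callee mismatch:", e)

# One possible fix: accept the argument for interface compatibility
# (and ignore it here, since tokenization is done inside the pipeline).
def get_wds_dataset_fixed(args, preprocess_img, is_train, epoch=0, tokenizer=None):
    return f"wds dataset (train={is_train})"

print(get_wds_dataset_fixed(None, None, True, tokenizer=object()))
```

Either dropping the argument at the call site or accepting it in the signature resolves the crash; the latter keeps `get_wds_dataset` interchangeable with the other `get_*_dataset` functions that do take a tokenizer.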
Hi @GeorgiosSmyrnis, thank you for pointing it out!
I have fixed the bug in the latest code.
Hi @wkcn,
Thank you very much for the fix! I'll use the newest version and post here again if I have any issues.
Hello,
Thanks a lot for your great work on the TinyCLIP paper!
I want to use your code to apply cross-modal distillation on CLIP models with other datasets (e.g. CC3M / CC12M). It seems that the provided scripts currently use synthetic data, and replacing that with e.g. webdataset versions of the aforementioned datasets, as in OpenCLIP, throws errors when calling `get_wds_dataset`. I would be grateful if you could provide some pointers on how to use your codebase with other datasets!