About the format of CC3M for pretraining

Natsushiro commented 1 year ago

Hello, thank you for sharing your exciting work. May I ask what format of CC3M you used for pre-training? The files downloaded by following open_clip are in webdataset format, so it seems they cannot be used directly with your code. Again, thank you for your awesome work!

Natsushiro commented 1 year ago

I tried to add webdataset support following open_clip (https://github.com/mlfoundations/open_clip/tree/main). It seems to work, but I haven't verified the results yet.

KeepAndWin commented 1 year ago

Hi! Were you able to get the training code running?

Natsushiro commented 1 year ago

Hi! I have the code running correctly by loading CC3M in webdataset format, following open_clip, but since the dataset is very large I haven't gotten the final result yet.

KeepAndWin commented 1 year ago

Thanks for your reply! I can't run the sample training code successfully; it fails with an error. Something seems to be wrong with loading CC3M. Did you encounter and solve this problem?

Natsushiro commented 1 year ago

I think this happens because you used CC3M in csv format, which cannot be loaded since the CC3M csv file contains only the captions and the URLs. My solution is to replace data.py with the one from open_clip and set the argument "--train-data" to the path where you downloaded the webdataset CC3M, e.g.:

current path
└── cc3m
    ├── 00000.tar
    ├── 00000_stats.json
    ├── 00001.tar
    ├── 00001_stats.json
    └── …

set to ./cc3m.
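Before training, it can help to check that a shard really has the webdataset layout, where files sharing a basename form one sample. The following is only a rough sketch using the standard library; `sample_extensions` is a hypothetical helper name, and the assumption that each sample carries an image file plus a caption file depends on how the dataset was downloaded.

```python
import posixpath
import tarfile
from collections import defaultdict


def sample_extensions(shard_path: str, limit: int = 5) -> dict:
    """Return {sample key: sorted extensions} for the first `limit` samples.

    Assumes the webdataset convention: everything before the first dot of
    the file name is the sample key, everything after it is the extension.
    """
    samples = defaultdict(list)
    with tarfile.open(shard_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            base = posixpath.basename(member.name)
            key, _, ext = base.partition(".")
            # Stop once a new sample starts after `limit` samples were seen.
            if key not in samples and len(samples) >= limit:
                break
            samples[key].append(ext)
    return {key: sorted(exts) for key, exts in samples.items()}
```

Calling `sample_extensions("./cc3m/00000.tar")` on a downloaded shard shows which image extension the samples actually use.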

Since the code needs the number of samples in the dataset, I recommend calculating it from the "successes" number recorded in the *_stats.json files of the downloaded dataset, since some samples may have failed to download depending on your network conditions.
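A minimal sketch of that calculation, assuming every *_stats.json file is a JSON object with an integer "successes" field (the function name `count_successes` and the ./cc3m path are just for illustration):

```python
import json
from pathlib import Path


def count_successes(shard_dir: str) -> int:
    """Sum the "successes" field over all *_stats.json files in shard_dir."""
    total = 0
    for stats_path in sorted(Path(shard_dir).glob("*_stats.json")):
        with stats_path.open() as f:
            stats = json.load(f)
        # A file without the field contributes 0 instead of raising.
        total += int(stats.get("successes", 0))
    return total


if __name__ == "__main__":
    # "./cc3m" is the example download directory from this comment.
    print(count_successes("./cc3m"))
```

The printed total is the sample count to give to the training script. In open_clip the flag for this is --train-num-samples; whether this repository's argument parser already has it is not confirmed here.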

KeepAndWin commented 1 year ago

@Natsushiro Thanks for your suggestions. That got me further, but now I get new errors. My directory structure is:

root
├── cc_data
│   ├── train  ## training image directory: *.tar, *.json, *.parquet
│   └── val    ## validation directory
├── cc
│   ├── Train_GCC-training_output.csv  ## captions and URLs
│   └── Validation_GCC-1.1.0-Validation_output.csv  ## captions and URLs
└── ...

First, I set --train-data to ./cc_data/train, but I got an error. Then I replaced data.py with the one from open_clip and set --dataset-type to "synthetic", which gave another error. Finally, I used the original data.py and set --dataset-type to "directory", which also failed with an error. However, I have also seen this. Does that mean the code can only use jpg images? And where is the number of samples in the dataset used? If it is convenient for you, please send your contact information to "zxjia2002@outlook.com" so that we can communicate efficiently!

Natsushiro commented 1 year ago

Sorry, I made a mistake. The path should look like this: "./cc3m/{00000..00XXX}.tar" (XXX being determined by the number of shards of the webdataset you downloaded). For the Namespace error, my solution is to add the corresponding arguments in src/param.py.
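To avoid working out the range by hand, the pattern can be built from the files on disk. This is only a sketch: `shard_pattern` is a hypothetical helper, and it assumes the shards are zero-padded to the same width and numbered without gaps. Note that if shards start at 00000, the last index is the shard count minus one.

```python
import re
from pathlib import Path


def shard_pattern(directory: str) -> str:
    """Build a brace pattern covering the numbered .tar shards in directory."""
    names = sorted(
        p.stem for p in Path(directory).glob("*.tar") if re.fullmatch(r"\d+", p.stem)
    )
    if not names:
        raise FileNotFoundError(f"no numbered .tar shards found in {directory}")
    # Equal-width zero-padded names sort in numeric order, so the first and
    # last entries are the lowest and highest shard indices.
    first, last = names[0], names[-1]
    return f"{directory.rstrip('/')}/{{{first}..{last}}}.tar"
```

For a directory holding 00000.tar through 00002.tar, `shard_pattern("./cc3m")` gives "./cc3m/{00000..00002}.tar", which is the form to pass as --train-data.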

zy1879046 commented 11 months ago

I used the path "./cc3m/{00000..00XXX}.tar", but I ran into a problem.

qqbangbangbang commented 10 months ago

Hi! Were you able to get the training code running? @zy1879046

1028Bjt commented 6 months ago

I also encountered the same problem. May I know how to solve it? Thanks.

google-research / composed_image_retrieval

About the format of CC3M for pretraining #9