Closed Natsushiro closed 1 year ago
I tried to add webdataset support following open_clip (https://github.com/mlfoundations/open_clip/tree/main). It seems to work, but I haven't verified the results.
Hi! Were you able to get the training code running?
Hi! I have this code running correctly by loading CC3M in webdataset format, following open_clip, but since the dataset is so large I haven't gotten the final result yet.
Thanks for your reply! I can't run the sample training code successfully; it fails with this error. Something seems to be wrong with loading CC3M. Did you encounter and solve this problem?
I think this is because you used CC3M in csv format, and the file cannot be recognized because the CC3M csv file contains only the captions and the urls. My solution is to replace data.py with the one in open_clip and set the argument "--train-data" to the path where you downloaded the webdataset CC3M, e.g.:
current path
└── cc3m
    ├── 00000.tar
    ├── 00000_stats.json
    ├── 00001.tar
    ├── 00001_stats.json
    └── …
With this layout, set it to ./cc3m.
Since the code needs the number of samples in the dataset, I recommend calculating it from the "successes" number recorded in the *_stats.json files of the downloaded dataset, since some of the samples may fail to download depending on your network conditions.
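A minimal sketch of that calculation, assuming each *_stats.json file is a JSON object with a top-level integer "successes" field as described above. The function name count_successes is only illustrative and is not part of this repository or open_clip.

```python
import json
from pathlib import Path


def count_successes(data_dir):
    """Sum the "successes" field over every *_stats.json file in data_dir."""
    total = 0
    for stats_file in sorted(Path(data_dir).glob("*_stats.json")):
        with open(stats_file, encoding="utf-8") as f:
            stats = json.load(f)
        # Assumption: each stats file is a JSON object with an integer
        # "successes" entry; files without that key contribute 0.
        total += int(stats.get("successes", 0))
    return total


# Example (hypothetical path): num_samples = count_successes("./cc3m")
```

The resulting total is the value to pass wherever the training code asks for the number of training samples; in open_clip that is, as far as I know, the --train-num-samples argument.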
@Natsushiro Thanks for your suggestions, they helped me make progress! But I ran into new errors. My directory structure is:
root
├── cc_data
│   ├── train ## training image directory: *.tar, *.json, *.parquet
│   └── val ## validation directory
├── cc
│   ├── Train_GCC-training_output.csv ## caption and url
│   └── Validation_GCC-1.1.0-Validation_output.csv ## caption and url
├── ...
First, I set --train-data to ./cc_data/train, but I got this error:
Then I replaced data.py with the one from open_clip and set --dataset-type to "synthetic", and got this error:
Finally, I used the original data.py and set --dataset-type to "directory", but got this error:
However, I have seen this:
Does that mean the code can only use jpg images?
And where is the number of samples in the dataset used?
If it is convenient for you, please send your contact information to this email address, "zxjia2002@outlook.com", so that we can communicate more efficiently!
Sorry, I made a mistake. The path should look like this: "./cc3m/{00000..00XXX}.tar" (XXX being the number of shards of the webdataset you downloaded). For the Namespace error, my solution was to add the corresponding arguments in src/param.py.
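As a rough sketch of how such a path can be derived instead of typed by hand, the snippet below builds the brace pattern from the .tar files actually present in the download directory. The helper name shard_pattern is hypothetical, and it assumes the shards are named with zero-padded consecutive numbers (00000.tar, 00001.tar, ...). Note that when numbering starts at 00000, the upper bound is the index of the last shard, i.e. the number of shards minus one.

```python
from pathlib import Path


def shard_pattern(data_dir):
    """Build a brace pattern such as ./cc3m/{00000..00002}.tar from the shards on disk."""
    stems = sorted(p.stem for p in Path(data_dir).glob("*.tar"))
    if not stems:
        raise FileNotFoundError(f"no .tar shards found in {data_dir}")
    # Assumption: shard names are zero-padded and consecutive, so the first
    # and last names bound the range; gaps in the numbering are not detected.
    first, last = stems[0], stems[-1]
    base = str(data_dir).rstrip("/")
    return f"{base}/{{{first}..{last}}}.tar"


# Example (hypothetical path): shard_pattern("./cc3m")
```

The returned string can then be passed as the --train-data value.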
I used this path, "./cc3m/{00000..00XXX}.tar", but I ran into one problem like
Hi! Were you able to get the training code running? @zy1879046
I encountered the same problem. May I know how to solve it? Thanks.
Hello, thank you for sharing your exciting work. May I ask what format of CC3M you used for pre-training? Since the files downloaded following open_clip are in webdataset form, it seems that they cannot be used directly in your code. Again, thank you for your awesome work!