knoriy / CLARA

Apache License 2.0

how to download/prepare datasets #10

Closed: shahrukhx01 closed this issue 10 months ago

shahrukhx01 commented 10 months ago

@knoriy First of all, thanks for the great contribution. I am trying to reproduce the paper's results using the medium model checkpoint on the underlying datasets. However, I am unsure how the datasets can be accessed. Could you please point me to how I can download and prepare them? Thank you!

shahrukhx01 commented 10 months ago

FYI: Unfortunately, the S3 bucket is not publicly accessible.

knoriy commented 10 months ago

Hi, unfortunately we can't share the processed data used to train CLARA, but you can use the LAION audio dataset scripts to download and process the data as needed. Most of the datasets are publicly available.

The CLARA codebase uses TorchData/webdataset format. In the future, I'd like to release a pipeline to perform augmentation and data processing.
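
For readers unfamiliar with the webdataset layout mentioned above, here is a minimal sketch of the general convention: each shard is a plain tar archive, and files that share a basename (the sample key) but differ in extension make up one sample. It uses only the Python standard library. The `.flac`/`.json` extensions, the `text` metadata field, and the helper names `write_shard` and `read_shard` are assumptions for illustration, not the layout CLARA actually expects, so check the repository's data-loading code for the real field names.

```python
import io
import json
import tarfile


def add_member(tar, name, payload):
    """Add an in-memory bytes payload to an open tar archive under `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shard(path, samples):
    """Write (key, audio_bytes, metadata_dict) tuples into one tar shard.

    The two files of a sample are written next to each other and share
    the same key, which is how webdataset-style readers group them.
    """
    with tarfile.open(path, "w") as tar:
        for key, audio, meta in samples:
            # Hypothetical extensions; a real pipeline may use others.
            add_member(tar, f"{key}.flac", audio)
            add_member(tar, f"{key}.json", json.dumps(meta).encode("utf-8"))


def read_shard(path):
    """Group tar members by key (the part of the name before the first dot)."""
    samples = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


if __name__ == "__main__":
    # Placeholder bytes stand in for real encoded audio.
    write_shard("shard-000000.tar", [("sample0", b"\x00\x01", {"text": "hello"})])
    print(read_shard("shard-000000.tar").keys())
```

Shards written this way should be readable by tar-based loaders in general, but whether a given shard works with CLARA depends on the keys and extensions its data module looks for.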