mitdbg / palimpzest

A Declarative System for Optimizing AI Workloads
https://dsg.csail.mit.edu/projects/palimpzest/
MIT License
55 stars, 10 forks

Request for more datasets #48

Open fmh1art opened 4 days ago

fmh1art commented 4 days ago

Hi, I am very excited about the application scenarios of Palimpzest and want to do more research on it. However, I have not found an appropriate dataset that supports analysis over hundreds or thousands of data instances. Could you please share some URLs for such datasets, or suggest a specific research direction where I might find relevant datasets?

mdr223 commented 3 days ago

Hi @fmh1art, if you execute the testdata/download_testdata.sh script (found here) this will download a set of datasets which we used in our CIDR paper. If you then execute the register-sources.sh script (found here) this will register those datasets with your local instance of PZ.

Once you have done that, the enron-eval dataset should have 1000 data instances (each instance is an email), whose labels can be found in testdata/groundtruth/enron-eval.csv. A label of 1 indicates that the email references fraudulent activity and a label of 0 indicates that it does not. Please let me know if this dataset is not suitable for your needs, and I'd be happy to suggest other options!
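To check the label balance before running an evaluation, a minimal sketch like the one below may help. It counts how many rows carry each label in a ground-truth CSV. The helper `count_labels` is hypothetical (not part of PZ), and the label column name is passed in explicitly because the header of enron-eval.csv is not stated in this thread; inspect the file to find the real column name.

```python
import csv
from collections import Counter


def count_labels(csv_path, label_column):
    """Count how many rows carry each label value in a ground-truth CSV.

    `label_column` is an assumption supplied by the caller: check the CSV
    header to find the actual name of the label column.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {label_column!r} not found; header is {reader.fieldnames}"
            )
        # Labels are kept as strings ("0" / "1") exactly as they appear in the file.
        return Counter((row[label_column] or "").strip() for row in reader)


# Example call (the column name "label" is a guess, not confirmed by the thread):
# counts = count_labels("testdata/groundtruth/enron-eval.csv", "label")
# print(counts)
```

The returned `Counter` maps each label string to its row count, so a heavily skewed split between "1" (references fraudulent activity) and "0" is visible immediately.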

fmh1art commented 2 days ago

Thanks for your detailed reply! I am sorry I did not clarify my needs: I have already downloaded all the datasets you provided, but for some complex tasks like real-estate-eval, I find the dataset is incomplete (it has only 30 data instances). If I want to evaluate on a larger dataset with more complicated multimodal data, where can I find such datasets?

mdr223 commented 1 day ago

Hi @fmh1art, our mistake for not uploading the dataset with all 100 real estate listings -- I've just uploaded it to the following location: https://palimpzest-workloads.s3.us-east-1.amazonaws.com/real-estate-eval-100.tar; you should be able to download the tar file with:

$ wget https://palimpzest-workloads.s3.us-east-1.amazonaws.com/real-estate-eval-100.tar
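If wget is not available, a rough Python equivalent that also unpacks the archive might look like the sketch below. The helper names `download` and `extract` and the output directory in the example call are hypothetical; the internal layout of the tar file is not described in this thread, so the function simply reports the top-level entries it finds after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the comment above.
URL = "https://palimpzest-workloads.s3.us-east-1.amazonaws.com/real-estate-eval-100.tar"


def download(url, dest):
    """Download `url` to `dest` unless the file is already present."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(tar_path, out_dir):
    """Extract a tar archive into `out_dir` and return its top-level entry names."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        # Refuse archive members that would be written outside out_dir.
        for member in tar.getmembers():
            target = (out_dir / member.name).resolve()
            if target != out_dir and out_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(out_dir)
    return sorted(p.name for p in out_dir.iterdir())


# Example call (the output directory name is an assumption; adjust to your checkout):
# archive = download(URL, "real-estate-eval-100.tar")
# print(extract(archive, "testdata"))
```

After extraction, the new data still has to be registered with the local PZ instance, as with the other datasets discussed earlier in this thread.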