kakaobrain / coyo-dataset

COYO-700M: Large-scale Image-Text Pair Dataset
https://kakaobrain.com/contents?contentId=7eca73e3-3089-43cb-b701-332e8a1743fd

Poisoning Vulnerability on COYO-700M #12

Open carlini opened 1 year ago

carlini commented 1 year ago

With some coauthors at Google, I have developed an attack that would allow someone to poison 0.1% of your dataset. (For what the impact of such an attack could be, see e.g. https://arxiv.org/pdf/2106.09667.pdf or https://arxiv.org/abs/2205.06401.) Previously, poisoning attacks have been considered somewhat theoretical: we knew they could exist, but there weren't any practical ways to mount them. With our new techniques I now have the ability to poison the dataset of anyone who has downloaded COYO-700M since it was released (though I have not done so). We believe this attack is not currently being exploited in the wild, and we are hoping to release a paper on it shortly.

As part of this paper we have developed techniques to remediate the attack. We would like to help you apply these defenses before we publish our paper. I would appreciate it if you could contact me (nicholas@carlini.com) at your convenience. I have previously emailed you additional details of this attack to the contact email address you provide (coyo@kakaobrain.com) if you'd like to know more.

carlini commented 1 year ago

Hi, just following up on this. We're hoping one of you will be able to get in contact with us so we can help mitigate any vulnerabilities before we publish our results.

mwbyeon commented 1 year ago

@carlini Hi, we will update the COYO dataset to include the SHA256 hash values shared by the co-authors and release it soon. Thank you for your very impressive research and contribution.
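
For anyone wondering how per-image SHA256 values could be used when downloading, here is a minimal sketch in Python. It assumes each dataset row pairs a URL with an expected hex digest; the function names and the in-memory `downloaded` mapping are hypothetical and only for illustration, not the actual COYO release format or tooling. The idea is simply to drop any image whose downloaded bytes no longer hash to the recorded value.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def filter_verified(samples, downloaded):
    """Keep only the URLs whose downloaded bytes match the expected hash.

    samples:    iterable of (url, expected_sha256_hex) pairs (hypothetical layout)
    downloaded: dict mapping url -> raw bytes fetched for that url
    """
    kept = []
    for url, expected in samples:
        content = downloaded.get(url)
        if content is None:
            # Nothing was fetched for this URL, so there is nothing to trust.
            continue
        if sha256_hex(content) == expected.strip().lower():
            kept.append(url)
        # Otherwise the content changed since the hash was recorded; skip it.
    return kept
```

Note that an exact hash check also rejects images that were re-encoded or resized by the host without any malicious intent, so some benign samples would likely be lost.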