rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Noncompliance by default with the General Data Protection Regulation (GDPR/RGPD) #303

Closed jmaris closed 1 year ago

jmaris commented 1 year ago

Dear rom1504,

Please take note of the fact that your software is not compliant with the GDPR in at least some EU jurisdictions. While some personal information such as age or occupation that is published online would be considered as public domain, photos of individuals are not.

Using this tool to download such photos and then process images of other people would inevitably infringe the GDPR.

Therefore, I recommend you follow the proposal in #293 to avoid issues with your local data protection authority (CNIL, in your case).

Regards,

jmaris commented 1 year ago

As an additional point: data protection is not opt-in. You can't allow your users to break the GDPR as a default.

rom1504 commented 1 year ago

It is the downloader's responsibility to handle any personal data they download from urls they provide.

If the downloader is getting a lot of personal information from image urls, they should definitely filter the images to keep only the non-personal information.

MarManHollow commented 1 year ago

It is the downloader's responsibility to handle any personal data they download from urls they provide. If the downloader is getting a lot of personal information from image urls, they should definitely filter the images to keep only the non-personal information.

Lol. Go and read up on GDPR and then you hopefully will stop fighting windmills 🤷‍♂️

jmaris commented 1 year ago

You are correct that the downloader is responsible (though some jurisdictions question this approach). However, as provided, your tool would break the GDPR by default, which should not be the default behavior, particularly for an EU-based project. I can request clarification from CNIL if you like. Given their opinion on Clearview, I'm sure this project would be of interest to them.

rom1504 commented 1 year ago

your tool would break the GDPR by default

If the downloader intends to break GDPR by first using a downloading tool and then storing data they are not allowed to store, they can certainly do so. That's also true for browsers, wget, or an internet connection in general.

Feel free to consult any authority as to whether using a downloading tool to download urls provided by the user fits the GDPR.

jmaris commented 1 year ago

your tool would break the GDPR by default

If the downloader intends to break GDPR by first using a downloading tool and then storing data they are not allowed to store, they can certainly do so. That's also true for browsers, wget, or an internet connection in general.

Feel free to consult any authority as to whether using a downloading tool to download urls provided by the user fits the GDPR.

The answer to that question is: if the content those URLs refer to is personal data, then the downloader is breaking the GDPR.

The question I would like to ask CNIL is whether facilitating infringement by default is problematic.

In any case you have made your position on the issue abundantly clear.

MostAwesomeDude commented 1 year ago

@jmaris: Please ask CNIL whether robots.txt declarations have any force, as that seems to be the crux of this particular matter.

In particular, is wget an acceptable tool? The only real difference between wget --spider and this sort of tool is that the former respects robots.txt. And, to be frank, it would be nice to know whether any authorities have problems with wget's existence.

rom1504 commented 1 year ago

wget --spider and this tool are very different.

img2dataset is not a spider; it does not discover urls. It only downloads urls you tell it to.
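
For illustration only, the distinction discussed above can be sketched in Python: a downloader that works from a fixed, user-supplied URL list does no discovery, but it could still consult robots.txt rules before fetching. This is a minimal sketch using the standard library's `urllib.robotparser`; it is not how img2dataset works, and the function name `filter_allowed`, the user agent string, and the sample rules and URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, robots_txt, user_agent="example-downloader"):
    """Return only the URLs that the given robots.txt text permits.

    Hypothetical helper: the robots.txt content is passed in as a string
    so the sketch runs offline; a real tool would fetch it per host.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Sample rules (assumed for the example): block everything under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A fixed list supplied by the user; nothing here is discovered by crawling.
urls = [
    "https://example.com/images/cat.jpg",
    "https://example.com/private/portrait.jpg",
]

allowed = filter_allowed(urls, ROBOTS_TXT)
print(allowed)
```

Only the first URL passes the filter here. Note that this checks crawl permissions only; whether a given image is personal data is a separate question that robots.txt does not answer.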

jmaris commented 1 year ago

@MostAwesomeDude That's something of a disingenuous comparison to make.

MostAwesomeDude commented 1 year ago

@jmaris What did CNIL say?

jmaris commented 1 year ago

@jmaris What did CNIL say?

Hey,

Sorry for not getting back sooner. I received a reply from CNIL today concerning img2dataset, stating that while they cannot take action against the tool itself, they could take action against anyone using this tool in the EU, as such processing would require the consent of the concerned parties.

Essentially, do not use this tool on images of EU citizens within the European Union.

As I sent my request prior to your question on robots.txt files, I did not include it in my question. However, I would say that we cannot compare the use of wget to this tool, as this tool is primarily focussed on building image datasets, whereas wget is a general purpose utility.

On the issue of respecting robots.txt, that is more an issue of copyright IMHO.