Data Filtering Princeton: Decision Needed

@knowtheory @jepstein @betling @rhelmer @marniepw for visibility.

Hi, all. Princeton has new deliverables for us. Princeton's IRB does not want the researchers to receive any PII. After talking with @betling, I understand this is similar to what we do for Stanford. He says we could either do the filtering in the extension itself, or have Princeton add a file to the codebase specifying their requirements and filters; SRE would then write a query and create an environment for Princeton to access.

According to Princeton, sites should be divided into three categories:

1. Greatest interest to researchers
   - Least sensitive; users typically won't be submitting content, etc.
   - Domain-level allow list
   - Ex. news sites
2. Intermediate
   - Mozilla makes less info available to researchers
   - Part domain allow list
   - Hash of the domain
   - Redact PII
3. Everything else
   - Only collect the domain and a hash of the URL
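
To make the three tiers concrete for discussion, here is a minimal sketch of how a record could be filtered per category. Everything in it is an assumption for illustration, not an agreed design: the allow-list contents, the function names (`filter_record`, `redact_pii`, `sha256_hex`), the output field names, and the reading of "redact PII" as dropping the query string and fragment are all hypothetical. It is written in Python only for brevity; the same logic could live in the extension or in an SRE query.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical allow lists; the real ones would come from Princeton's spec.
FULL_ACCESS_DOMAINS = {"news.example.com"}       # tier 1: least sensitive
INTERMEDIATE_DOMAINS = {"forum.example.org"}     # tier 2: reduced detail


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def redact_pii(url: str) -> str:
    """Placeholder redaction (assumption): keep scheme, host and path only,
    dropping the query string, fragment and any userinfo."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.hostname}{parts.path}"


def filter_record(url: str) -> dict:
    """Reduce a visited URL to the fields allowed for its tier."""
    domain = (urlsplit(url).hostname or "").lower()
    if domain in FULL_ACCESS_DOMAINS:
        # Tier 1: domain is on the allow list, so the URL is passed through.
        return {"tier": 1, "domain": domain, "url": url}
    if domain in INTERMEDIATE_DOMAINS:
        # Tier 2: hash the domain and redact the URL.
        return {"tier": 2, "domain_hash": sha256_hex(domain), "url": redact_pii(url)}
    # Tier 3: everything else keeps only the domain and a hash of the URL.
    return {"tier": 3, "domain": domain, "url_hash": sha256_hex(url)}
```

One caveat if we go this way: a plain, unsalted hash of a domain is easy to reverse by hashing a list of known domains, so the hashing scheme itself would need to be agreed with Princeton and SRE.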

We want to ensure that Princeton does not have access to unfiltered data, and that this is conveyed to users in the survey and in the short- and long-form study descriptions.

**Any suggestions for the best way to go about this? Who will own this process? Any further questions? If anyone would like to be invited to our follow-up with Princeton this Friday for more information, please let me know.**

**It is safe to assume the Search Engine study will require the same format. It is going back under IRB review for more funding.**
