patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Request for Assistance in Replicating Re-Identification Risk Experiment #267

Open yashmaurya01 opened 8 months ago

yashmaurya01 commented 8 months ago

Hello,

I'm attempting to replicate the re-identification risk experiment detailed in the paper "Measuring Re-identification Risk." However, I'm encountering difficulties in accessing the Million Song Dataset, which was used in the empirical analysis. Unfortunately, the Echonest website appears to be down, preventing me from obtaining the necessary API key to access the dataset.

I would greatly appreciate any guidance on how to obtain the Million Song Dataset and replicate the experiment's results. Additionally, I'm seeking information on attribute mappings for the MSD dataset, specifically to simulate a scenario similar to the Topics API, which requires data such as browser history and the frequency of visits within a week for topic calculations.

Thank you for your assistance.

Best regards, Yash Maurya

aleepasto commented 8 months ago

Thank you for your interest in our paper "Measuring Re-identification Risk." We are delighted to see the research community's interest in replicating our work, and we are happy to assist you.

First, we would like to clarify that the MSD dataset was used in the paper exclusively to allow academic researchers to experiment with a public dataset using our open source code. For this reason, we chose a public dataset that has been used in many academic papers in the past. However, we did not intend the MSD dataset to be considered similar or related to the Topics API, since the dataset is not based on browsing histories or Topics API outputs.

We refer to Section 8.5 of the paper (Measuring Re-identification Risk), where we discuss specifically how we generated samples from the MSD dataset to test the probability of correctly matching a sample based on the song ids. Note that this data generation process is not similar to the Topics API sampling method, and the results on this dataset do not have implications for the re-id risk of the Topics API.

Concerning the data, the dataset appears to still be available at this repository: http://millionsongdataset.com/tasteprofile/

svijayakumar2 commented 8 months ago

I think the issue is with the API key. We can't access the user data without a key, but since the MSD changed ownership, it doesn't seem to be publicly accessible anymore. Do you know how to circumvent this problem, or can you confirm that this is the case?

aleepasto commented 8 months ago

Hi, thanks for the question. The specific dataset we used appears to be available at this link (without requiring a key): http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip

Please let me know if you have any other questions.
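
For readers following along, here is a minimal sketch of how the unzipped file might be loaded into per-user song sets. It assumes each line of `train_triplets.txt` holds three tab-separated fields (user id, song id, play count); that layout should be checked against the dataset page before relying on it. The function name `load_user_songs` is hypothetical and does not come from the paper's code.

```python
from collections import defaultdict


def load_user_songs(lines):
    """Group song ids by user id.

    Assumption: each line is "user_id<TAB>song_id<TAB>play_count".
    The ids are kept as opaque strings and the play count is ignored.
    """
    user_songs = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user_id, song_id, _play_count = parts
        user_songs[user_id].add(song_id)
    return dict(user_songs)


# Usage, assuming the archive was unzipped into the working directory:
# with open("train_triplets.txt", encoding="utf-8") as f:
#     user_songs = load_user_songs(f)
```

Because the function accepts any iterable of lines, it can be tried on a handful of lines before reading the full file.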

suriya-ganesh commented 8 months ago

Hi @aleepasto, in the file you linked, the first and second columns seem to be some sort of ID. Were the experiments run over the IDs, or were the IDs decoded into their values? Thanks

aleepasto commented 8 months ago

Hi, we use the song ids associated with a given user in the dataset, without attaching any meaning to the ids. As reported in Section 8.5 of the paper, we simulate a system that outputs a sample of r songs for each user, independently, to generate two different databases. Then, we measure the match rate across the two databases for a fixed r.
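
The procedure described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the matching rule (link each user in the first database to the user in the second with the largest sample overlap, and count a tie as a failure) is a simple stand-in, and the names `sample_database` and `match_rate` as well as the toy data are hypothetical. Section 8.5 of the paper remains the reference for the actual setup.

```python
import random


def sample_database(user_songs, r, rng):
    """Independently draw up to r distinct song ids for each user."""
    return {
        user: frozenset(rng.sample(sorted(songs), min(r, len(songs))))
        for user, songs in user_songs.items()
    }


def match_rate(db_a, db_b):
    """Fraction of users in db_a whose unique best-overlap match in db_b
    is the same user. The overlap rule is an assumption for illustration."""
    if not db_a:
        return 0.0
    correct = 0
    for user, sample in db_a.items():
        overlaps = {other: len(sample & s) for other, s in db_b.items()}
        best = max(overlaps.values())
        winners = [other for other, size in overlaps.items() if size == best]
        if best > 0 and winners == [user]:
            correct += 1
    return correct / len(db_a)


# Toy data (hypothetical): song ids are opaque strings, as in the reply above.
toy_users = {"u1": {"a", "b", "c"}, "u2": {"d", "e", "f"}}
rng = random.Random(0)
db_one = sample_database(toy_users, r=2, rng=rng)
db_two = sample_database(toy_users, r=2, rng=rng)
toy_rate = match_rate(db_one, db_two)
```

In the toy data the two users share no songs, and any two 2-song samples drawn from the same 3-song set overlap in at least one song, so every user is matched correctly. On real data the rate depends on r and on how much users' listening histories overlap.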

AmanPriyanshu commented 8 months ago

Hi, this discussion is really interesting. I just wanted to clarify something about the Million Song implementation. Ideally, re-identification against the Topics API is going to depend on the attack model's ability to understand user behavior. These attacks would strongly depend on the somewhat deterministic nature of the per-topic frequency counts in every epoch/week.

However, I was confused about why r random songs were chosen for these users instead of applying the same frequency counting. Won't the randomness prevent any patterns from forming?

aleepasto commented 7 months ago

Thanks for the comment. Given the fundamental differences between the MSD dataset and the real Topics API implementation, we did not intend to use the MSD dataset to model the Topics API in any way. For this reason, we did not attempt to mimic any part of the API behavior (e.g., fixing top k = 5 songs per user). We only included the dataset in the paper to allow external researchers to validate the theoretical and empirical modeling of our paper in a different context.

I hope this answers your question.