recsyschallenge / 2017


Duplicate lines in interactions.csv #11

Closed TiphaineV closed 7 years ago

TiphaineV commented 7 years ago

Hello,

In the data, there seem to be many (about 1×10^8) duplicate interactions [1], i.e. lines in which all fields are equal (including the timestamp). Should we remove them, or treat them as legitimate interactions? In the latter case, what do they mean?

Thanks, Tiphaine.

[1] Found with the command: tail -n +2 interactions.csv | sort -T. -S4g | uniq | wc -l (which yields the number of distinct lines in interactions.csv; the number of duplicates is the total number of data lines minus this value).
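For anyone who prefers to check this from Python, here is a minimal sketch of the same count. It is not the command used above: it skips the header line (like tail -n +2) and reports total, distinct, and duplicate lines in one pass. The function name, the `DuplicateStats` type, and the sample column names are made up for illustration. Unlike sort, it keeps every distinct line in memory, so it may not fit the full file on a small machine.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class DuplicateStats(NamedTuple):
    total: int       # data lines read (header excluded)
    distinct: int    # number of unique lines
    duplicates: int  # total - distinct


def count_duplicate_lines(lines: Iterable[str], has_header: bool = True) -> DuplicateStats:
    """Count exact duplicate lines, comparing whole lines (all fields)."""
    it = iter(lines)
    if has_header:
        next(it, None)  # skip the header, like `tail -n +2`
    # All distinct lines are held in memory here (hypothetical helper, not from the thread).
    counts = Counter(line.rstrip("\n") for line in it)
    total = sum(counts.values())
    distinct = len(counts)
    return DuplicateStats(total, distinct, total - distinct)


# Tiny in-memory sample; the header names are an assumption about the file layout.
sample = [
    "user_id\titem_id\tinteraction_type\tcreated_at\n",
    "1\t10\t0\t100\n",
    "1\t10\t0\t100\n",
    "2\t10\t1\t200\n",
]
print(count_duplicate_lines(sample))
```

On the real file this would be called as count_duplicate_lines(open("interactions.csv")), memory permitting.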

dkohlsdorf commented 7 years ago

Hey, the impressions are measured per day, so there can be duplicates: a line that appears twice (or however many times it is included) means we had that many impressions on that day. You can decide whether you need this information or not.

Daniel

jbochi commented 7 years ago

Hi Daniel. Does the evaluation metric count repeated impressions multiple times?

Thanks, Juarez

fabianabel commented 7 years ago

Hi Juarez, thank you for contributing! Regarding your question: no, impressions (interaction_type=0) are not considered by the evaluation metric. And interactions such as clicks, bookmarks, etc. are counted only once, i.e.: "user clicked at least once on the pushed recommendation, i.e. multiple clicks won't increase the points". Thanks, fabian
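One way to read the answer above when preparing data, as a rough sketch rather than the official evaluation code: drop impressions (interaction_type=0) and keep each remaining interaction at most once per user and item. The column names user_id, item_id, and interaction_type, and the function name, are assumptions for illustration. Keying the set on the interaction type as well is my own reading of "counted only once". No points are assigned here.

```python
from typing import Dict, Iterable, Set, Tuple

IMPRESSION = 0  # interaction_type of impressions, as stated in the thread


def scorable_interactions(rows: Iterable[Dict[str, str]]) -> Set[Tuple[str, str, str]]:
    """Return distinct (user_id, item_id, interaction_type) triples, impressions excluded.

    Hypothetical helper: repeated clicks, bookmarks, etc. collapse into a single
    entry, so a duplicate line cannot contribute more than once.
    """
    seen: Set[Tuple[str, str, str]] = set()
    for row in rows:
        if int(row["interaction_type"]) == IMPRESSION:
            continue  # impressions are not considered by the metric
        seen.add((row["user_id"], row["item_id"], row["interaction_type"]))
    return seen


# Made-up rows: one impression, a click repeated twice, and one other interaction.
rows = [
    {"user_id": "1", "item_id": "10", "interaction_type": "0"},
    {"user_id": "1", "item_id": "10", "interaction_type": "1"},
    {"user_id": "1", "item_id": "10", "interaction_type": "1"},
    {"user_id": "2", "item_id": "10", "interaction_type": "2"},
]
print(sorted(scorable_interactions(rows)))
```

Rows in this shape can be produced with csv.DictReader, provided the delimiter matches the file.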

TiphaineV commented 7 years ago

Hi,

Thanks for the information. :) I'll close this issue.

Best, Tiphaine.