tracking-exposed / dashboard

dashboard is the toolkit for data scientists who use tracking.exposed services
https://facebook.tracking.exposed

combine.py should allow comparisons between users or between users and fbcrawl output #6

Closed berli0z closed 5 years ago

berli0z commented 5 years ago
| source | postId | date | total times seen | total users seeing | average times seen by user | average impressionOrder | reactions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| "Douglas Adams" | 46dfaba8fa9afs7 | 2019-04-01 | 100 | 20 | 5 | 7.473 | 4726 |
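
A minimal sketch of how one row per postId could be derived from per-impression rows. The input field names (`user`, `postId`, `source`, `impressionOrder`) are assumptions made for illustration, not the confirmed fbtrex summary schema, and the sample data is made up.

```python
from collections import defaultdict


def aggregate_by_post(impressions):
    """Collapse per-impression rows into one aggregated row per postId."""
    groups = defaultdict(list)
    for row in impressions:
        groups[row["postId"]].append(row)

    result = []
    for post_id, rows in groups.items():
        users = {r["user"] for r in rows}  # distinct users who saw the post
        total_seen = len(rows)             # every impression counts once
        result.append({
            "source": rows[0]["source"],
            "postId": post_id,
            "total times seen": total_seen,
            "total users seeing": len(users),
            "average times seen by user": total_seen / len(users),
            "average impressionOrder":
                sum(r["impressionOrder"] for r in rows) / total_seen,
        })
    return result


# Hypothetical sample: two users, three impressions of the same post.
sample = [
    {"source": "Douglas Adams", "postId": "p1", "user": "u1", "impressionOrder": 3},
    {"source": "Douglas Adams", "postId": "p1", "user": "u1", "impressionOrder": 5},
    {"source": "Douglas Adams", "postId": "p1", "user": "u2", "impressionOrder": 10},
]
rows = aggregate_by_post(sample)
```

With this definition, "average times seen by user" is simply total times seen divided by total users seeing, which matches the example row (100 / 20 = 5).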

Furthermore, to prepare for using this tool, the README should state that you need two folders: one with all the fbtrex summaries and one with the fbcrawl source outputs as CSV files.
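
A rough sketch of what reading the two folders might look like, assuming both hold plain CSV files with a header row. The function names and the extra `file` column are hypothetical and not part of combine.py.

```python
import csv
from pathlib import Path


def load_csv_folder(folder):
    """Read every *.csv file in a folder into a list of dict rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["file"] = path.name  # remember which file the row came from
                rows.append(row)
    return rows


def load_inputs(summary_dir, fbcrawl_dir):
    """Return (fbtrex summary rows, fbcrawl rows), failing early if a folder is missing."""
    for folder in (summary_dir, fbcrawl_dir):
        if not Path(folder).is_dir():
            raise FileNotFoundError(f"missing input folder: {folder}")
    return load_csv_folder(summary_dir), load_csv_folder(fbcrawl_dir)
```

Usage would be along the lines of `summaries, crawl = load_inputs("summaries", "fbcrawl")`, where both folder names are placeholders.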

berli0z commented 5 years ago

This has higher priority due to the upcoming eu19 initiative

berli0z commented 5 years ago

This could be output hourly or daily (for time series), or as a total aggregate (for pie charts or similar).
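
As a sketch of the three output granularities, assuming each row carries an ISO 8601 timestamp string (the `bucket_counts` helper and the sample timestamps are made up for illustration):

```python
from collections import Counter
from datetime import datetime

# Bucket label format per granularity; "total" is handled separately.
FORMATS = {"hourly": "%Y-%m-%d %H:00", "daily": "%Y-%m-%d"}


def bucket_counts(timestamps, granularity):
    """Count timestamps per hour, per day, or as one total aggregate."""
    if granularity == "total":
        return {"total": len(timestamps)}
    fmt = FORMATS[granularity]
    counts = Counter(datetime.fromisoformat(ts).strftime(fmt) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical timestamps.
stamps = [
    "2019-04-01T10:15:00",
    "2019-04-01T10:45:00",
    "2019-04-01T18:00:00",
    "2019-04-02T09:00:00",
]
hourly = bucket_counts(stamps, "hourly")
daily = bucket_counts(stamps, "daily")
total = bucket_counts(stamps, "total")
```

The hourly and daily outputs are ordered by bucket label, so they can feed a time series directly, while the total is a single number per group.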

vecna commented 5 years ago

Thanks for the progress, four questions :)

berli0z commented 5 years ago
* This aggregation is unique by postId, correct?

Exactly, no postId is repeated in the dataset. Sometimes the same post, if changed slightly, gets a different postId. That stays as is; further aggregation in that sense could be done.

* I'm afraid reactions can be misleading: do they make sense only if associated with the time of acquisition?

I get the reactions at the time of acquisition in fbtrex. They can be a bit misleading, but by taking the last value collected by the fbcrawler (which is always collected at 12:00 the day after), we can at least see the outcome of the reactions (not their development over time).
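
Picking the last collected value per post could look roughly like this. The field names `postId`, `collected_at` and `reactions` are assumptions, not the real fbcrawl column names, and the sample rows are invented.

```python
def latest_reactions(crawl_rows):
    """Keep, for each postId, the reactions count from the most recent collection."""
    latest = {}
    for row in crawl_rows:
        seen = latest.get(row["postId"])
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if seen is None or row["collected_at"] > seen["collected_at"]:
            latest[row["postId"]] = row
    return {post_id: int(row["reactions"]) for post_id, row in latest.items()}


# Hypothetical crawl rows: p1 was collected twice, p2 once.
crawl = [
    {"postId": "p1", "collected_at": "2019-04-01T12:00:00", "reactions": "4000"},
    {"postId": "p1", "collected_at": "2019-04-02T12:00:00", "reactions": "4726"},
    {"postId": "p2", "collected_at": "2019-04-02T12:00:00", "reactions": "12"},
]
final = latest_reactions(crawl)
```

This keeps only the final count per post, so it shows the outcome and discards how the reactions developed over time.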

* Is the text part of the CSV? Why strip it?

We can also keep the text. I am trying to design the datasets to be "as clean as possible". There is little chance to visualize texts as they are now, unless we are making a "summary explorer", which is another thing I have in mind. If we want to visualize texts, we can build a separate dataset optimized for that, which would be cleaner and faster to process in a visualization. That was my guess. Less is more when preparing datasets for visualization; keeping something extra just because it's there is not the most popular approach (as far as I've read around).

* How can you plot hourly? As far as I remember, the time returned by fbcrawl is only by day.

fbcrawl returns values up to a specific day, but every post in the dataset has a datetime value (basically publicationTime).