combine.py should allow comparisons between users or between users and fbcrawl output

berli0z commented 5 years ago

[x] Input any number of summaries as CSV files
[x] Produce a CSV or JSON file with the following format:

source	postId	date	total times seen	total users seeing	average times seen by user	average impressionOrder	reactions
"Douglas Adams"	46dfaba8fa9afs7	2019-04-01	100	20	5	7.473	4726

Furthermore, to prepare for using this tool, the README should state that you need two folders: one with all the fbtrex summaries and one with the fbcrawl source outputs as CSV files.
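
A minimal sketch of the aggregation described above, not the actual combine.py. It assumes each input row is one impression and that the summaries carry `source`, `postId`, `user`, `publicationTime`, `impressionTime`, `impressionOrder` and `reactions` columns; those column names, and the function names `load_rows` and `aggregate_by_post`, are assumptions made for illustration. The reactions value is taken from the most recent impression, which is only one possible choice.

```python
import csv
from collections import defaultdict
from statistics import mean


def load_rows(paths):
    """Read any number of summary CSV files into one list of dict rows."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


def aggregate_by_post(rows):
    """Collapse per-impression rows into one record per postId."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["postId"]].append(row)

    out = []
    for post_id, items in groups.items():
        users = {r["user"] for r in items}
        # Assumption: ISO 8601 timestamps, so string order equals time order.
        latest = max(items, key=lambda r: r["impressionTime"])
        out.append({
            "source": items[0]["source"],
            "postId": post_id,
            "date": items[0]["publicationTime"][:10],
            "total times seen": len(items),
            "total users seeing": len(users),
            "average times seen by user": len(items) / len(users),
            "average impressionOrder": mean(
                float(r["impressionOrder"]) for r in items
            ),
            "reactions": int(latest["reactions"]),
        })
    return out
```

With two folders of CSV files, something like `aggregate_by_post(load_rows(glob.glob("summaries/*.csv")))` would produce the rows, which can then be written out with `csv.DictWriter` or `json.dump`.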

berli0z commented 5 years ago

This has higher priority due to the upcoming eu19 initiative

berli0z commented 5 years ago

This could be output hourly or daily (for time series), or as a total aggregate (for pie charts and similar).
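
A rough sketch of how the three granularities could be handled, assuming each row carries an ISO 8601 timestamp. The `impressionTime` default, the `bucket_key` and `impressions_per_bucket` names and the bucket label formats are assumptions for illustration, not existing combine.py code.

```python
from collections import Counter
from datetime import datetime

# strftime patterns used to truncate a timestamp to the chosen granularity.
FORMATS = {"hourly": "%Y-%m-%d %H:00", "daily": "%Y-%m-%d"}


def bucket_key(timestamp, granularity):
    """Return the time bucket label for one ISO 8601 timestamp."""
    if granularity == "total":
        return "total"
    # fromisoformat accepts e.g. "2019-04-01T11:05:00"; a trailing "Z"
    # is only supported from Python 3.11 on.
    return datetime.fromisoformat(timestamp).strftime(FORMATS[granularity])


def impressions_per_bucket(rows, granularity="daily", time_field="impressionTime"):
    """Count rows per (postId, time bucket) pair."""
    return Counter(
        (row["postId"], bucket_key(row[time_field], granularity))
        for row in rows
    )
```

The hourly and daily counters map directly onto time series, while `"total"` gives one number per post for pie charts.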

vecna commented 5 years ago

Thanks for the progress, four questions :)

* This aggregation is unique by postId, correct?
* I'm afraid reactions can be misleading: do they make sense only if associated with the time of acquisition?
* The text is part of the CSV, so why strip it?
* How can you plot hourly? As far as I remember, the time returned by fbcrawl is only by day.

berli0z commented 5 years ago

* This aggregation is unique by postId, correct?

Exactly, no postId is repeated in the dataset. Sometimes the same post, if changed slightly, gets a different postId. Those stay separate for now; further aggregation in that sense could be done.

* I'm afraid reactions can be misleading: do they make sense only if associated with the time of acquisition?

I get the reactions at the time of acquisition in fbtrex. They can be a bit misleading, but by taking the last value collected by the fbcrawler (which is always collected at 12.00 the day after), we can at least see the reactions outcome (not the development over time).

* The text is part of the CSV, so why strip it?

We can also keep the text. The way I am trying to design the datasets is "as clean as possible". There is little chance of visualizing texts as they are now, unless we are making a "summary explorer", which is another thing I have in mind. If we want to visualize texts, we can build a separate dataset optimized for that, which would be cleaner and faster to process in a visualization. That was my guess. Less is more when preparing datasets for visualization; keeping something just because it's there is not the most popular approach (as far as I've read around).

* How can you plot hourly? As far as I remember, the time returned by fbcrawl is only by day.

fbcrawl returns the values up to a specific day, but every post in the dataset has a datetime value (basically publicationTime).
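
To illustrate the "last value collected" idea for reactions, here is a small sketch that keeps only the most recent crawl row per post. The `post_id` and `reactions` column names are assumed, and `crawled_at` is a hypothetical acquisition timestamp column that may not exist under that name in the fbcrawl CSV.

```python
def latest_reactions(crawl_rows):
    """Map each post to the reactions count from its most recent crawl row."""
    latest = {}
    for row in crawl_rows:
        post_id = row["post_id"]
        current = latest.get(post_id)
        # Assumption: crawled_at is ISO 8601, so string comparison
        # orders rows chronologically.
        if current is None or row["crawled_at"] > current["crawled_at"]:
            latest[post_id] = row
    return {post_id: int(row["reactions"]) for post_id, row in latest.items()}
```

This gives the final outcome per post only; plotting the development over time would require keeping every crawl row instead.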

tracking-exposed / dashboard

combine.py should allow comparisons between users or between users and fbcrawl output #6