Improve fake data generation

The fake data currently generated is sufficient to demonstrate the main visualisation for each of the supported files (with a few exceptions, such as YouTube Search History). However, it lacks the quality needed to demonstrate the analysis features: time series analysis, frequency analysis and topic extraction. To demonstrate these properly, at least some of the data needs to be based on real-world data.

For example, some of the text in tweets could be randomly selected from a corpus of real-world statements, which would better show off topic extraction and word frequency. Time series analysis is also poorly demonstrated by dates drawn from a uniform random distribution. It would be better either to use real-world event dates or to generate dates from a non-uniform distribution that takes into account factors such as the day of the week.
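As a rough illustration of the non-uniform date idea, the sketch below weights each calendar day by its day of week and each timestamp by hour of day. Every name here (`seededRandom`, `pickWeighted`, `generateTimestamps`) is hypothetical and not taken from the project, and the weight values are made-up assumptions rather than measured usage patterns.

```typescript
// All names and weights below are illustrative assumptions, not project code.

/** Small seedable PRNG (mulberry32) so the fake data is reproducible. */
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Returns an index chosen with probability proportional to its weight. */
function pickWeighted(weights: readonly number[], rand: () => number): number {
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = rand() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r < 0) return i;
  }
  return weights.length - 1; // guard against floating-point rounding
}

// Assumed relative activity per day of week, Sunday first (weekends busier).
const DAY_WEIGHTS = [1.5, 0.8, 0.8, 0.9, 1.0, 1.3, 1.7];
// Assumed relative activity per hour: quiet at night, busiest in the evening.
const HOUR_WEIGHTS = Array.from({ length: 24 }, (_, h) =>
  h < 7 ? 0.1 : h < 18 ? 1 : 2
);

const MS_PER_HOUR = 3600000;
const MS_PER_DAY = 24 * MS_PER_HOUR;

/** Generates `count` sorted UTC timestamps within `days` days of `start`. */
function generateTimestamps(
  start: Date,
  days: number,
  count: number,
  rand: () => number
): Date[] {
  const startDay = Date.UTC(
    start.getUTCFullYear(),
    start.getUTCMonth(),
    start.getUTCDate()
  );
  // One weight per calendar day in the range, based on its day of week.
  const perDayWeights = Array.from({ length: days }, (_, d) =>
    DAY_WEIGHTS[new Date(startDay + d * MS_PER_DAY).getUTCDay()]
  );
  const result: Date[] = [];
  for (let i = 0; i < count; i++) {
    const day = pickWeighted(perDayWeights, rand);
    const hour = pickWeighted(HOUR_WEIGHTS, rand);
    const offset = Math.floor(rand() * MS_PER_HOUR); // position within the hour
    result.push(
      new Date(startDay + day * MS_PER_DAY + hour * MS_PER_HOUR + offset)
    );
  }
  return result.sort((a, b) => a.getTime() - b.getTime());
}

// Example: 1000 timestamps over the ten weeks starting 3 January 2021.
const exampleDates = generateTimestamps(
  new Date(Date.UTC(2021, 0, 3)),
  70,
  1000,
  seededRandom(42)
);
```

Seeding the generator keeps the demo data stable between runs, which may make the time series charts easier to compare when the weights are tuned.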

Ideas for data sources:

Public tweet corpus, such as https://github.com/zfz/twitter_corpus/blob/master/full-corpus.csv
Kaggle SMS corpus, for WhatsApp and Telegram
YouTube-8M data set of YouTube videos, for search and watch history
Most-followed accounts on Instagram, for Instagram data
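
Whichever source is chosen, drawing fake text from it could look something like the sketch below. It assumes the corpus has already been loaded into an array of strings; `sampleFromCorpus`, `generateFakeTweets`, `FakeTweet` and the placeholder corpus lines are all hypothetical and are not taken from the project or from any of the data sets above.

```typescript
// Hypothetical shape of a generated tweet; the real export format may differ.
interface FakeTweet {
  id: number;
  text: string;
}

/** Picks `count` items from `corpus` at random, with replacement. */
function sampleFromCorpus<T>(
  corpus: readonly T[],
  count: number,
  rand: () => number = Math.random
): T[] {
  if (corpus.length === 0) {
    throw new Error("corpus must not be empty");
  }
  return Array.from(
    { length: count },
    () => corpus[Math.floor(rand() * corpus.length)]
  );
}

/** Builds fake tweets whose text comes from the given corpus. */
function generateFakeTweets(
  corpus: readonly string[],
  count: number,
  rand: () => number = Math.random
): FakeTweet[] {
  return sampleFromCorpus(corpus, count, rand).map((text, index) => ({
    id: index + 1,
    text,
  }));
}

// Placeholder lines standing in for rows read from a real corpus file.
const placeholderCorpus = [
  "placeholder statement about the weather",
  "placeholder statement about a football match",
  "placeholder statement about a new phone",
];

const exampleTweets = generateFakeTweets(placeholderCorpus, 5);
```

Sampling with replacement keeps the sketch short; if repeated text turns out to distort the word frequency view, sampling without replacement from a larger corpus would be one alternative.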

mrbrianevans/social-media-export-analyser#59