Unfiltered data and post-processing support

rhiever / reddit-analysis

A Python script that parses post titles, self-texts, and comments on reddit and makes word clouds out of the word frequencies.


Unfiltered data and post-processing support #32

Closed: bboe closed this issue 11 years ago

bboe commented 11 years ago

The program should save a raw, unfiltered version of a run to a file such as raw-r-<SUBREDDIT>-<PERIOD>.json or raw-u-<REDDITOR>.json (let's make it json since it's easier to load that way).

Then as it currently does it should produce a file with the filtered results.

When the program is run and an appropriate unfiltered json file already exists for the combination of user or subreddit and period, it should load the results from that file rather than fetching new data, then create a new output file using whatever filters are selected.

This way it's easy to apply filters after the initial pass if the results don't look very appealing.
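A rough sketch of that load-or-fetch flow, in case it helps the discussion. The function names (`raw_path`, `load_or_fetch`, `apply_filters`), the `fetch` callable, and the two example filters are all hypothetical and not taken from the script; only the file-name pattern comes from the proposal above.

```python
import json
import os


def raw_path(target, period=None, is_user=False):
    """Build the raw-file name from the naming scheme proposed above."""
    if is_user:
        return "raw-u-{0}.json".format(target)
    return "raw-r-{0}-{1}.json".format(target, period)


def load_or_fetch(path, fetch):
    """Return cached word counts if `path` exists, else fetch and cache them.

    `fetch` is a hypothetical zero-argument callable that returns a
    {word: count} dictionary; it is only called when no raw file exists.
    """
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    counts = fetch()
    with open(path, "w") as handle:
        json.dump(counts, handle, sort_keys=True, indent=4)
    return counts


def apply_filters(counts, excluded_words=(), min_count=1):
    """Illustrative filters only: drop excluded words and rare words."""
    excluded = set(excluded_words)
    return {word: count for word, count in counts.items()
            if word not in excluded and count >= min_count}
```

With something like this, a second run for the same subreddit and period would skip the fetch entirely and only re-apply whatever filters were selected.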

rhiever commented 11 years ago

Just a note for future coding. We probably want the JSON to look like this:

>>> import json
>>> print(json.dumps({'4': 5, '6': 7}, sort_keys=True,
...                  indent=4, separators=(',', ': ')))
{
    "4": 5,
    "6": 7
}

and replace the {'4': 5, '6': 7} with the popularWords dictionary.

rhiever commented 11 years ago

A thought on using JSON: won't it make it harder for novice users to extract the data from the output files to put into wordle?

bboe commented 11 years ago

So I was actually thinking these json files would be "invisible" to the user (stored in a tmp directory) and they just re-use the tool to change the filters. If they want an unfiltered version that's easy to manually edit, they can just re-run without filters.

rhiever commented 11 years ago

The potential problem with that is that they could have all these hidden JSON files building up in the temp directory, or lose the data because they don't know how to work with JSON files. I'm thinking it might be easier to always output a filtered csv file, then have a command-line option to also output a raw csv file (with no filtering whatsoever).

bboe commented 11 years ago

Good point. In that case I would default to always outputting the unfiltered current-format file, just in case.
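Something along these lines might work as a minimal sketch of writing both files on every run. The `word,count` row layout, the sort order, and the names `write_counts`, `write_outputs`, and `stop_words` are assumptions for illustration; the script's actual output format may differ.

```python
import csv


def write_counts(counts, handle):
    """Write one `word,count` row per word, most frequent first.

    The row layout is an assumption, not the script's confirmed format.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])


def write_outputs(counts, stop_words, raw_handle, filtered_handle):
    """Always write the unfiltered counts, then the filtered counts."""
    write_counts(counts, raw_handle)
    filtered = {word: count for word, count in counts.items()
                if word not in stop_words}
    write_counts(filtered, filtered_handle)
```

When writing to real files in Python 3, the handles would be opened with `newline=""`, as the `csv` module documentation recommends.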

rhiever commented 11 years ago

That works too. :+1: