New tokenizing lets URLs and image names slip through

rhiever / reddit-analysis

A Python script that parses post titles, self-texts, and comments on reddit and makes word clouds out of the word frequencies.


New tokenizing lets URLs and image names slip through #37

Closed. rhiever closed this issue 11 years ago.

rhiever commented 11 years ago

I had one run where i.imgur.com and somefilename.jpg showed up in the final results.

bboe commented 11 years ago

So I think this is where post-processing should be used. By itself, i.imgur.com won't get through, but something like i.imgur.com/blah.jpg will, and it gets parsed as:

['imgur.com', 'blah.jpg']

Is it really a problem though -- were those significant enough to show up in the graph? I doubt many people will write out the same URL without the http:// prefix.
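The post-processing idea above could look roughly like the sketch below. It is only an illustration, not code from the repository: `drop_url_fragments`, the extension list, and the short TLD list are all assumptions chosen for the example.

```python
import re

# Hypothetical list of image extensions to filter out; adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

# Rough pattern for domain-like tokens such as "imgur.com" or "i.imgur.com".
# The TLD list is an illustrative assumption and is far from exhaustive.
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+(?:com|net|org|io)$")


def drop_url_fragments(tokens):
    """Return the tokens with domain-like and image-filename tokens removed."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered.endswith(IMAGE_EXTENSIONS):
            continue  # looks like an image filename, e.g. "blah.jpg"
        if DOMAIN_RE.match(lowered):
            continue  # looks like a bare host name, e.g. "imgur.com"
        kept.append(token)
    return kept
```

Calling `drop_url_fragments(['imgur.com', 'blah.jpg', 'phone'])` would keep only `'phone'`. A filter like this runs after tokenizing, so it does not depend on how the tokenizer splits the URL.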

rhiever commented 11 years ago

> Is it really a problem though -- were those significant enough to show up in the graph?

Yep, /r/androidcirclejerk had big i.imgur.com and somefilename.jpg words in the word cloud. I had to manually remove them.

bboe commented 11 years ago

Hmm -- maybe the best thing to do is to sort the output file by count so it's easy to remove unwanted entries before running it through Wordle.

Edit: Ah, I think I have a fix.
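As a rough sketch of the sort-by-count suggestion: the snippet below orders word frequencies from most to least common so that unwanted high-count entries sit at the top of the file. The `word:count` line format and the `format_counts` helper are assumptions for illustration; the script's actual output format may differ.

```python
from collections import Counter


def format_counts(counts):
    """Return 'word:count' lines ordered from most to least frequent.

    Ties are broken alphabetically so the output is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["{}:{}".format(word, count) for word, count in ordered]


# Toy input standing in for the tokenizer's output.
counts = Counter(["phone", "imgur.com", "phone", "blah.jpg", "imgur.com", "phone"])
lines = format_counts(counts)
```

With the toy input, `lines` starts with the most frequent word, so a stray `imgur.com` entry is easy to spot and delete by hand.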

bboe commented 11 years ago

Fix specific to this case in 6b4ff465d5cda78eacc0280680089b129b95f411. It will still allow somefilename.jpg to be added if it appears by itself, but that's probably okay if it appears by itself often enough to be included in the graphic.
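The commit itself is not reproduced here. One way to get the behavior described (host/path URLs removed, bare filenames kept) is to strip scheme-less URLs before tokenizing. This is a minimal sketch under that assumption, not the actual fix; `strip_urls_with_path` and its pattern are hypothetical.

```python
import re

# Matches an optional scheme, a dotted host name, and a non-empty path,
# e.g. "i.imgur.com/blah.jpg" or "http://i.imgur.com/blah.jpg".
# A bare "somefilename.jpg" has no "/" after the dotted part, so it is kept.
URL_WITH_PATH_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}/\S+", re.IGNORECASE
)


def strip_urls_with_path(text):
    """Replace host/path URLs with a space, leaving bare filenames untouched."""
    return URL_WITH_PATH_RE.sub(" ", text)
```

Running this on `"see i.imgur.com/blah.jpg now"` leaves only `see` and `now` as words, while `"somefilename.jpg"` passes through unchanged.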

rhiever commented 11 years ago

:+1: