get_data shouldn't assume filenames

sbd.get_data shouldn't assume that you are passing in filenames.
Sometimes you already have the text in memory, so you wrap it in a
file-like object using StringIO.StringIO and pass that in.
See my code here:
   http://github.com/turian/common/blob/master/tokenizer.py
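
For illustration, here is a minimal sketch of wrapping in-memory text so that it looks like an open file. It uses io.StringIO, the Python 3 counterpart of StringIO.StringIO. read_lines is a hypothetical stand-in for any function that reads from an open handle; it is not part of splitta.

```python
import io


def read_lines(fh):
    """Hypothetical consumer: return the stripped, non-empty lines of an open handle."""
    return [line.strip() for line in fh if line.strip()]


text = "First sentence. Second sentence.\nThird sentence.\n"

# io.StringIO wraps a string in an object that can be iterated like an open file.
fh = io.StringIO(text)
lines = read_lines(fh)
fh.close()
```

Nothing in read_lines depends on a filename, which is the behaviour I would like get_data to have.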

A suggested workaround is to add an optional parameter files_already_opened.

The following diff implements this parameter (lines marked "<" are the
patched version, lines marked ">" are the original):

93c93
< def get_data(files, expect_labels=True, tokenize=False, verbose=False, files_already_opened=False):
---
> def get_data(files, expect_labels=True, tokenize=False, verbose=False):
108,111c108
<         if files_already_opened:
<             fh = file
<         else:
<             fh = open(file)
---
>         fh = open(file)
151,154c148
<         if files_already_opened:
<             pass
<         else:
<             fh.close()
---
>         fh.close()

I urge you to include the above patch, because then my tokenizer.py code
(GitHub link above) will work with splitta without anyone having to patch
splitta first.

A slightly cleaner implementation is to make get_data assume it is passed
file handles, and make a wrapper function get_data_from_filenames that will
open each file before calling get_data.
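
A rough sketch of that cleaner layout, written for Python 3. The get_data below is only a stand-in that collects lines; the real sbd.get_data takes further parameters (expect_labels, tokenize, verbose) and does much more, so treat the bodies as assumptions and only the split of responsibilities as the point.

```python
import io
import os
import tempfile


def get_data(handles):
    """Stand-in for sbd.get_data: reads open handles, never opens or closes them."""
    lines = []
    for fh in handles:
        lines.extend(line.rstrip("\n") for line in fh)
    return lines


def get_data_from_filenames(filenames):
    """Open each file, delegate to get_data, and close whatever was opened here."""
    handles = []
    try:
        for name in filenames:
            handles.append(open(name))
        return get_data(handles)
    finally:
        for fh in handles:
            fh.close()


# Callers with text in memory pass file-like objects directly.
in_memory = get_data([io.StringIO("Hello world.\nBye.\n")])

# Callers with paths on disk go through the wrapper.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("On disk.\n")
    path = tmp.name
from_disk = get_data_from_filenames([path])
os.remove(path)
```

With this split, the code that opens a file is also the code that closes it, so get_data needs no files_already_opened flag.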

Original issue reported on code.google.com by tur...@gmail.com on 12 Apr 2010 at 7:19

knighton / splitta

get_data shouldn't assume filenames #2