knighton / splitta

Automatically exported from code.google.com/p/splitta

get_data shouldn't assume filenames #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
sbd.get_data shouldn't assume that it is being passed filenames.
Sometimes you already have the text in memory; in that case you can turn it
into an open file stream using StringIO.StringIO and pass that in.
See my code here:
   http://github.com/turian/common/blob/master/tokenizer.py
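
As an illustration of that use case, the sketch below wraps a string in a file-like object. It uses io.StringIO, the Python 3 counterpart of StringIO.StringIO; count_lines is a hypothetical stand-in for any function that reads from an open handle and is not part of splitta.

```python
import io

def count_lines(fh):
    """Hypothetical consumer: reads from any open file-like object."""
    return sum(1 for _ in fh)

# Text that is already in memory, so there is no filename to pass.
text = "First sentence. Second sentence.\nThird sentence.\n"

# Wrap the string so it behaves like an open text file.
stream = io.StringIO(text)
n_lines = count_lines(stream)
```

A function that insists on calling open() on its argument cannot accept such a stream, which is the problem this issue describes.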

A suggested workaround is to add an optional parameter files_already_opened.

The following code implements this parameter:

93c93
< def get_data(files, expect_labels=True, tokenize=False, verbose=False, files_already_opened=False):
---
> def get_data(files, expect_labels=True, tokenize=False, verbose=False):
108,111c108
<         if files_already_opened:
<             fh = file
<         else:
<             fh = open(file)
---
>         fh = open(file)
151,154c148
<         if files_already_opened:
<             pass
<         else:
<             fh.close()
---
>         fh.close()

I urge you to include the above patch, because then my tokenizer.py code
(GitHub link above) will work with splitta without requiring users to patch it.

A slightly cleaner implementation is to make get_data assume it is passed
file handles, and make a wrapper function get_data_from_filenames that will
open each file before calling get_data.
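
A minimal sketch of that cleaner layout might look like the following. The body of get_data here is only a runnable stand-in (it collects stripped lines) and does not reproduce splitta's actual parsing; get_data_from_filenames is the hypothetical wrapper named above, written for Python 3.

```python
import io
from contextlib import ExitStack

def get_data(handles, expect_labels=True, tokenize=False, verbose=False):
    """Stand-in for sbd.get_data that reads from already-open handles.

    Assumption: the real function builds splitta's document structure;
    this placeholder just collects stripped lines so the sketch runs.
    """
    lines = []
    for fh in handles:
        for line in fh:
            lines.append(line.strip())
    return lines

def get_data_from_filenames(filenames, **kwargs):
    """Hypothetical wrapper: open each file, delegate, then close them all."""
    with ExitStack() as stack:
        # ExitStack closes every opened file, even if get_data raises.
        handles = [stack.enter_context(open(name)) for name in filenames]
        return get_data(handles, **kwargs)

# Callers that already have text can skip the wrapper entirely.
demo = get_data([io.StringIO("One.\nTwo.\n")])
```

One trade-off of this sketch is that the wrapper keeps all files open at once, whereas the existing loop opens and closes them one at a time.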

Original issue reported on code.google.com by tur...@gmail.com on 12 Apr 2010 at 7:19

GoogleCodeExporter commented 9 years ago
On that note, it is probably a good idea to add support for reading text from
stdin, as many pipelines expect. A very simple change seems to work great: I
supply a special file name "stdin" and conditionally replace the file with
sys.stdin.
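
A minimal sketch of that change, assuming a hypothetical helper open_input (not part of splitta) that get_data would call in place of open(file):

```python
import sys

def open_input(name):
    """Return (handle, should_close) for a filename.

    Assumption: the special name "stdin" selects sys.stdin, which the
    caller must not close; any other name is opened as a regular file.
    """
    if name == "stdin":
        return sys.stdin, False
    return open(name), True

def read_all(name):
    """Read the whole input, closing the handle only if we opened it."""
    fh, should_close = open_input(name)
    try:
        return fh.read()
    finally:
        if should_close:
            fh.close()
```

Returning the should_close flag keeps the existing fh.close() call from closing sys.stdin, which other parts of a pipeline may still need.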

Original comment by y...@semanticmachines.com on 17 Feb 2015 at 4:34