louismullie / treat

Natural language processing framework for Ruby.

Fix method contract bug in detect_format #40

Closed turadg closed 11 years ago

turadg commented 11 years ago

Calls to detect_format() passed a file, but the method expected a filename.

/Users/turadg/Code/Ruby/treat/lib/treat/entities/entity/buildable.rb:
  214      else
  215        fmt = Treat::Workers::Formatters::
  216:       Readers::Autoselect.detect_format(file,def_fmt)
  217        from_raw_file(file, fmt)
  218      end

/Users/turadg/Code/Ruby/treat/lib/treat/workers/formatters/readers/autoselect.rb:
   13    def self.read(document, options = {})
   14      options = DefaultOptions.merge(options)
   15:     fmt = detect_format(document.file, options[:default_to])
   16      Treat::Workers::Formatters::Readers.
   17      const_get(fmt.cc).read(document,options)
   18    end
   19    
   20:   def self.detect_format(filename, default_to = nil)
   21      
   22      default_to ||= DefaultOptions[:default_to]

3 matches across 2 files

This patch changes the parameter of detect_format to a file, which is what its callers pass.

louismullie commented 11 years ago

Using the Treat.paths.file setting allows you to change the download directory. Any reason why you specifically need to use Tempfile?

louismullie commented 11 years ago

Sorry, it should be Treat.paths.files.

turadg commented 11 years ago

For one, I didn't know about Treat.paths.files. ;)

But I also need an in-memory copy of the file contents, so I figure it's best to read from the URL directly into memory, then write to disk so Treat can process it. When there is already an in-memory copy of the file, why does Treat need it to be stored on disk as well?
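A minimal sketch of the workaround described here, assuming the contents are already held in a string and the consumer only accepts a path on disk. The helper name `with_tempfile_for` and its parameters are hypothetical and not part of Treat; only Ruby's standard `Tempfile` is used.

```ruby
require 'tempfile'

# Hypothetical helper (not part of Treat): write in-memory contents to a
# temporary file, yield its path to the block, and remove the file afterwards.
def with_tempfile_for(contents, extension = '.txt')
  Tempfile.create(['treat_input', extension]) do |file|
    file.binmode          # write the bytes exactly as given
    file.write(contents)
    file.flush            # make sure a reader opening the path sees everything
    yield file.path
  end
end

# Usage: the block receives a path that exists only for its duration.
# With Treat loaded, the block body could instead hand the path to
# Treat::Entities::Document.from_raw_file (the call suggested in this thread).
size_on_disk = with_tempfile_for('Some downloaded text.') { |path| File.size(path) }
```

The in-memory copy stays available to the caller, and the file on disk is cleaned up as soon as the block returns.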

louismullie commented 11 years ago

The idea is to make the download large-file-proof by streaming chunks of the download to disk rather than reading the whole file into memory directly. Eventually, I would want to integrate a similar lazy iteration process within the entity classes as well.
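The chunked approach can be sketched roughly as follows. This is an illustration of the general technique, not Treat's actual download code: `stream_to_disk`, `CHUNK_SIZE`, and the use of a generic IO object as the source are all assumptions made for the example.

```ruby
require 'stringio'

# Assumed chunk size for illustration; any reasonable buffer size works.
CHUNK_SIZE = 16 * 1024

# Hypothetical helper (not Treat's implementation): copy from any IO-like
# source to a file on disk one chunk at a time, so that at most chunk_size
# bytes of the payload are held in memory at once.
def stream_to_disk(source, path, chunk_size = CHUNK_SIZE)
  bytes_written = 0
  File.open(path, 'wb') do |out|
    # IO#read(n) returns nil at end of input, which ends the loop.
    while (chunk = source.read(chunk_size))
      bytes_written += out.write(chunk)
    end
  end
  bytes_written
end
```

For example, `stream_to_disk(StringIO.new('x' * 40_000), 'out.bin')` writes the data in three chunks and returns the total byte count. Ruby's built-in `IO.copy_stream` does the same job in one call; the explicit loop is shown only to make the chunking visible.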

For your use case, I would just do:

doc = Treat::Entities::Document.from_raw_file(file.path)