ocaml-batteries-team / batteries-included

Batteries Included project
http://ocaml-batteries-team.github.com/batteries-included/hdoc2/
Other
519 stars 106 forks source link

batteries' input_line is slower than the stdlib's one #520

Open UnixJunkie opened 10 years ago

UnixJunkie commented 10 years ago

I can read a 160041 lines file ~4.9 times faster using Legacy.input_line...

c-cube commented 10 years ago

I think we need many more benchmarks...

UnixJunkie commented 10 years ago

The culprit looks like BatIO and Enum, that add some wrapping around each call to read_line.

agarwal commented 10 years ago

You might be interested in this thread [1] on the ocsigen list too. Some of the ideas are not lwt specific.

On Wed, Jan 29, 2014 at 4:08 AM, Francois Berenger <notifications@github.com

wrote:

The culprit looks like BatIO and Enum, that add some wrapping around each call to read_line.

Reply to this email directly or view it on GitHubhttps://github.com/ocaml-batteries-team/batteries-included/issues/520#issuecomment-33567114 .

hcarty commented 10 years ago

@agarwal I think you missed the link in your comment.

agarwal commented 10 years ago

Sorry. Here it is:

[1] https://sympa.inria.fr/sympa/arc/ocsigen/2013-12/msg00001.html

On Wed, Jan 29, 2014 at 10:35 AM, Hezekiah M. Carty < notifications@github.com> wrote:

@agarwal https://github.com/agarwal I think you missed the link in your comment.

Reply to this email directly or view it on GitHubhttps://github.com/ocaml-batteries-team/batteries-included/issues/520#issuecomment-33595086 .

c-cube commented 10 years ago

I looked at the code, and the source of slowness is probably twofold:

UnixJunkie commented 10 years ago

I used File.lines_of, but yes then at some point it calls BatIO.lines_of.

UnixJunkie commented 10 years ago

I think I should close this (since I created it): this is a too old performance regression compared to the stdlib that was introduced too long time ago.

gasche commented 10 years ago

I don't think there is anything wrong with fixing old issues. c-cube looks interested in the performance aspect of BatIO (I personally tend to loathe IO-related stuff, so I should apologize for happily staying away of the discussion). It would help to have your actual benchmark code, though -- I would guess that most reasons we suspect could cause this regression will turn out not to matter that much in a realistic workflow, with one actual suspect being guitly by a large margin.

Thank you for looking at this!

c-cube commented 10 years ago

Yet another argument to put IO in a separate library imho: if people want to use batteries in combination with libraries that use the standard input and ouput channels (and know what happens underneath, IO contains some bloat, like weak sets and whatnot).

rgrinberg commented 10 years ago

Yep. I'm also pretty sure that IO brings in Unix. Separating the core of batteries from Unix is one of the main goals of refactoring.

UnixJunkie commented 10 years ago

Implementing a wc -l in ocaml with batteries (File.lines_of) is enough to see the problem. The same program using (Legacy.open_in, Legacy.input_line, Legacy.close_in) will be faster I bet. Here is my version trying to avoid batteries' IO:

open Batteries
open Printf

module MU = My_utils

let with_in_file fn f =
  let input = Legacy.open_in fn in
  let res = f input in
  Legacy.close_in input;
  res

let main () =

  let nb_lines = ref 0 in

  let _all_lines =
    with_in_file Sys.argv.(1) (fun input ->
      let res, _eof =
        MU.unfold_exc
          (fun () -> let l = Legacy.input_line input in
                     incr nb_lines;
                     l
          )
      in
      res
    )
  in
  printf "init %d lines\n" !nb_lines;
;;

main ()

MU.unfold_exc is the new constructor I am pushing for in BatList.