Slow lines reading (even slower than python)

scriptum commented 9 years ago

I tested only naive implementations for C, Nim, Python in Linux x86_64:

/* C */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char *buf = NULL;
    size_t i = 0, len;
    while(getline(&buf, &len, stdin) != -1) i++;
    printf("%d\n", i);
    free(buf);
}

# Python
import sys
i = 0
for line in sys.stdin:
    i+=1
print i

# Nim
var i = 0
# ugly but faster because of string copying
proc main()=
  for line in stdin.lines:
    inc i
  echo i
main()

Generate test data (100MB):

tr -c [:print:] ' ' </dev/urandom | fold -w1000 | head -100000 > test.data

And test like this: ./test < test.data

Results:

Nim (gcc -O2): 0.62
Nim (gcc -O3): 0.31
Python: 0.059
C: 0.031

def- commented 9 years ago

This is mostly caused by Nim having its unique handling of newlines, see also http://forum.nim-lang.org/t/503 . Using the idea from this thread it's a bit faster, but still slower than Python's and C's implementation: https://github.com/def-/nim-unsorted/blob/master/bufferedfile.nim

The easiest solution would be to use C's definition of newlines and inherit the speed. Otherwise here's the code, maybe you have some other ideas: https://github.com/nim-lang/Nim/blob/devel/lib/system/sysio.nim#L89-L122 (Maybe resizing the result string can be done a bit more cleverly)

scriptum commented 9 years ago

Why not to use fgets or getline? Or just fread_unlocked/mmap next strchr?

def- commented 9 years ago

getline is POSIX specific. Both fgets and getline don't handle CR LF. And iirc it was platform dependent how fgets handles CR LF.

Varriount commented 9 years ago

Wait, Python handles reading newlines the same way as Nim, so that can't be the reason for the slowness... Has anyone ever actually compared the two implementations?

https://docs.python.org/2.3/whatsnew/node7.html

scriptum commented 9 years ago

Wait, Python handles reading newlines the same way as Nim, so that can't be the reason for the slowness... Has anyone ever actually compared the two implementations?

https://github.com/python-git/python/blob/master/Objects/fileobject.c#L2669

It uses fread as I suggested.

Not fastest solution but this is better than in Nim. It's better to use optimized strchr from libc or something similar.

UPD: It's better to combine with optimized memchr for searching two chars (CL+RF) instead of one: https://github.com/esmil/musl/blob/master/src/string/memchr.c If anyone knows how to deal with aligned memory in Nim:)

Araq commented 9 years ago

Which OS are you talking about?

scriptum commented 9 years ago

Linux

I also tested -flto flag which improves performance for 10%

jyapayne commented 9 years ago

Why can't you do a getline or fgets and then just check if the last character is a CR? If it is, remove it from the string, otherwise return the full string.

Varriount commented 9 years ago

@jyapane That only works if the system the program is running on uses LF as its newline character. On Windows (where the newline is CR-LF) single LF characters would be skipped over.

jyapayne commented 9 years ago

How would single LF characters be skipped?

Steps:

Use fgets to parse up to and including an LF character
If the second last character is a CR, remove both the CR and LF characters
Otherwise, just remove the LF character

Wouldn't that take care of both CR-LF and LF?

scriptum commented 9 years ago

There could be a rare case of single CR (old Mac maybe) and fgets could skip that. Python handles all cases: single CR, single LF, CR+LF, its implementation much better than Nim but it's very ineffective too. fgets uses power of sse acceleration and searches 3-4 times faster than naive while loop. By the way this is a problem of all Nim's string operations (it doesn't use glibc in most cases and may be slower than scripting languages which use libc).

UPD. I just made a test with only CR. Python 2 fails, Python 3 works well. Seems it's a very rare case.

c-blake commented 9 years ago

On my Linux system, this program runs as fast as the C analogue:

proc os_getline(pcline: ptr cstring, nAlloc: ptr int,
                stream: File): int {.importc: "getline", header: "<stdio.h>".}

when isMainModule:
  proc main() =
    var counter: int = 0
    var cline: cstring
    var nAlloc: int
    while true:
      var length = os_getline(addr(cline), addr(nAlloc), stdin)
      if length == -1: break
      inc(counter)
    echo counter
  main()

jyapayne - the fgets approach is tricky in that to avoid arbitrary limits you need to figure out how large a buffer to use and strlen/memchr/etc for the end of the line which is not returned..potentially growing the buffer / using secondary buffers as needed until the actual end of line is found. With POSIX 2008+ getline you can do something like this which should work for all but the more general MacOS9 and earlier \r only or \r\n delimiting:

iterator getLines*(stream: File): string {.inline.} =
  proc os_getline(pcline: ptr cstring, nAlloc: ptr int,
                  stream: File): int {.importc: "getline", header: "<stdio.h>".}
  proc free(pointr: cstring) {.importc: "free", header: "<stdlib.h>".}
  var cline: cstring
  var nAlloc: int
  while true:
    var length = os_getline(addr(cline), addr(nAlloc), stream)
    if length == -1: break
    if length > 1 and cline[length - 2] == '\r':
        cline[length - 2] = '\l'
        cline[length - 1] = '\0'
        dec(length)
    yield $cline  # Emits cstrToNimStr with an unneeded strlen; yield toNimStr(cline, length) better
  free(cline)

when isMainModule:
  proc main() =
    var counter = 0
    for line in getLines(stdin):
      inc(counter)
    echo counter
  main()

This does take almost 2X as long as just counting lines and discarding them, but that is all just conversion to GC'd Nim strings.

jyapayne commented 9 years ago

@scriptum Ah, I see. I didn't realize that the old Mac OS used only CR to delimit lines. Then yes, my solution wouldn't work in that case. Your test with python 2/3 is interesting, though. Is the python 2 version faster than the python 3 version?

@c-blake your solutions are very similar to what I came up with and are very fast. However, you bring up a good point about fgets not being able to easily handle arbitrary lengths. That is indeed a problem and would impact performance to try and work around.

I realize getline isn't available on Windows, so to use any solution involving that, we'd have to rewrite that function to be cross platform or expect the user to have a GNU c compiler, the latter of which is unreasonable for a simple lines iterator. It's a tricky problem. If the Python 3 C source for handling lines is fast, we could maybe copy that?

c-blake commented 9 years ago

A few points that may be worth noting here, in no particular order.

"getline()" is usually just a thin wrapper around "getdelim()" standardized at the same time. getdelim() can also handle the purely '\r' delimited case, but only when specifically instructed, not as the current very flexible Nim reader does it. Calling the C getline()/getdelim () is the only way to really match C performance of those calls without having our own buffered IO Nim libs around low-level file handles/descriptors. Even doing fread()/etc. you will have extra copies, memchrs and such - honestly all just about as complex as buffered IO libs in Nim in the first place - yet slower. Also, these are about the C runtime library, not the gcc C compiler, and getdelim/getline are available all ove the place not just Linux - IBM's AIX, BSD's, etc.

Future availability of getline()/getdelim() as POSIX specified on Win32/Win64 is possible, but unlikely. It's been over 7 years. Also, consider the way that pair of calls is factored where getdelim() takes a 'char' not a string. That basically assumes lines are delimited by single chars not strings like "\r\n". The latter is actually how Windows delimits lines in text files. Even so, on Unix getline() retains the delimiter and would just end up keeping both delimiters on Windows. So, it's not completely crazy names/semantics on Windows, and in an ideal world they would add it tomorrow, but it doesn't seem to be a priority for Microsoft. Even if Microsoft did add it tomorrow, though, there would still be many legacy versions of Windows Nim would probably like to support for a long time.

You might think from all this The Right Way to go is doing your own buffered IO libs in Nim. Doing that, though, "stdin" in Nim (or stdout/stderr/whatever) will not share buffering with any C stdio calls. That can also be inconvenient in terms of interoperability with other wrapped C calls, for example. I don't really see a perfect solution here.

If you gave up on archaic MacOS 9 formats for Nim's readLine and just went with \n or \r\n as per my sketch above, then the "when POSIX" version could at least be as fast as C, and only the "when Windows|Other" version would be slower. It would make most sense to make that Other solution less flexible so that the Nim code was functionally identical on both platforms. That might be an ok tradeoff. A single "tr '\r' '\n'" converts the archaic MacOS format to one still in modern use, namely Unix. Anyway, a downside there is just "accepting" the slower than fgets performance on Windows if Microsoft never steps up { which doesn't impact me, but I can see how it might bother others }.

Another alternative is that to have a "bounded size input" iterator/readLine that just called fgets and truncated too long lines. I think that could be made basically as fast as C (maybe just one more strlen) and only use standard calls, but just have the "so last millenium" arbitary limit property.

Personally, I always run on Unix and I just use my own getLines() iterator when performance matters. Given how frequently this issue seems to arise (a few times per year or so...?), it may make sense to have something like getLines() in the standard library extended with a not as fast "portable" branch..and maybe give it a name/place to not suggest it is as flexible as the current readLine()/lines().

scriptum commented 9 years ago

@jyapayne python results are interesting: python 3.4: 0.224 python 2.7: 0.98

@c-blake I think fread + memchr should be portable enough, isn't?

c-blake commented 9 years ago

Sure, fread + memchr or repeated fgets are all plenty portable...There's just a trade off between portability and maximum performance due to the doubled up buffer management, memory copies, etc (inside the FILE* handling and outside). How bad that performance hit is surely depends a lot on details both of implementation and inputs (e.g., consistency of line lengths, etc.). I surely don't mean to discourage any particular attempt/approach.

scriptum commented 9 years ago

fread (on Linux at least) uses ultra-fast mmap with MAP_POPULATE flag. Actually, wc, which gives best result (0.019 for this test data), uses fread: https://github.com/coreutils/coreutils/blob/master/src/wc.c#L274

c-blake commented 9 years ago

Sure, sure..mmap() is the best. That particular optimization relies upon the stream being backed by a real seekable file, not say an input pipe...So, "wc < file" will be faster than "cat | wc", for example. Extra work (and time) not dissimilar to what "|wc" has to do over and above "wc<", however, will also be in a Nim layer on top of an fread() like interface (allocations, copying, memchrs, etc). On that note, don't forget when making comparisons that counting lines is less work than building strings to be used somehow, possibly with long lifetimes.

It would indeed be interesting to see if you could engage in similar optimization in Nim to the glibc fread() optimization you mention -- have a lines()-like iterator mmap() a whole file read only, frame lines via memchr('\n')s, and create Nim strings from pointers to line beginnings all with zero copies (somehow!).

I'm not sure the GC interface and/or Nim string creation is flexible enough to allow that, though { either at the moment or ever }. At the very least, the usual Nim string ending in a NUL '\0' byte (also for ease of C interoperability) would suggest you would need at least one strdup()-like work per line. You could code up a new kind of string in Nim, UnterminatedString or maybe MemRegion, say, that might be able to be constructed from regions of a file map. Also, Windows has memory mapped files. So, it's possibly unproblematic to make this at least somewhat portable.

Alternatively, if you mapped the file read-write-private and replaced \n with \0 you might be able to not modify the file, nor the existing string type, but have a lines()-like iterator that always took out the line ending (to let the \n go to \0). I expect that would have a large performance hit from page faults due to a copy-on-write impl of "private", though, probably nullifying other charms. An unterminated string approach is more likely to bear performance fruit for you if truly maximum performance is your aim.

scriptum commented 9 years ago

Alternatively, if you mapped the file read-write-private and replaced \n with \0 you might be able to not modify the file

Writable maps are two times slower in my tests on Linux (slower than fread + strdup) while fread has similar speed to mmap. BTW old kernels may not have some modern mmap features.

You could code up a new kind of string in Nim, UnterminatedString or maybe MemRegion, say, that might be able to be constructed from regions of a file map.

It sounds good, but different platforms has different mmap implementations depending on underlying kernel. Using libc's fread guarantees best possible file reading speed for any platform and not painful to implement. I'm not strong in Nim but take a look:

# as fast as wc -l

proc os_fread(buf: cstring, size, nmemb: csize, stream: File): csize
  {.importc: "fread", header: "<stdio.h>".}
proc os_memchr(buf: cstring, ch:cint, n: csize): cstring
  {.importc: "memchr", header: "<string.h>".}

when isMainModule:
  proc main() =
    var
      buffer: array[16*1024, char]
      bytes: csize
      i: int
      line = ""
    while true:
      bytes = os_fread(buffer, 1, sizeof(buffer), stdin)
      if bytes == 0:
        break
      var
        p = cast[int](buffer)
        e = cast[int](p)+bytes
      while true:
        p = cast[int](os_memchr(cast[cstring](p), ord('\l'), e - p))
        if p == 0:
          break
        inc i
        inc p
    echo i
  main()

I think this could be easily adapted for line reading. One may add another memchr for searching '\r' and it will have documented behavior. I'm too tired today but later I could contribute accelerated line reading based on wc source.

c-blake commented 9 years ago

Using libc's fread guarantees best possible file reading speed for any platform and not painful to implement.

Microsoft's libc (or others) may well not do the fread-on-seekables memory map optimization (now or ever), but it could still be done using some when Windows use FileMapping type switch.

That caveat made, I don't want to discourage your fread()ing which would always have been necessary for unseekable inputs anyhow. It is also almost surely faster than the current Nim impl, and I bet Araq is open to pull requests on this if you preserve the functionality and benchmark the advantage. Besides the memchr for \r you mention, you will also need some looping and buffer management to accept lines of arbitrary length (not just less than sizeof(buffer) whatever that limit is), and also some string creation. That will all add some time, as it must. Good luck!

c-blake commented 9 years ago

I agree with the approach def took in his patch, though it will be up to Araq how much backward feature compatibility he wants in terms of \r-delimited files. I was going to comment here that fgets rather than fread would be needed (for any case where you are not necessarily going to parse the whole file all at once).

@scriptum - if you really want/need "as fast as wc -l" then I think you need to give up on the abstraction of C's stdio streams and of always yielding a new Nim string. This does not mean you cannot still be usefully abstract. For example, the below library/program defines an iterator that I was able to time at within 1% (i.e. 1.009X run time) of wc -l on my Intel Sandy-Bridge E system. With only slightly more sophistication on the part of the (presumably performance ambitious) caller, it is easy with this approach to have the caller only conditionally create Nim strings, perhaps after easy preliminary filtering like checking a line's prefix or something.

import memfiles

type Record {.unchecked.} = object
  Beg*: pointer
  Len*: int

iterator records*(path: string, delim = '\l'): Record {.inline.} =
  template `+!`(p, i): expr = cast[type(p)](cast[int](p) +% i)
  proc c_memchr(cstr: cstring, c: char, n: csize): cstring {.
       importc: "memchr", header: "<string.h>" .}
  var rec: Record
  var mfile = memfiles.open(path)
  var remaining = mfile.size
  var End: int
  rec.Beg = mfile.mem
  while remaining > 0:
    End = cast[int](c_memchr(cast[cstring](rec.Beg), delim, remaining))
    if End == 0:                # Final delimited record is unterminated
      rec.Len = remaining
      yield rec
      break
    rec.Len = End - cast[int](rec.Beg)
    yield rec
    rec.Beg = rec.Beg +! (rec.Len + 1)
    remaining -= rec.Len + 1
  close(mfile)

proc toString*(rec: Record) : string =
  proc toNimStr(str: cstring, len: int): string {. importc: "toNimStr" .}
  result = toNimStr(cast[cstring](rec.Beg), rec.Len)
  result[result.len] = '\0'

when isMainModule:
  import os
  proc main() =
    var counter = 0
    if paramCount() > 0:
      for rec in records("/proc/self/fd/0"):     # Linux specific /proc/self trick to mmap stdin
        inc(counter)
    else:
      for rec in records("/proc/self/fd/0"):
        discard toString(rec)
    echo counter
  main()

The program can be run either with no arguments or any argument. When compiled and run as "./records just_the_count", it is about 3X faster than a run where the GC'd strings are created (for files fully buffered in RAM on glibc 2.21, and file sizes that fit in the L3 cache). The fast version is also about 3.5X faster than def's new proposal (but the slow version is only like 20% faster than his).

My profiling suggests that the memchr()s alone is about 75% of the run time for the fast version. So, barring improvements in the SSE code for that, it doesn't seem like it can get much better. Anyway, as mentioned wc -l fundamentally does less work than a readLine or a lines iterator does/should do. These are just some numbers to ballpark such in a Nim context, and also a module you might find useful.

c-blake commented 9 years ago

Oh, and of course one could always have the records iterator take a MemFile instead of a path. That would be nice for callers that wanted to manage memory permissions and/or the lifetime of the validity of pointers in the returned records. It might be worthwhile adding something like that iterator to lib/pure/memfiles.nim so that in the future if someone complains about input performance we can say, "well, if you know you have seekable input just import this and go and maybe apply toString()...". It would not be so hard to add the \r\n rules, or indeed delimiting by any substring like ":MY VERBOSE SEPARATOR:", or perhaps just have several versions of the iterator for various styles of record delimiting. Performance for longer delimiters is usually optimized by using the fastest bandwidth routine (like memchr) to find the character in the substring which is least likely to be internal to a record. The last property is obviously data specific, but passing it as an additional argument could be a performance-smart API.

If you have repeatedly used data files and really need to drop data load times to truly near zero (at least constant on the order of 10s of microseconds) you can "save the answer". E.g, you can create a simple command that builds a helper file of line start offsets. This can do the memchr()s or substring searches "once and for all". The index file can be binary ints if you don't care about endianness/portability or it can be some fixed sized ASCII representation (equal sizes makes random access easy) with a tiny bit of parsing/printing overhead. You just need some naming convention for the index files like file pairs with /tmp/original/path/original/file or maybe file pairs with a dotfile /original/path/original/.file or something. If input files only change in an append-only sense like logs, mailboxes or such then it's also easy to organize the indexer program to check timestamps and not do anything or check file sizes and the final offset and only process the new data to update the index file. Of course, this approach creates file management complications and is perhaps too application-specific to have in the Nim stdlib. All told, it can make analyzers of log-like things lightning fast, though.

scriptum commented 9 years ago

@c-blake of course for complicated and particular case I would better to use platform dependent solution with mmap and so on. But I tried to use trivial naive line reading (to parse something) and faced with surprising slowness:

not optimizing for line in stdin.lines into some form of var line; for readLine(stdin, line) - one string copy per line (if strings are immutable, could it be optimized?)
making another string copy if not in proc - not graceful code
slowest possible solution - reading by char

It would be nice to achieve at least C performance with getline/getdilim which makes dynamic string buffer. It would be perfect to eliminate string allocation if string not used bot not necessary.

c-blake commented 9 years ago

It's not unreasonable to want simple stuff everyone uses to be pretty fast and readily available. The slowest possible/reading char-by-char solution has elicited complaints for a long time. It recently got quite a bit faster (priorly using the locked getc all the time).

Some abstractions do "cost something", though. The iterator getting separate string objects for each loop vs the "var line; while readLine" approach is like that - the while loop is explicitly re-using a buffer while the iterator provides "autonomous values". Nim '=' assignment can be another gotcha there. The right answer is usually a "tower" of interfaces of different levels of abstraction.

Zero/fewer copy interfaces usually burden callers at least a little more, though. It's not much more work/thought to write the while loop if you want to peformance tune. It's not that much more work again to just use 'for rec in records(memfiles.open("/dev/stdin")): maybe(toString(rec))'.

With the last can get you basically wc -l-like performance and flexibility (assuming my pull request https://github.com/nim-lang/Nim/pull/3146 gets accepted in some form or another). The only caveat is you need a real file, not a pipe or tty. Well, there are other caveats like controlling lifetime of the MemFile, doing toString for anything you want to keep or use Nim-esque APIs on, etc.

scriptum commented 9 years ago

Here is a buffered reading with documented CR/LF support. It's messy and needs polishing but I want to show principle. Same speed as C (even with string construction):

proc os_fread(buf: cstring, size, nmemb: csize, stream: File): csize
  {.importc: "fread", header: "<stdio.h>".}
proc os_memchr(buf: cstring, ch:cint, n: csize): cstring
  {.importc: "memchr", header: "<string.h>".}
proc c_memcpy(a, b: pointer, size: int) {.header: "<string.h>", importc: "memcpy".}

when isMainModule:
  template `$$`(s): cstring = cast[cstring](s)
  template `$$`(s: string): cstring = cstring(s)
  template `$!`(s): int = cast[int](s)
  proc native() =
    var i = 0
    for line in stdin.lines:
      i.inc
    echo i
  proc main() =
    var
      buffer: array[16*1024, char] # fastest on most modern CPUs, fits L1 cache
      bytes: csize
      i: int
      line = ""
      lastCR = false
    while true:
      bytes = os_fread(buffer, 1, sizeof(buffer), stdin)
      if bytes == 0:
        break
      var
        p = $!(buffer)
        e = $!(p) + bytes
      if lastCR == true and buffer[0] == '\l': # handle CR + LF on buffer edge
        inc p
      lastCR = false
      while true:
        var
          ee = e
          lineptr, buflen: int
        let
          bufptr = p
          p1 = $!(os_memchr($$(p), ord('\l'), e - p))
        if p1 != 0:
          ee = p1
        # search CR before LF
        let p2 = $!(os_memchr($$(p), ord('\c'), ee - p))
        lineptr = $!($$(line)) + len(line)
        if p1 != 0: # LF found
          if p2 == 0: # CR not found
            buflen = p1 - p
            p = p1
          else: # CR found - there are two possible variants
            buflen = p2 - p
            if p2 == p1 - 1: # CR + LF (Win)
              p = p1
            else: # CR somewhere before LF (weird old Mac mixed with LF)
              p = p2
        else: # LF not found
          if p2 != 0: # CR found
            buflen = p2 - p
            p = p2
            if p2 == e - 1: # CR at end of buffer - maybe this is part of CR + LF
              lastCR = true
          else: # CR not found - buffer overflow
            buflen = e - p
            line.setLen(len(line) + buflen)
            # append buffer to string
            c_memcpy($$(lineptr), $$(bufptr), buflen)
            break
        if buflen > 0:
          line.setLen(len(line) + buflen)
          c_memcpy($$(lineptr), $$(bufptr), buflen)
        # echo line
        line.setLen(0)
        inc i
        inc p
    echo i
  main()
  # native()

I hope this will help with #3143

jlp765 commented 9 years ago

Using devel branch 0.11.3, Nim is faster than python on my win7 machine. (You do need to compile with -d:release) I think this is no longer an issue.

scriptum commented 9 years ago

@jlp765 this is because of recent #3143. Since it solves #3084 it may be closed.

ahcm commented 5 years ago

On Linux and Mac OS nim 0.20.2 is slower than python. The nim / C =~3.75, longer lines seem to make it worse even.

def- commented 5 years ago

Try -d:danger in order to prevent runtime checks.

nim-lang / Nim

Slow lines reading (even slower than python) #3084