rcedgar / urmap

URMAP ultra-fast read mapper
GNU General Public License v2.0
39 stars 11 forks source link

stdin for -map not working due to unnecessary optimizations #10

Closed dogarithm closed 2 years ago

dogarithm commented 2 years ago

https://github.com/rcedgar/urmap/blob/836dd6fe19309509d721c4164408f00d303a3c9c/src/myutils.cpp#L516

This function appears to avoid using fread(Buffer, 1, Bytes, f) in favor of using fread(Buffer, Bytes, 1, f). Doing it this way means that the number of bytes read is not returned; the function calculates this using the relative difference of file position via ftello(). However:

1) There is no performance benefit to doing it this way in any environment that I know of. 2) ftello() will not work on a stream such as STDIN.

I therefore recommend something like:

uint32 ReadStdioFile_NoFail(FILE *f, void *Buffer, uint32 Bytes)
        {
        asserta(f != 0);
        size_t ElementsRead = fread(Buffer, 1, Bytes, f);
        uint32 BytesRead = uint32(ElementsRead);
        IncIO(f, BytesRead);
        return BytesRead;
        }

There are many places where this "optimization" pattern is used and I defer to your judgement to address this issue. Thanks.

rcedgar commented 2 years ago

You are exactly right. This is an unintended consequence of sub-optimal implementation of some old subroutines. From my perspective, it is not a priority to optimize input from a pipe, saving to a temporary file is a reasonable work-around.

dogarithm commented 2 years ago

I understand. If, on our end, we feel that it is important to support pipe input, we will submit a pull request. One possibly interesting use case is online mapping. In this mode of operation, new map requests can be submitted on-the-fly to a long-lived or persistent urmap process (just by feeding it another input record), amortizing the startup cost and providing real-time and parallelized map responses. With the pipe input fixes (*), this seems to work pretty well. I believe this mode of operation may represent a significant upgrade to urmap's capabilities, and cannot be supported via the workaround you described.

(*) along with some modifications to the line reading code, basically just replacing it with fgets instead of using custom buffering