schveiguy / iopipe

D language library for modular io
Boost Software License 1.0

iopipe fails with "Program exited with code -9" counting lines in file #32

Closed abrown25 closed 4 years ago

abrown25 commented 4 years ago

With this file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz the following code fails:

#!/usr/bin/env dub
/+ dub.sdl:
name "hello"
dependency "iopipe" version="~>0.2.0"
dependency "io" version="*"
+/

import std.stdio;
import std.typecons;
import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import std.io : File = File;

void main()
{
    auto counter = 0;

    // Decompress the gzipped file and count its lines.
    foreach (line; File("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz").refCounted.bufd.unzip(CompressionFormat.gzip).assumeText.byLineRange!false)
    {
        counter++;
    }

    writeln(counter);
}

schveiguy commented 4 years ago

Hm... when I test on my machine, it hangs. Doesn't even count one line.

abrown25 commented 4 years ago

Thanks very much for looking into this. If I replace the iopipe version with: dependency "iopipe" version="~>0.1.7" then it works:

Performing "debug" build using dmd for x86_64.
io 0.2.5: target for configuration "library" is up to date.
iopipe 0.1.7: target for configuration "library" is up to date.
hello ~master: building configuration "application"...
Linking...
To force a rebuild of up-to-date targets, run again with --force.
Running ./hello
1103800

which matches python:

python -c "import gzip; print(sum(1 for x in gzip.open('../ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz', 'rt')))"
1103800

I'm running this on Ubuntu 20.04 if that helps explain why I get a -9 exit code and it hangs for you. It also doesn't need to read the whole file to fail. I get the same outcome if I put a break in after reading 5 lines.

schveiguy commented 4 years ago

It's not reading the file properly. I'm still trying to figure out what's happening, but it never finds a line end, so the system runs out of memory.
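As a rough illustration of that failure mode, here is a minimal sketch of a line splitter that buffers input until it sees a delimiter. It is not iopipe's code and says nothing about the actual bug; `peakBufferSize` and its parameters are hypothetical names made up for this example. It only shows that when the delimiter is never matched, nothing can be released and the buffer grows to the size of the whole input.

```d
import std.string : indexOf;

/// Hypothetical helper, not part of iopipe: feeds `input` into a growing
/// buffer `chunkSize` characters at a time, discards every complete line,
/// and returns the largest size the buffer reached.
size_t peakBufferSize(const(char)[] input, size_t chunkSize, char delim = '\n')
{
    assert(chunkSize > 0, "chunkSize must be positive");

    char[] buffer;
    size_t peak = 0;

    for (size_t pos = 0; pos < input.length;)
    {
        // Simulate one read from the underlying stream.
        immutable end = pos + chunkSize < input.length ? pos + chunkSize : input.length;
        buffer ~= input[pos .. end];
        pos = end;
        if (buffer.length > peak)
            peak = buffer.length;

        // Release every complete line; whatever is left is a partial line.
        for (auto idx = buffer.indexOf(delim); idx >= 0; idx = buffer.indexOf(delim))
            buffer = buffer[idx + 1 .. $];
    }
    return peak;
}

unittest
{
    // Delimiter found: the buffer stays close to one chunk in size.
    assert(peakBufferSize("ab\ncd\nef\n", 4) == 5);
    // Delimiter never found: the buffer grows to the size of the whole input.
    assert(peakBufferSize("ab\ncd\nef\n", 4, '\r') == 9);
}
```

With a multi-gigabyte decompressed stream in place of the nine-character string, the second case is the one that would exhaust memory.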

schveiguy commented 4 years ago

It's a problem in the line counter. It has nothing to do with zip, which seems to be reading the data just fine. I had to do a LOT of refactoring to get it to work for @safe.

schveiguy commented 4 years ago

You are going to laugh when you see the mistake I made (and it took me a while to figure out why my unittests didn't catch it). PR incoming.

abrown25 commented 4 years ago

I would never laugh at the man who's making it possible for me to do my job! Thank you, it works perfectly now. Would you be interested in a pull request for the examples folder? A simple example of iterating over a gzipped text file (name taken from the arguments) may be useful to people processing files line by line?

schveiguy commented 4 years ago

> making it possible for me to do my job!

Great to know this project is helping you! Let me know if there's anything else that you find.

I did note that gzcat | wc -l works much faster than your sample. But I think that's more to do with zlib than with iopipe (see #14).

> A simple example of iterating over a gzipped text file (name taken from the arguments) may be useful to people processing files line by line?

Hm... I wonder if it would be enough to just add a parameter to the byline range example to indicate that the input is a zipped stream. It would also showcase how straightforward it is to construct pipes to handle many cases with one implementation.

Either way, I would be glad to merge something from you.
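
For reference, the idea discussed above (file name taken from the arguments, plus a parameter saying whether the input is zipped) might look roughly like the sketch below. It reuses only the iopipe calls from the original report (`bufd`, `unzip`, `assumeText`, `byLineRange`). The package name `countlines`, the helper `countLines`, and the `-z` flag are assumptions made up for this illustration, not the repository's actual byline example.

```d
#!/usr/bin/env dub
/+ dub.sdl:
name "countlines"
dependency "iopipe" version="~>0.2.0"
dependency "io" version="*"
+/

import std.getopt : getopt, defaultGetoptPrinter;
import std.range : walkLength;
import std.stdio : writeln;
import std.typecons : refCounted;
import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import std.io : File;

/// Hypothetical helper: counts the lines in `path`, inserting a gzip
/// decompression stage into the pipe only when `zipped` is true.
size_t countLines(string path, bool zipped)
{
    if (zipped)
        return File(path).refCounted.bufd
            .unzip(CompressionFormat.gzip)
            .assumeText.byLineRange!false.walkLength;

    return File(path).refCounted.bufd
        .assumeText.byLineRange!false.walkLength;
}

int main(string[] args)
{
    bool zipped = false;
    // The -z flag is an assumption for this sketch.
    auto opts = getopt(args, "z|zipped", "treat the input as gzip-compressed", &zipped);

    if (opts.helpWanted || args.length != 2)
    {
        defaultGetoptPrinter("Usage: countlines [-z] FILE", opts.options);
        return 1;
    }

    writeln(countLines(args[1], zipped));
    return 0;
}
```

The two branches differ only in the `unzip` stage, which is the point made above about handling many cases with one implementation. Each pipe is built as a single expression because the zipped and unzipped chains have different static types.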