itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License

chunk sizes hugely off with *.gz input #149

Open nmanik opened 11 years ago

nmanik commented 11 years ago

When *.fastq.gz files are used, RUM doesn't allocate the same number of reads to each chunk: instead of 5 million reads per chunk, I got only ~480K reads per chunk for every chunk except the last, which got all the remaining reads.

This code in parsefastq.pl works fine for .fastq input files, but not for .fastq.gz. As a temporary fix, I gunzip all input files before feeding them in.

# Count the lines in (at most) the first 10,000 lines of the input;
# wc's output is trimmed down to just the digits.
my $FL = `head -10000 $infile1 | wc -l`;
chomp($FL);
$FL =~ s/[^\d]//gs;

# Sample $FL lines from the start and the end of the file and use their
# combined length to estimate the average record size
# (2 * $FL lines at 4 lines per FASTQ record = $FL / 2 records).
my $s1 = `head -$FL $infile1`;
my $s2 = `tail -$FL $infile1`;
my $totalsize = length($s1) + length($s2);
my $recordsize = $totalsize / ($FL / 2);

# Estimate the total number of records from the on-disk file size, then
# split them evenly across chunks. With a .gz input, $filesize is the
# compressed size, so $numrecords comes out far too small.
my $numrecords = int($filesize / $recordsize);
my $numrecords_per_chunk = int($numrecords / $numchunks);
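For a sense of the scale of the error (assuming, purely for illustration, a typical ~10x gzip compression ratio for FASTQ): the per-record size estimated from the head/tail samples is fine, but $filesize is the compressed size, so $numrecords comes out roughly 10x too small, the intended ~5M reads per chunk shrinks to ~500K (consistent with the ~480K I saw), and everything left over ends up in the last chunk.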

On a tangential note, there is no need for the tail command above - it offsets any time savings from not having to read the whole file with head! Also, "head -count" is probably deprecated and could be replaced by "head -n count"; a sketch of both changes follows below.
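For illustration, the simplification suggested above might look like this (a hypothetical rewrite of the snippet, reusing its $infile1, $filesize and $numchunks; not code taken from RUM):

# Sample only the first lines of the file with "head -n"; $FL is the number
# of lines actually returned (handles files shorter than 10,000 lines).
my $FL = `head -n 10000 $infile1 | wc -l`;
chomp($FL);
$FL =~ s/[^\d]//gs;

# Estimate the average record size from that single sample:
# $FL lines / 4 lines per FASTQ record = $FL / 4 records in the sample.
my $sample     = `head -n $FL $infile1`;
my $recordsize = length($sample) / ($FL / 4);

my $numrecords           = int($filesize / $recordsize);
my $numrecords_per_chunk = int($numrecords / $numchunks);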

mdelaurentis commented 11 years ago

Ah, that's a silly thing I missed. Sorry about that. I need to use the size of the uncompressed file, not the compressed file, when determining how big to make each chunk. That will be fixed in the next release.
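For illustration, one way to get that uncompressed size without decompressing the whole file would be to read gzip's ISIZE trailer (a minimal sketch only; the uncompressed_size helper is hypothetical and not necessarily how the fix will land):

# Minimal sketch (not the shipped fix): gzip stores the uncompressed length
# modulo 2**32 in the last four bytes of the stream (ISIZE), so it can be
# read without decompressing. For inputs larger than 4 GB this wraps around;
# an exact but slower fallback is to stream the file, e.g. zcat file | wc -c.
sub uncompressed_size {
    my ($path) = @_;
    return -s $path unless $path =~ /\.gz$/;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    seek $fh, -4, 2;                  # 2 = SEEK_END
    read $fh, my $buf, 4;
    close $fh;
    return unpack 'V', $buf;          # unsigned 32-bit little-endian
}

my $filesize = uncompressed_size($infile1);   # instead of sizing the .gz itself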

You're right that the tail isn't really necessary, but I don't think it actually slows anything down appreciably. The tail command most likely seeks directly to some position near the end of the file. It probably doesn't actually read in the whole file.


nmanik commented 11 years ago

Thanks! And no worries - the code is well-written and organized, which makes it easier to trace the cause of any odd behavior. I have started RUM again now on a dataset of ~180M reads after gunzipping. I am keeping my fingers crossed and will keep you posted on how it goes.

nmanik commented 11 years ago

Hi Mike, just wanted to let you know that my runs completed successfully. I'm currently looking at the results. Thanks for the software and for your help!

mdelaurentis commented 11 years ago

That's great. Thanks for reporting the issues you found, and for your patience as we continue to work through some of them.
