itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License

Perhaps we can farm out the work of splitting up the input file #11

Open mdelaurentis opened 12 years ago

mdelaurentis commented 12 years ago

This is just an idea for a possible future enhancement. It looks like we're currently splitting up the input reads files in a single process, and it seems like this takes a couple of hours for a large job. I'm not completely sure if this would work, but I wonder if we could instead break it up as follows:

  1. Estimate the size of each chunk by dividing the size of the input file by the number of chunks.
  2. Come up with an approximate region of the input file(s) for each worker to process, defined as a byte offset into the file and a length in bytes.
  3. Submit a qsub job for each chunk that includes the file to split and the approximate offset and length to work on.
  4. Each worker carves out its own section of the input file to work on. The byte offsets we calculated in step 2 will be approximate, so we would have to make each worker seek to the offset in the file and then backtrack until it finds the start of a read (see the sketch at the end of this comment).

I suppose we would need to make sure that corresponding forward and reverse reads make it into the same chunk. I'm not sure exactly how that would work. If the forward read and reverse read start at the same byte in their input files, then this would be easy.

We could also consider starting to process each chunk as it is created, rather than waiting until the whole file is split up to start processing any of the chunks.
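To make the offset idea concrete, here is a minimal sketch (not RUM code), assuming a FASTA-style reads file in which every record is a `>` header line followed by one sequence line. Instead of backtracking, it scans forward from the approximate offset to the next record start, which produces the same split as long as every worker uses the same convention; and since the forward and reverse files would share record offsets, the same (offset, length) pair could simply be applied to both files.

```python
import os

def chunk_regions(path, n_chunks):
    """Approximate (offset, length) region for each worker (steps 1 and 2)."""
    size = os.path.getsize(path)
    approx = size // n_chunks
    return [(i * approx, approx if i < n_chunks - 1 else size - i * approx)
            for i in range(n_chunks)]

def first_record_at_or_after(fh, offset):
    """Byte position of the first '>' header line starting at or after offset."""
    if offset == 0:
        return 0
    fh.seek(offset - 1)
    fh.readline()                      # finish the line containing byte offset-1
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line or line.startswith(b'>'):
            return pos

def carve_chunk(path, offset, length):
    """Yield (header, sequence) for every record whose header starts in
    [offset, offset + length); records starting later belong to the next worker."""
    end = offset + length
    with open(path, 'rb') as fh:
        fh.seek(first_record_at_or_after(fh, offset))
        while True:
            pos = fh.tell()
            header = fh.readline()
            if not header or pos >= end:
                break
            yield header, fh.readline()
```

Each qsub job would then call something like `carve_chunk(reads_path, offset, length)` on the region it was handed, for both the forward and reverse files.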

greggrant commented 12 years ago

Good ideas! I believe it already uses file size to estimate chunk size. It used to do a "wc -l", but that alone took hours on these enormous files, so the awkwardness of the current code reflects the need to pre-process with only one pass through the input file(s) (the earliest versions of RUM did four passes).

The forward and reverse reads are at the same byte offset; however, another change I have in mind is to allow for different-length forward and reverse reads, which would break that. But it can have multiple strategies and revert to the old one if the new one won't work on a particular data set. That's what it does now if the reads have variable lengths: in that case it reverts to the version that does multiple passes to assess and break them up.

But I like the idea of giving each chunk just a byte offset and having it carve out what it needs from the input file(s), so that each chunk can start right away without waiting hours for just one node to break it up. That sounds like a relatively easy modification. But we should not hold up v1.11 for that; we can do it for v1.12, eh?

Thanks, Greg
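As a rough illustration of the multiple-strategies point (not how RUM actually decides), a chooser could probe the first few records and fall back to the old multi-pass splitter whenever read lengths vary or the paired files don't line up byte-for-byte. The helper names here are hypothetical.

```python
import os

def reads_look_fixed_length(path, n_probe=1000):
    """Probe the first few records of a 2-line-per-record FASTA-style file
    and report whether every sequence line has the same length.
    Hypothetical helper, not RUM's actual check."""
    lengths = set()
    with open(path, 'rb') as fh:
        for i, line in enumerate(fh):
            if i >= 2 * n_probe:
                break
            if i % 2 == 1:                      # sequence lines
                lengths.add(len(line.rstrip(b'\n')))
    return len(lengths) <= 1

def choose_split_strategy(fwd_path, rev_path=None):
    """Use the fast byte-offset carve only when the layout supports it;
    otherwise revert to the older multi-pass splitter."""
    fixed = reads_look_fixed_length(fwd_path) and \
            (rev_path is None or reads_look_fixed_length(rev_path))
    aligned = rev_path is None or \
              os.path.getsize(fwd_path) == os.path.getsize(rev_path)
    return "byte_offset_carve" if (fixed and aligned) else "multi_pass_split"
```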


mdelaurentis commented 12 years ago

Thanks for the feedback. I'm definitely not trying to squeeze this into 1.11; I just wanted to mention it so I don't forget it, and maybe it will be worth doing in 1.12 or something.


mdelaurentis commented 12 years ago

Or as Greg suggests:

Since it can take a long time to split the files, it might be good to qsub each chunk as its files are finished, rather than waiting till all the splitting is done and qsubbing everything at once.
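A minimal sketch of the submit-as-you-go idea, assuming a hypothetical make_chunk_files helper and a per-chunk job script rum_chunk.sh (neither is RUM's real interface): each chunk is handed to qsub the moment its files exist, so early chunks can start aligning while later ones are still being written.

```python
import subprocess

def split_and_submit(n_chunks, make_chunk_files, job_script="rum_chunk.sh"):
    """Submit each chunk's qsub job as soon as its files are written,
    rather than qsubbing everything after the whole split finishes."""
    for i in range(n_chunks):
        fwd_chunk, rev_chunk = make_chunk_files(i)     # writes chunk i to disk
        # Hand the chunk's files to the (hypothetical) per-chunk job script
        # via the scheduler's environment; `qsub -v` works on PBS/SGE.
        subprocess.run(
            ["qsub", "-v", f"FWD={fwd_chunk},REV={rev_chunk}", job_script],
            check=True,
        )
```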