statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
87 stars 29 forks source link

bam recab: parallel processing #21

Open andreas-wilm opened 8 years ago

andreas-wilm commented 8 years ago

Hello there and thanks for bamUtil.

I have an enhancement request: when using bamUtil for massaging of BAM files one real bottleneck seems to be the recalibration step (bam recab), because it cannot be run on multiple cores. At least conceptually it looks like it should be straightforward to collect the required information per region (running the analysis in parallel for many regions) and then aggregate at the end to give the final table. Applying that table can be a single core process of course.

Are there any plans to implement something along these lines? Or is something similar already implemented somewhere and I missed it in the documentation?

Thanks Andreas

mktrost commented 8 years ago

Unfortunately, this feature is not yet implemented. It is an enhancement that we would like to make. We appreciate the enhancement request. Unfortunately I don't yet have a timeline for implementing this feature.

andreas-wilm commented 8 years ago

Maybe if you sketched out the overall procedure someone could volunteer (including me)? A couple of pointer to resp. functions and how collected results per region/chromosome could be merged into a final table would help, I guess.

mktrost commented 8 years ago

Help would be great. My hours are limited right now.

Here is some info on areas that will need updating/some ideas on how to do it.

I am likely missing some details, but this is my initial stab at where I would start and the updates that I know are necessary. I'm open to suggestions/recommendations/improvements/questions. Let me know if I missed anything or if it is unclear as to which class/function does a specific task.

You can either multi-thread the Build Table step or leave it as single thread and run multiple instances of Recab, one for each region.

Building the Table

The Recalibration table is built from line 249 to 293. The SamFile is opened for reading at line 249: https://github.com/statgen/bamUtil/blob/master/src/Recab.cpp#L249

After opening the file (line 249) and prior to looping through to read the file (line 268), you will need to set the region to be read using: samIn.SetReadSection(regChr, regStart, regEnd); https://github.com/statgen/libStatGen/blob/master/bam/SamFile.h#L278 or if regions are limited to an entire chromosome, you can use: samIn.SetReadSection(regChr); https://github.com/statgen/libStatGen/blob/master/bam/SamFile.h#L246

If your region contains start/end positions and is not an entire chromosome, some reads will cross regions, so will be processed twice. You don't want them to be counted multiple times, when generating the table, so processReadBuildTable will need to be updated to limit it to only process positions within the region being processed. To do this, you can check the reference position that is calculated at line 566 and "continue" (skip processing that position) if it is out of the region. https://github.com/statgen/bamUtil/blob/master/src/Recab.cpp#L566

If you decide to implement as multi-threaded, you will need to update processReadBuildTable to not use statics and make HashErrorModel::setCell thread safe such that multiple threads could call it or update the code to store a hashErrorModel for each thread and merge. HashErrorModel is the class used to store and write the table. For the multi-threaded solution, you could probably run modelFitPrediction after the threads have all completed on a single/merged hashErrorModel.

If you run single threaded with multiple calls to Recab, you will need to update modelFitPrediction to write different files for each region (maybe just append the region to the outBase as you pass it in). You will also need to make sure that the writing to the Logger will work with the multiple regions and not overwrite or collide with each other.

Applying the Table

Really the only change to this would be if you had to first merge multiple tables/hashErrorModels. You would just need to do that prior to line 300. HashErrorModel is the class used to store and write the table. https://github.com/statgen/bamUtil/blob/master/src/HashErrorModel.h

If running multiple Recabs to generate the tables, they will need to be merged. How you pass the names of multiple files into apply table is up for discussion. Then you could just read the files, parsing them, and updating the tables in HashErrorModel.

Parameters

Make your parameter updates to Recab::usage() and the parameter additions in Recab::execute. Do not make them to the "RecabSpecificParameter" sections since those parameters are active when a user runs recalibration through the Deduper. Region processing for the deduper is more complicated and can be added afterwards.

If you are doing multi-threads, add the number of threads as an optional parameter.

If you are doing region specific build tables, you will need to add parameters for the region, to tell the deduper to just build the table, and to tell the deduper to just apply a table. Validation should occur to make sure the region parameter is only specified if "build table only" is also specified.

Multi-thread a single Recab Exe:

Multi-thread a single Recab exe so it has multiple threads to generate the recalibration table, then automatically merge the tables and apply as normal.

Run Multiple Single-Threaded Recab Exes:

Step 1) PARALLEL STEP: Multiple Recab exes, one for each region, to generate the table for each region Step 2) One Recab exe to merge all of the region specific tables into a single table Step 3) One Recab exe to apply the single table

Steps 2 & 3 could be done in the same call to recab. Required Recab updates: 1) ability to only generate a table 2) ability to limit to a specific region 3) ability merge recalibration tables - new logic, but could just read the multiple files and add to the internal recab table structure in memory 4) ability to just apply the table These updates are all potentially useful features for Recab regardless of parallel execution.

andreas-wilm commented 8 years ago

Wow! Just wow. As soon as this becomes a pressing issue for us, I'll give this a go! Should hopefully be easy with such detailed info.