chunk HiC mapping step - Githubissues

phasegenomics / FALCON-Phase

FALCON-Phase integrates PacBio long-read assemblies with Phase Genomics Hi-C data to create phased, diploid, chromosome-scale scaffolds


chunk HiC mapping step #57

Closed iggyB closed 4 years ago

iggyB commented 5 years ago

Hi,

Nice job on Falcon-Phase!

Mapping HiC reads is a step that takes quite some time. This could be rather easily improved for those who run the pipeline on a cluster by splitting the HiC reads into chunks. There are a number of tools that can do it, but even a simple split does the job.

I did it by simply getting the read pair number and dividing it by njobs (or the number of available nodes) to get an approximate chunk size, then using that to split the reads (40,000,000 lines per chunk, i.e. 10M reads at 4 lines per FASTQ record):

    zcat reads.R1.fastq.gz | split -l 40000000 - reads.R1.10M
    zcat reads.R2.fastq.gz | split -l 40000000 - reads.R2.10M

After that, it was just a matter of submitting separate mapping jobs and merging the BAM files to get "aln.unfiltered.bam".

This can be easily integrated (or done in a neater way) into your workflow, especially the pb-assembly branch. The machinery is there, and you even use it when generating the "haplotig.placement" file by running multiple mummer jobs.
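The chunk-size arithmetic described above could be sketched roughly as follows. This is only an illustration, not part of FALCON-Phase: the function names `chunk_lines` and `split_fastq_gz` and the output prefix are hypothetical. The key point is that the line count passed to `split -l` has to be a multiple of 4, so that no FASTQ record is cut in half and R1/R2 chunks stay paired.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and prefixes are assumptions for illustration.

# chunk_lines READ_PAIRS NJOBS
# Prints the number of FASTQ lines per chunk. Reads per chunk are rounded up
# so that at most NJOBS chunks are produced; each read is 4 lines.
chunk_lines() {
    local pairs=$1 njobs=$2
    local reads_per_chunk=$(( (pairs + njobs - 1) / njobs ))
    echo $(( reads_per_chunk * 4 ))
}

# split_fastq_gz FASTQ_GZ LINES PREFIX
# Decompresses FASTQ_GZ and writes chunks of LINES lines named PREFIXaa,
# PREFIXab, ... (the default suffix scheme of split).
split_fastq_gz() {
    local fastq_gz=$1 lines=$2 prefix=$3
    gzip -dc "$fastq_gz" | split -l "$lines" - "$prefix"
}
```

For example, with 100,000,000 read pairs and 10 jobs, `chunk_lines 100000000 10` gives 40000000, which matches the value used in the commands above. Running `split_fastq_gz` with the same line count on both R1 and R2 keeps mates in chunks with matching suffixes.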
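For anyone wanting to try the submit-and-merge part by hand, here is a minimal dry-run sketch. It only prints commands, so they can be pasted into whatever scheduler is in use. The helper names are hypothetical, and the `bwa mem` call shows only the `-t` thread option: the exact flags and filtering that FALCON-Phase applies when mapping are not reproduced here and would need to be copied from the pipeline. It also assumes the per-chunk BAMs are unsorted, so that `samtools cat` (plain concatenation, header taken from the first file) is an acceptable way to rebuild "aln.unfiltered.bam".

```bash
#!/usr/bin/env bash
# Hypothetical helpers; nothing here is executed against real data.

# emit_mapping_cmds REF R1_PREFIX R2_PREFIX THREADS
# Prints one mapping command per chunk pair, matching R1 and R2 chunks by
# their split suffix (aa, ab, ...).
emit_mapping_cmds() {
    local ref=$1 r1_prefix=$2 r2_prefix=$3 threads=$4
    local r1 r2 suffix
    for r1 in "$r1_prefix"*; do
        [ -e "$r1" ] || continue          # glob matched nothing
        suffix=${r1#"$r1_prefix"}
        r2=$r2_prefix$suffix
        echo "bwa mem -t $threads $ref $r1 $r2 | samtools view -b -o chunk.$suffix.bam -"
    done
}

# emit_merge_cmd OUT_BAM CHUNK_BAM...
# Prints the command that concatenates the per-chunk BAMs in the given order.
emit_merge_cmd() {
    local out=$1
    shift
    echo "samtools cat -o $out $*"
}
```

Each printed mapping line can be submitted as its own cluster job; once all jobs finish, the merge command is run a single time over the chunk BAMs.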

Cheers, Iggy

zeeev commented 5 years ago

Hi @iggyB,

Thank you for the kind words, we are happy people are using the code! First, let me offer up the simplest solution to long mapping times: increase the number of cores available to bwa-mem (in the config.json). That being said, your solution is optimal. I've tagged this issue as an enhancement. My development time is limited, but I'll see what I can do to get this implemented. If you end up coding the method, I'd be happy to take a pull request.

Best,

Zev

iggyB commented 5 years ago

Hi @zeeev,

It's a big step towards (nearly) fully phased polyploid assemblies. Perhaps not that many projects possess the required data sets, but with the technology and methods spreading, I'm sure more and more people will be interested in running Falcon-Phase.

I did change the core number; that was of course the obvious thing to do :) But I decided to create the alignment outside the pipeline anyway. One more suggestion: different steps in the pipeline should be configurable to use different amounts of resources (like in Falcon/Falcon-Unzip).

Time is an issue. I'll be happy to share the code if I get it done before you do :)

Cheers, Iggy

shawnpg commented 4 years ago

We haven't gotten to this and it hasn't come up again, so we will close this, as we do not currently plan to implement chunking natively (which would be required in some fashion to parallelize the other major time sink, the phase command). If there are any other requests for this feature, please reactivate this issue.