nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

Request for promoter-capture Hi-C #440

Open adippolito12 opened 3 years ago

adippolito12 commented 3 years ago

Hey Nicholas,

Wonderful pipeline -- I appreciate the added Nextflow implementation.

I have a request/recommendation that may improve compatibility for promoter-capture Hi-C users: a utility script that converts the final valid pairs file to the Chicago input format (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908757/). The conversion is doable with some data munging, but a streamlined implementation would be super helpful for those who want to put together a HiC-Pro/Chicago workflow for interaction calling in PC-Hi-C.

Thanks! Tony

nservant commented 3 years ago

Hi Tony, Thanks for the suggestion. Could you share the definition of the Chicago input files, so that I can see how difficult it would be? Thanks

adippolito12 commented 3 years ago

Sure, Nicholas:

Chicago does have a script that will take a bam file and convert it to their input format, so it may suffice to just have a deduped, valid pairs file in bam format: https://bitbucket.org/chicagoTeam/chicago/src/master/chicagoTools/

Tony

nservant commented 3 years ago

Hi Tony, With the GET_PROCESS_SAM option, HiC-Pro outputs a BAM file in which each read carries a flag giving its pair type. So it should be simple to extract the reads flagged VI from this file and pass them to the Chicago script. Have you already tested that? Thanks

adippolito12 commented 3 years ago

Hey Nicholas,

I noticed that output option, but I wasn't sure if that included deduped pairs as well. Also, do you have a glossary of which flags correspond to which pair type?

Thanks!

nservant commented 3 years ago

Arfff yes, you're right, duplicates are not removed at this stage. The pair type is stored in the XA flag, with DE=dangling_end, VI=valid pairs, RE=religation and SC=self-circle.
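
For anyone who wants to try this, here is a minimal Python sketch of the filtering step, working on SAM text lines. It assumes the pair type sits in an optional field such as `XA:Z:VI` (the `Z` string type is an assumption, and the tag name is left as a parameter in case your HiC-Pro version uses a different one). The function names are made up for illustration. As noted above, duplicates are still present in this file, so deduplication would have to happen separately.

```python
# Pair-type codes as listed in the comment above.
PAIR_TYPES = {
    "DE": "dangling end",
    "VI": "valid pair",
    "RE": "religation",
    "SC": "self-circle",
}


def pair_type(sam_line, tag="XA"):
    """Return the value of the pair-type tag on one SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == tag:
            return parts[2]
    return None


def keep_pair_type(sam_lines, tag="XA", wanted="VI"):
    """Yield header lines plus the records whose tag equals `wanted`."""
    for line in sam_lines:
        if line.startswith("@") or pair_type(line, tag) == wanted:
            yield line
```

The generator could be fed the text output of `samtools view -h` and its result converted back to BAM, but that wiring is left out here.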

On the link you sent, I do not see any clear description of the .chinput file format. Do you have one somewhere?

adippolito12 commented 3 years ago

Apologies -- here's an example file: https://bitbucket.org/chicagoTeam/chicago/src/master/PCHiCdata/inst/extdata/GMchinputFiles/GM_rep1.chinput

It's a 5-column tsv (with a header line): bait fragment id, other-end fragment id, number of pairs supporting the interaction, length of the other-end fragment, and distance between the two fragments. The ids refer to ids specified in an rmap file and a baitmap file that are also used as inputs in the Chicago workflow. The rmap file is just an in silico digested genome with numerical ids assigned to fragments. Those same ids are used in the baitmap file that specifies the bait fragments.
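
To make the format concrete, here is a rough Python sketch of the aggregation, not a tested converter. It assumes the valid pairs have already been deduplicated and mapped to rmap fragment ids upstream. The column names in the header, the midpoint-based signed distance, the `NA` for pairs on different chromosomes, and listing bait-to-bait pairs from both ends are all my assumptions and should be checked against the example file and the Chicago tools.

```python
from collections import Counter


def chinput_rows(pairs, fragments, baits):
    """Aggregate fragment-level pairs into 5-column rows.

    pairs     : iterable of (frag_id, frag_id), one tuple per valid read pair
    fragments : dict frag_id -> (chrom, start, end), as in an rmap-like table
    baits     : set of frag_ids listed in a baitmap-like table
    """
    counts = Counter()
    for a, b in pairs:
        if a == b:
            continue  # both reads on one fragment: not an interaction
        if a in baits:
            counts[(a, b)] += 1
        if b in baits:
            counts[(b, a)] += 1  # assumption: bait-to-bait listed from both ends

    rows = []
    for (bait, other), n in sorted(counts.items()):
        b_chrom, b_start, b_end = fragments[bait]
        o_chrom, o_start, o_end = fragments[other]
        other_len = o_end - o_start
        if b_chrom == o_chrom:
            # Assumed convention: other-end midpoint minus bait midpoint.
            dist = (o_start + o_end) // 2 - (b_start + b_end) // 2
        else:
            dist = "NA"  # assumed placeholder for different chromosomes
        rows.append((bait, other, n, other_len, dist))
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with an assumed header line."""
    header = ["baitID", "otherEndID", "N", "otherEndLen", "distSign"]
    lines = ["\t".join(header)]
    lines += ["\t".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"
```

Pairs where neither fragment is a bait are dropped, since every row needs a bait fragment id in the first column.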