ijuric / MAPS

18 stars 11 forks source link

Possible takes output from a HiC-Pro run for MAPS? #3

Open jphe opened 5 years ago

jphe commented 5 years ago

Hi,

I'm trying to use MAPS for my HiChIP data, and I have HiCPro output. I'm wondering dose it possible to use HiCPro output for MAPS? which is compatible with many other tools.

Thanks, Jphe

ijuric commented 5 years ago

Maybe. It's not part of MAPS pipeline, so you'll have to do some steps manually.

If my understanding of HiCPro is correct, it outputs *.*allValidPairs file which is in something similar to .bedpe format. I don't know what exact filtering HiCPro does. I am just assuming that .validPairs file contains both short reads (pair ends are within 1kb) and long reads (2 ends are more than 1kb away). I'm also assuming that chromosomes in .allValidPairs are called "chr1, chr2, chr3...", rather that "1, 2,3..." or something else. If chromosome names are different, you need to rename them to "chr1, chr2, ..."

step 1) Split .allValidPairs file into short (read ends less than 1kb appart) and long files, one for each chromosome (so, in the end you'll need to have 2*N files, 2 files per chromosome). Each of your short files need to be named : [DATASET]. chr[CHROMOSOME].shrt.vip.bed format is (tab delimited) chr\tstart\t\end\tother_stuff. (other stuff is just additional columns we don't use, and you don't need to have them) For example:

chr9 100472222 100472280 K00168:173:HKYTGBBXX:7:1101:1144:2809

chr17 69252083 69252964 K00168:173:HKYTGBBXX:7:1101:1144:8154 so, if your dataset is called ds, then your names would be ds.chr1.shrt.vip.bed, ds.chr2.shrt.vip.bed, ... You should have 2 coordinates for short reads in .allValidPairs, whicle shrt.vip.bed files contain one coordinate. You can use the midpoint of two start positions from .allValidPairs file.

Each of your long files needs to be named: [DATASET].chr[CHROMOSOME] .long.intra.bedpe format here is (tab delimited) chr1\tstart1\t\end1\tchr2\tstart2\t\end2\t other_stuff (other stuff is just additional columns we don't use, and you don't need to have them): For example:

chr17 44560182 44560233 chr17 44863639 44863690 K00168:173:HKYTGBBXX:7:1101:1144:5798 60 + -

so, if your dataset is called ds, then your names would be ds.chr1 .long.intra.bedpe, ds.chr2.long.intra.bedpe, ...

step 2) in your outdir ( dir specified in run_pipeline.sh script under outdir), create folders called feather_output/[DATASET] and MAPS_output/[DATASET] (so feather_output/ds and MAPS_output/ds ) and move long and short files in feather_output/[DATASET] dir

step 3) in run_pipeline.sh script, set feather=0 and run it.

That should work, but I haven't tried it. Send me an email ( ivan.juric.gen@gmail.com) if you get errors (I reply to those faster, coz sometimes I don't get notifications about posts on github)

Few additional comments:

Hope this helps, Ivan

On Wed, Dec 12, 2018 at 6:49 AM jphe notifications@github.com wrote:

Hi,

I'm trying to use MAPS for my HiChIP data, and I have HiCPro output. I'm wondering dose it possible to use HiCPro output for MAPS? which is compatible with many other tools.

Thanks, Jphe

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ijuric/MAPS/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/ALPpZcpaWztj2uOtMgp5kMLWtgL0F1rVks5u4O0ygaJpZM4ZPSFh .

gouthamatla commented 5 years ago

The format of HiC-Pro validpairs: read_name chr_reads1 pos_reads1 strand_reads1 chr_reads2 pos_reads2 strand_reads2 Otherstuff

D00733:341:CCHYUANXX:1:2315:16620:90410 chr1    13512   -   chr1    45625736    +   199 HIC_chr1_8  HIC_chr1_127410 31  42  
D00733:335:CCCTEANXX:8:1206:4664:49125  chr1    13531   -   chr1    714897  -   232 HIC_chr1_8  HIC_chr1_1395   31  30  
D00733:341:CCHYUANXX:1:2114:8240:60547  chr1    13553   -   chr1    6052645 +   296 HIC_chr1_8  HIC_chr1_13399  30  42  
D00733:341:CCHYUANXX:1:1206:15851:59665 chr1    13825   +   chr1    77225986    +   185 HIC_chr1_10 HIC_chr1_206055 31  42  
D00733:335:CCCTEANXX:8:2110:15873:56099 chr1    14767   +   chr1    111149270   +   310 HIC_chr1_12 HIC_chr1_285570 30  42  

so to create shrt.vip.bed, I could just take column 2 3 6 other stuff to make shrt.vip.bed file if distance between 3 and 6th column is less than 1kb.

I did not understand how to create long.intra.bedpe. As mentioned above, it requires chr1 start1 end1 chr2 start2 end2. I am wondering what is the end1 and end2 ? from HiC pro, all we have is the start and end position of one read pair. how should I create long.intra.bedpe` from this ?

jphe commented 5 years ago

Hi, Ivan.

Thanks for your reply.

Actually in the HiC-Pro output validpairs, it has no PETs that distances less than 1kb as it has already filtered by default. One thing I’m not clear is, why MAPS need the reads that less than 1kb distances, as it self-ligation that need to be removed for usual HiC pipelines. For the PETs that distance less than 1kb it was stored in the DumpPairs, DEPairs REPairs, SCPairs and SinglePairs( some of them maybe empty depends on your settings). So I merged all HiCPro output Pairs into one like the format you mentioned.

For HiCPro, it only record the start site of the reads, so for the end1 and end2, I just add the sequence length (for mine is 150bp) from the start site, just like the code bellow.

array=(chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY) for k in ${array[@]} do for f in ../raw.bed do bf=basename $f tt=echo $bf| sed -r 's/.bed//g' if ! [ -f $tt.$k.bedPE ] then echo $tt.$k cat $f |awk -v chr=$k '{OFS=FS="\t"}{ if ($2==chr && $5==chr && $6-$4<1000 ) print $2,$3,$6}' | sort -k1,1 -k2,2n | uniq > $out.shrt.vip.bed # awk '!x[$0]++' cat $f |awk -v chr=$k '{OFS=FS="\t"}{ if ($2==chr && $5==chr && $6-$3>=1000 ) print $2,$3,$3+150,$5,$6,$6+150 }' | sort -k1,1 -k2,2n | uniq > $out.long.intra.bedpe fi done done

Jiangping He

在 2018年12月24日,下午8:29,Goutham <notifications@github.com mailto:notifications@github.com> 写道:

The format of HiC-Pro validpairs: read_name chr_reads1 pos_reads1 strand_reads1 chr_reads2 pos_reads2 strand_reads2 Otherstuff

D00733:341:CCHYUANXX:1:2315:16620:90410 chr1 13512 - chr1 45625736 + 199 HIC_chr1_8 HIC_chr1_127410 31 42
D00733:335:CCCTEANXX:8:1206:4664:49125 chr1 13531 - chr1 714897 - 232 HIC_chr1_8 HIC_chr1_1395 31 30
D00733:341:CCHYUANXX:1:2114:8240:60547 chr1 13553 - chr1 6052645 + 296 HIC_chr1_8 HIC_chr1_13399 30 42
D00733:341:CCHYUANXX:1:1206:15851:59665 chr1 13825 + chr1 77225986 + 185 HIC_chr1_10 HIC_chr1_206055 31 42
D00733:335:CCCTEANXX:8:2110:15873:56099 chr1 14767 + chr1 111149270 + 310 HIC_chr1_12 HIC_chr1_285570 30 42

so to create shrt.vip.bed, I could just take column 2 3 6 other stuff to make shrt.vip.bed file if distance between 3 and 6th column is less than 1kb.

I did not understand how to create long.intra.bedpe. As mentioned above, it requires chr1 start1 end1 chr2 start2 end2. I am wondering what is the end1 and end2 ? from HiC pro, all we have is the start and end position of one read pair. how should I create long.intra.bedpe` from this ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ijuric/MAPS/issues/3#issuecomment-449728411, or mute the thread https://github.com/notifications/unsubscribe-auth/AhX2F1Ind8QtZ_7ze8RI7u1JSHzCu12-ks5u8MiegaJpZM4ZPSFh.

何江平(Jiangping He) Guangzhou Institutes of Biomedicine and Health(GIBH),CAS Guangzhou, Guangdong, China Email: he_jiangping@gibh.ac.cn

ijuric commented 5 years ago

Hi Jiangping, Sorry for late reply, I was out for holidays.

in plac-seq/hichip experiment we have IP pulldown step, and there could be biases associated with that. Hi-C does not have that problem. Some genomic regions might be easier to pull down and some harder. We think that short range reads are a good way to measure that bias. Short range reads are analogous to chip-seq reads (so one dimensional), and as such also contain some information about which genomic region is more accessible than other. We use the number of short range reads as an input is our regression in MAPS so we can calculate the expected value. In fact, compared to other sources of bias like mappability and gc, this IP accessibility (we call it IP efficiency in our paper) is much larger source of bias.

hope this helps, Ivan

On Tue, Dec 25, 2018 at 12:23 AM jphe notifications@github.com wrote:

Hi, Ivan.

Thanks for your reply.

Actually in the HiC-Pro output validpairs, it has no PETs that distances less than 1kb as it has already filtered by default. One thing I’m not clear is, why MAPS need the reads that less than 1kb distances, as it self-ligation that need to be removed for usual HiC pipelines. For the PETs that distance less than 1kb it was stored in the DumpPairs, DEPairs REPairs, SCPairs and SinglePairs( some of them maybe empty depends on your settings). So I merged all HiCPro output Pairs into one like the format you mentioned.

For HiCPro, it only record the start site of the reads, so for the end1 and end2, I just add the sequence length (for mine is 150bp) from the start site, just like the code bellow.

array=(chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY) for k in ${array[@]} do for f in ../raw.bed do bf=basename $f tt=echo $bf| sed -r 's/.bed//g' if ! [ -f $tt.$k.bedPE ] then echo $tt.$k cat $f |awk -v chr=$k '{OFS=FS="\t"}{ if ($2==chr && $5==chr && $6-$4<1000 ) print $2,$3,$6}' | sort -k1,1 -k2,2n | uniq > $out.shrt.vip.bed # awk '!x[$0]++' cat $f |awk -v chr=$k '{OFS=FS="\t"}{ if ($2==chr && $5==chr && $6-$3>=1000 ) print $2,$3,$3+150,$5,$6,$6+150 }' | sort -k1,1 -k2,2n | uniq

$out.long.intra.bedpe fi done done

Jiangping He

在 2018年12月24日,下午8:29,Goutham <notifications@github.com <mailto: notifications@github.com>> 写道:

The format of HiC-Pro validpairs: read_name chr_reads1 pos_reads1 strand_reads1 chr_reads2 pos_reads2 strand_reads2 Otherstuff

D00733:341:CCHYUANXX:1:2315:16620:90410 chr1 13512 - chr1 45625736 + 199 HIC_chr1_8 HIC_chr1_127410 31 42 D00733:335:CCCTEANXX:8:1206:4664:49125 chr1 13531 - chr1 714897 - 232 HIC_chr1_8 HIC_chr1_1395 31 30 D00733:341:CCHYUANXX:1:2114:8240:60547 chr1 13553 - chr1 6052645 + 296 HIC_chr1_8 HIC_chr1_13399 30 42 D00733:341:CCHYUANXX:1:1206:15851:59665 chr1 13825 + chr1 77225986 + 185 HIC_chr1_10 HIC_chr1_206055 31 42 D00733:335:CCCTEANXX:8:2110:15873:56099 chr1 14767 + chr1 111149270 + 310 HIC_chr1_12 HIC_chr1_285570 30 42

so to create shrt.vip.bed, I could just take column 2 3 6 other stuff to make shrt.vip.bed file if distance between 3 and 6th column is less than 1kb.

I did not understand how to create long.intra.bedpe. As mentioned above, it requires chr1 start1 end1 chr2 start2 end2. I am wondering what is the end1 and end2 ? from HiC pro, all we have is the start and end position of one read pair. how should I create long.intra.bedpe` from this ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/ijuric/MAPS/issues/3#issuecomment-449728411>, or mute the thread < https://github.com/notifications/unsubscribe-auth/AhX2F1Ind8QtZ_7ze8RI7u1JSHzCu12-ks5u8MiegaJpZM4ZPSFh .

何江平(Jiangping He) Guangzhou Institutes of Biomedicine and Health(GIBH),CAS Guangzhou, Guangdong, China Email: he_jiangping@gibh.ac.cn

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ijuric/MAPS/issues/3#issuecomment-449803640, or mute the thread https://github.com/notifications/unsubscribe-auth/ALPpZS3AkdHfovUjOWbd_i5PTTjJdechks5u8bZhgaJpZM4ZPSFh .