nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Other
383 stars 183 forks source link

hicpro2juicebox.sh #49

Closed Krannich479 closed 8 years ago

Krannich479 commented 8 years ago

Hej, I am highly interested to use my HiC-Pro output to parse it into juicebox. Since I am using a custom genome of my department, it is none of the default "hg18, hg19, hg38, dMel, mm9, mm10, anasPlat1, bTaurus3, canFam3, equCab2, galGal4, Pf3D7, sacCer3, sCerS288c, susScr3, or TAIR10" of the hicpro2juicebox.sh. So my question is: where does the -g GENOME get sourced from, what exactly is the genome (fasta/ bw2 index/ .sizes?) and therefore can I create a way/workaround to get my output into the script?

Thanks, TK

nservant commented 8 years ago

Hi, Good point ! According to a recent post in Juicebox forum, it should be possible if you give it a chromosome size information.

"To use a genome that hasn't yet been included in the standard distribution, simply use UCSC's chrom.sizes file in place of the genome ID. See the previous thread: "

https://groups.google.com/forum/embed/?place=forum/3d-genomics&showsearch=true&showpopout=true&parenturl=http%3A%2F%2Faidenlab.org%2Fforum.html#!topic/3d-genomics/uH0uOw6gV_c https://groups.google.com/forum/embed/?place=forum/3d-genomics&showsearch=true&showpopout=true&parenturl=http%3A%2F%2Faidenlab.org%2Fforum.html#%21topic/3d-genomics/uH0uOw6gV_c

I also have to test that on my side, and to update the hicpro2juicebox script accordingly. I can try to send you an update of the script asap in order to try it. Otherwise, if you want to move forward, you can simply look at the hicpro2juicebox script. Actually, this script works in two steps, 1/ generate the input for Juicebox 2/ run Juicebox to build the .hic file which could be loaded into juicebox. Simply comment line 147 which remove the input files (step 1), and try to run juicebox in command line (line 140 - 145)

I'll come back to you Nicolas

Le 23/06/2016 10:13, Thomas Krannich a écrit :

Hej, I am highly interested to use my HiC-Pro output to parse it into juicebox. Since I am using a custom genome of my department, it is none of the default "hg18, hg19, hg38, dMel, mm9, mm10, anasPlat1, bTaurus3, canFam3, equCab2, galGal4, Pf3D7, sacCer3, sCerS288c, susScr3, or TAIR10" of the hicpro2juicebox.sh. So my question is: where does the -g GENOME get sourced from, what exactly is the genome (fasta/ bw2 index?) and therefore can I create a way/workaround to get my output into the script?

Thanks, TK

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49, or mute the thread https://github.com/notifications/unsubscribe/AKR2a23__SlhJZcPJ0UYazi25pTK37nGks5qOkAZgaJpZM4I8jpm.

Nicolas Servant Bioinformatics & Biostatistics Analysis Team

Institut Curie, Plateforme de Bioinformatique Unité 900 : Institut Curie - Inserm - Mines ParisTech 26, rue d'Ulm - 75248 Paris Cedex 05 - FRANCE Email: Nicolas.Servant@curie.fr Tel: 01 56 24 69 85

Krannich479 commented 8 years ago

Hej, thanks for your detailed reply. Meanwhile I did the following: I added the myGenome.chrom.sizes into ./juicebox/src/juicebox/tools/chrom/sizes/ ./juicebox/out/production/Juicebox/juicebox/tools/chrom/sizes/ (both, because I just looked for the folder with the default genomes and have no clue which one was sufficient) and re-build the juicebox JARs. Since the build was successful and the initial error got removed, I guess I found a generic solution. Nevertheless I still got stuck with some parsing errors:

./hicpro2juicebox.sh -i /path/rawdata_1000000_iced.matrix -g myGenome -j /path/Juicebox_clt_jar/Juicebox.jar -t /path/tmp/ -o /path/
Generating Juicebox input files ...
Running Juicebox ...
WARNING: Not including fragment map
Start preprocess
Writing header
Writing body
java.io.IOException: Unexpected column count.  Only 11 or 16 columns supported.  Check file format
    at juicebox.tools.utils.original.AsciiPairIterator.advance(AsciiPairIterator.java:105)
    at juicebox.tools.utils.original.AsciiPairIterator.<init>(AsciiPairIterator.java:70)
    at juicebox.tools.utils.original.Preprocessor.computeWholeGenomeMatrix(Preprocessor.java:417)
    at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:301)
    at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213)
    at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94)
    at juicebox.tools.HiCTools.main(HiCTools.java:78)
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1387)
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1158)
    at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:572)
    at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:303)
    at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213)
    at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94)
    at juicebox.tools.HiCTools.main(HiCTools.java:78)
done !

Can you help me out? What if I miss the -r parameter? I never specified any restriction fragment file... does HiC-Pro returns/produces some default?

Thanks!

nservant commented 8 years ago

Could comment the "rm" line in the script ? This should generate a tmp folder with the file which is submitted to Juicebox. Could you send me the first 10 lines of this file ? Thanks

Le 23/06/2016 11:48, Thomas Krannich a écrit :

Hej, thanks for your detailed reply. Meanwhile I did the following: I added the myGenome.chrom.sizes into ./juicebox/src/juicebox/tools/chrom/sizes/ ./juicebox/out/production/Juicebox/juicebox/tools/chrom/sizes/ (both, because I just looked for the folder with the default genomes and have no clue with one was sufficieny) and re-build the juicebox JARs. Since the build was successful and the initial error got removed, I guess I found a generic solution. Nevertheless I still got stuck with some parsing errors:

|./hicpro2juicebox.sh -i /path/rawdata_1000000_iced.matrix -g myGenome -j /path/Juicebox_clt_jar/Juicebox.jar -t /path/tmp/ -o /path/ Generating Juicebox input files ... Running Juicebox ... WARNING: Not including fragment map Start preprocess Writing header Writing body java.io.IOException: Unexpected column count. Only 11 or 16 columns supported. Check file format at juicebox.tools.utils.original.AsciiPairIterator.advance(AsciiPairIterator.java:105) at juicebox.tools.utils.original.AsciiPairIterator.(AsciiPairIterator.java:70) at juicebox.tools.utils.original.Preprocessor.computeWholeGenomeMatrix(Preprocessor.java:417) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:301) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94) at juicebox.tools.HiCTools.main(HiCTools.java:78) java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1387) at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1158) at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:572) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:303) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:213) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:94) at juicebox.tools.HiCTools.main(HiCTools.java:78) done ! |

Can you help me out? What if I miss the -r parameter? I never specified any restriction fragment file... does HiC-Pro returns/produces some default?

Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49#issuecomment-228002057, or mute the thread https://github.com/notifications/unsubscribe/AKR2a2th8wEDmREHYE8ObdEFTCASTy-sks5qOlZYgaJpZM4I8jpm.

Nicolas Servant Bioinformatics & Biostatistics Analysis Team

Institut Curie, Plateforme de Bioinformatique Unité 900 : Institut Curie - Inserm - Mines ParisTech 26, rue d'Ulm - 75248 Paris Cedex 05 - FRANCE Email: Nicolas.Servant@curie.fr Tel: 01 56 24 69 85

Krannich479 commented 8 years ago

1 1 0 1 1 0.000000 1
1 1 0 1 10 0.000000 1
1 1 0 1 1000 0.000000 1
1 1 0 1 1002 0.000000 1
1 1 0 1 10044 0.000000 1
1 1 0 1 1007 0.000000 1
1 1 0 1 1008 0.000000 1
1 1 0 1 1009 0.000000 1
1 1 0 1 1011 0.000000 1
1 1 0 1 1014 0.000000 1

from the /tmp folder, file: head 15898_allValidPairs.pre_juicebox_sorted It is missing some columns according to the juicebox default: `

`
nservant commented 8 years ago

There is something wrong indeed. You should get something like ;

SRR400264.100043 1 1 73312058 0 0 1 105973177 1 42 42 SRR400264.100055 0 1 105829190 0 1 1 106032267 1 42 42 SRR400264.100100 0 1 163617012 0 1 1 163633673 1 42 42 SRR400264.100167 0 1 65359270 0 0 1 66674080 1 40 42

Could you show me the head of the .allValidPairs file. Thanks

Le 23/06/2016 12:20, Thomas Krannich a écrit :

1 1 0 1 1 0.000000 1

1 1 0 1 10 0.000000 1

1 1 0 1 1000 0.000000 1

1 1 0 1 1002 0.000000 1

1 1 0 1 10044 0.000000 1

1 1 0 1 1007 0.000000 1

1 1 0 1 1008 0.000000 1

1 1 0 1 1009 0.000000 1

1 1 0 1 1011 0.000000 1

1 1 0 1 1014 0.000000 1

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49#issuecomment-228009142, or mute the thread https://github.com/notifications/unsubscribe/AKR2a8Zcutm9iCXU6x2MW6KKhtSf0FdLks5qOl4KgaJpZM4I8jpm.

Nicolas Servant Bioinformatics & Biostatistics Analysis Team

Institut Curie, Plateforme de Bioinformatique Unité 900 : Institut Curie - Inserm - Mines ParisTech 26, rue d'Ulm - 75248 Paris Cedex 05 - FRANCE Email: Nicolas.Servant@curie.fr Tel: 01 56 24 69 85

Krannich479 commented 8 years ago

I couldn't find a ".allValidPairs file". Since I commented out line 147 I got 2 files. The one above and the second attached. (binary cryptic mess?)

attachment deleted

nservant commented 8 years ago

I see. I just look at your command line ;

|./hicpro2juicebox.sh -i /path/rawdata_1000000_iced.matrix -g myGenome -j /path/Juicebox_clt_jar/Juicebox.jar -t /path/tmp/ -o /path/|

The -i is the list of valid interactions (not the matrix). You should find this file in hic_results/data/XXX/*.allValidPairs

Le 23/06/2016 13:37, Thomas Krannich a écrit :

I couldn't find a ".allValidPairs file". Since I commented out line 147 I got 2 files. The one above and the second attached. (binary cryptic mess?)

rawdata_1000000_iced.matrix.head10.hic.zip https://github.com/nservant/HiC-Pro/files/329823/rawdata_1000000_iced.matrix.head10.hic.zip

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49#issuecomment-228023650, or mute the thread https://github.com/notifications/unsubscribe/AKR2a2ZR2mytdHmbMOMaSfc_AKqI1M-5ks5qOm_wgaJpZM4I8jpm.

Nicolas Servant Bioinformatics & Biostatistics Analysis Team

Institut Curie, Plateforme de Bioinformatique Unité 900 : Institut Curie - Inserm - Mines ParisTech 26, rue d'Ulm - 75248 Paris Cedex 05 - FRANCE Email: Nicolas.Servant@curie.fr Tel: 01 56 24 69 85

Krannich479 commented 8 years ago

First of all: thanks a lot for your effort & support! I found that file, I'll give it a try these days and come back to you.

nservant commented 8 years ago

great let me know if it works. I will try to update the code for any organism. I think I will simply replace the -g parameter, asking to provide the chromosome length file. This file is also used by HiC-Pro, so everybody should have it. I will make some test.

Le 23/06/2016 13:46, Thomas Krannich a écrit :

First of all: thanks a lot for your effort & support! I found that file, I'll give it a try these days and come back to you.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49#issuecomment-228025921, or mute the thread https://github.com/notifications/unsubscribe/AKR2a29v4j0Fk3bgy-sV9ABBk8771HYIks5qOnIhgaJpZM4I8jpm.

Nicolas Servant Bioinformatics & Biostatistics Analysis Team

Institut Curie, Plateforme de Bioinformatique Unité 900 : Institut Curie - Inserm - Mines ParisTech 26, rue d'Ulm - 75248 Paris Cedex 05 - FRANCE Email: Nicolas.Servant@curie.fr Tel: 01 56 24 69 85

nservant commented 8 years ago

Hi, Actually I made a short test, and the current script hicpro2juicebox is working as is ... I will just update the form to be as clear as possible.

For instance, for Human I usually used ; ~/GIT/HiC-Pro/bin/utils/hicpro2juicebox.sh -i hicpro_test/hic_results/data/dixon_2M/dixon_2M_allValidPairs -j ~/Apps/Juicebox/juicebox-master/out/artifacts/Juicebox_clt_jar/Juicebox.jar -g hg19

Now I tried

~/GIT/HiC-Pro/bin/utils/hicpro2juicebox.sh -i hicpro_test/hic_results/data/dixon_2M/dixon_2M_allValidPairs -j ~/Apps/Juicebox/juicebox-master/out/artifacts/Juicebox_clt_jar/Juicebox.jar -g ~/GIT/HiC-Pro/annotation/chrom_hg19.sizes -c

And it works well. So you can specify any chromosome sizes file to Juicebox without changing the data in /src/juicebox/tools/chrom/sizes/ Note that here I added a few weeks ago a "-c" option to add the "chr" on chromosome name depending on Juicebox. But I will remove it, and simply ask the user to specifiy the HiCPro chromosome file. The new version should be released in a few minutes. Best

Krannich479 commented 8 years ago

Hej, I general your commands should work fine, again, for default genomes. It took a while for me to get into your code now 100% to catch more pit holes. After keeping line 135 (tmp files) I found out how you process the chromosome names in your gigantic awk command: for some reasons (might be necessary for default genomes) you trim away the "chr" prefix and add it again later. However, with non-default genomes possibly come chromosome names without the exact "chr" prefix. This isn't a big deal for your rearrangement, but for the juicebox_CLT.jar afterwards. So, I can't really give you a particular solution to it since I don't know if you need this kind of prefix name treatment for default genomes, but I removed all "chr" prefixes from line 120 (no $RESFRAG case) on the variables $2 and $5 to get the parsing right for my non-default genome. Some minor comments on the function usage/help: 1.) If myGenome.chrom.sizes is within the JAR, you can still use the shortcut of the chromosome size file without the whole path, but I appreciate the new -g, it will speed up things in my further projects I guess 2.) -g GSIZE (help) doesn't match -g GENOME (usage)

I am not completely done yet with this hicpro2juicebox.sh, so maybe you can keep this thread open for a few more days in case I find more possible fixes/updates or you like to re-comment.

Thank you so much for your fast replies and solutions! (not all heroes wear capes)

Best, TK

nservant commented 8 years ago

Thanks to you for all your tests ;) I think it is fine now. I cleaned the code. Actually, using the HiCPro chromosome length file is much more simplier than using the Juicebox annotation. Now, I do not really care about chromosomes' name, and other details as I'm sure that there is a good concordance between all input files (as they are all generated using HiCPro) I commit a new version of the code where I simplified the gigantic awk command ... Many thanks again Nicolas

Le 28/06/2016 17:44, Thomas Krannich a écrit :

Hej, I general your commands should work fine, again, for default genomes. It took a while for me to get into your code now 100% to catch more pit holes. After keeping line 135 (tmp files) I found out how you process the chromosome names in your gigantic awk command: for some reasons (might be necessary for default genomes) you trim away the "chr" prefix and add it again later. However, with non-default genomes possibly come chromosome names without the exact "chr" prefix. This isn't a big deal for your rearrangement, but for the juicebox_CLT.jar afterwards. So, I can't really give you a particular solution to it since I don't know if you need this kind of prefix name treatment for default genomes, but I removed all "chr" prefixes from line 120 (no $RESFRAG case) on the variables $2 and $5 to get the parsing right for my non-default genome. Some minor comments on the function usage/help: 1.) If myGenome.chrom.sizes is within the JAR, you can still use the shortcut of the chromosome size file without the whole path, but I appreciate the new -g, it will speed up things in my further projects I guess 2.) -g GSIZE (help) doesn't match -g GENOME (usage)

I am not completely done yet with this hicpro2juicebox.sh, so maybe you can keep this thread open for a few more days in case I find more possible fixes/updates or you like to re-comment.

Thank you so much for your fast replies and solutions! (not all heroes wear capes)

Best, TK

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nservant/HiC-Pro/issues/49#issuecomment-229090779, or mute the thread https://github.com/notifications/unsubscribe/AKR2axJTGTpst67lsLkfrnkVT8R_0Kx-ks5qQUFhgaJpZM4I8jpm.