Closed drlaurenwasson closed 3 years ago
Hi,
The error you are getting with 100kb windows and a 10kb step size occurs because your regions are only 10 Hi-C bins wide (100kb windows at a 10kb bin size). As the error message says, all regions need to span at least 20 Hi-C bins. For example, 200kb windows would be the minimum for Hi-C at 10kb resolution (personally, I find that regions of at least 100 bins are usually best).
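The arithmetic behind that minimum can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of CHESS, and the 20-bin limit is taken from the error message quoted in this thread.

```python
# Minimum region width in bins, as stated in the CHESS error message.
MIN_BINS = 20


def bins_spanned(window_bp: int, bin_size_bp: int) -> int:
    """Number of whole Hi-C bins covered by a window of the given size."""
    return window_bp // bin_size_bp


def min_window_bp(bin_size_bp: int, min_bins: int = MIN_BINS) -> int:
    """Smallest window size (in bp) that spans at least `min_bins` bins."""
    return min_bins * bin_size_bp


if __name__ == "__main__":
    bin_size = 10_000  # bin size detected in the log
    for window in (100_000, 1_000_000, 3_000_000):
        n = bins_spanned(window, bin_size)
        status = "ok" if n >= MIN_BINS else "too small"
        print(f"{window} bp window -> {n} bins ({status})")
    print("minimum window:", min_window_bp(bin_size), "bp")
```

With a 10kb bin size, only the 100kb windows fall below the limit; the 1Mb and 3Mb windows span 100 and 300 bins respectively.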
For the first one, let me see if I understand correctly.
I ask because I generated a pairs file for hg38 with 3Mb windows and a 100kb step size, which gives 30,321 pairs. 147 of these are on 'alt' chromosomes, consistent with "WARNING 147 region pairs have been dropped, because they involve chromosomes that are not present in the provided contact data". Only 1,786 are on chr5. "INFO Could not compute similarity for 28497 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins" would be consistent with all region pairs from other chromosomes having missing data, plus ~100 from chr5 that could be unmappable.
It's possible that the way the subsetting was done means that your .hic file still contains the other chromosomes, just without any data available on them. That would result in log messages like those you see above. If this is the case, and you do have some non-NA results on chr5, then I think your analysis is fine!
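The bookkeeping behind these counts can be checked with a short script. This is a sketch under one assumption, labeled in the code: every pair on a chromosome other than the target fails because the subsetted file holds no data for it. The function name is hypothetical; the numbers are the ones quoted in this reply and in the log.

```python
def account_for_pairs(total: int, dropped: int, on_target: int, failed: int) -> dict:
    """Split failed pairs into off-target and on-target failures.

    Assumption: every pair on a non-target chromosome that was not
    dropped fails because the contact file has no data for it.
    """
    remaining = total - dropped            # pairs actually submitted
    off_target = remaining - on_target     # pairs on other chromosomes
    failed_on_target = failed - off_target # failures left over for the target
    return {
        "remaining": remaining,
        "off_target": off_target,
        "failed_on_target": failed_on_target,
        "computed_on_target": on_target - failed_on_target,
    }


if __name__ == "__main__":
    counts = account_for_pairs(
        total=30_321,     # pairs in the hg38 3Mb/100kb pairs file
        dropped=147,      # pairs on 'alt' chromosomes (WARNING line)
        on_target=1_786,  # pairs on chr5
        failed=28_497,    # pairs without a similarity score (INFO line)
    )
    for name, value in counts.items():
        print(f"{name}: {value}")
```

Under that assumption, 109 of the chr5 pairs fail and 1,677 get a score, which matches the "~100 from chr5" estimate above.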
Hello, you were correct. I didn't make the pairs file correctly (what I thought was just chromosome 5 in fact had all chromosomes). I fixed that and it worked great. Thank you!
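For anyone hitting the same problem, one way to restrict a pairs file to a single chromosome is sketched below. It is not a CHESS feature: the function name is hypothetical, and the column layout (chromosome names in the first and fourth columns) is assumed from the example rows posted in this thread. The chromosome name passed in must match the naming used in the contact data (for example "5" versus "chr5").

```python
from typing import Iterable, Iterator


def filter_pairs(lines: Iterable[str], chrom: str) -> Iterator[str]:
    """Yield only the pair rows where both regions lie on `chrom`.

    Assumed layout (from the rows shown in this thread):
    chrom1 start1 end1 chrom2 start2 end2 id . + +
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        if fields[0] == chrom and fields[3] == chrom:
            yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python filter_pairs.py all_pairs.bed 5 > chr5_pairs.bed
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as handle:
            for row in filter_pairs(handle, sys.argv[2]):
                print(row)
```

Counting the distinct values in the first column before running `chess sim` is a quick check that the file really contains only the intended chromosome.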
Hello, sorry for another post about NaN, but I think I've ruled out the issues from the other posts.
I tried three comparisons between a wild-type and a mutant Hi-C dataset, subsetting for a particular chromosome.
chess sim \
  WT_combined_chr5.hic \
  MUT_het_new_combined_chr5.hic \
  hg38_chr5_1mb_win_10kb_step_2.bed \
  OUTPUT_1mb_win_10kb_step_chess_results.tsv
I tried three bed files: 3mb_win_100kb_step, 1mb_win_10kb_step, and 100kb_win_10kb_step.
The first two give me this log file:
2021-05-19 03:40:54,378 INFO Running '/home/lkw10/.conda/envs/my_CHESS_env_374/bin/chess sim PGP1_combined_chr5.hic CHD4_het_new_combined_chr5.hic hg38_chr5_3mb_win_100kb_step_2.bed PGP1_combined_chr5_vs_CHD4_het_new_combined_chr5_3mb_win_100kb_step_chess_results.tsv'
2021-05-19 03:40:56,359 INFO CHESS version: 0.3.6
2021-05-19 03:40:56,359 INFO FAN-C version: 0.9.18
2021-05-19 03:40:56,361 INFO Loading reference contact data
2021-05-19 03:45:26,042 INFO Loading query contact data
2021-05-19 03:48:41,747 INFO Loading region pairs
2021-05-19 03:48:41,913 WARNING 147 region pairs have been dropped, because they involve chromosomes that are not present in the provided contact data.
2021-05-19 03:48:41,926 INFO Launching workers
2021-05-19 03:48:42,115 INFO Submitting pairs for comparison
2021-05-19 04:13:58,868 INFO Could not compute similarity for 28497 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins
2021-05-19 04:14:04,193 INFO Finished '/home/lkw10/.conda/envs/my_CHESS_env_374/bin/chess sim PGP1_combined_chr5.hic CHD4_het_new_combined_chr5.hic hg38_chr5_3mb_win_100kb_step_2.bed PGP1_combined_chr5_vs_CHD4_het_new_combined_chr5_3mb_win_100kb_step_chess_results.tsv'
The third (highest-resolution) run gives me this log:
2021-05-19 04:59:58,617 INFO Running '/home/lkw10/.conda/envs/my_CHESS_env_374/bin/chess sim PGP1_combined_chr5.hic CHD4_het_new_combined_chr5.hic hg38_chr5_100kb_win_10kb_step_2.bed PGP1_combined_chr5_vs_CHD4_het_new_combined_chr5_100kb_win_10kb_step_chess_results.tsv'
2021-05-19 04:59:59,942 INFO CHESS version: 0.3.6
2021-05-19 04:59:59,942 INFO FAN-C version: 0.9.18
2021-05-19 04:59:59,943 INFO Loading reference contact data
2021-05-19 05:04:33,082 INFO Loading query contact data
2021-05-19 05:07:54,585 INFO Loading region pairs
2021-05-19 05:07:58,064 WARNING 9089 region pairs have been dropped, because they involve chromosomes that are not present in the provided contact data.
2021-05-19 05:07:58,243 ERROR All regions need to span at least 20 bins. The provided reference regions span at most 10 bins. Please try again with larger regions or a smaller bin size. The bin size of the input data has been detected to be 10000
The bed files look like this:
1 1 100001 1 1 100001 0 . + +
1 10001 110001 1 10001 110001 1 . + +
1 20001 120001 1 20001 120001 2 . + +
1 30001 130001 1 30001 130001 3 . + +
They were generated using chess pairs and I removed the "chr" label.
The .hic files were generated using HOMER from pairs files generated using cooler/Juicer.
I would very much appreciate your help.