vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.
Other
26 stars 6 forks source link

Chess sim quits with ERROR "all regions need to span at least 20 bins " #23

Open rikrdo89 opened 3 years ago

rikrdo89 commented 3 years ago

I have two hic files, and I want to compare the differences across this files for a given set of loops (bedpe). Whenever I use the hic files, even at the lowest resolution of 5k, I always get the 20 bin error. Is there any way to disable this?

rikrdo89 commented 3 years ago

Also when I filtered my reads to be more than the 20X limit, 99.9% of my bedpe regions fail with the following error: "Could not compute similarity for [number] regions. this cna be due to faulty coordinates, too smallregion sizes or too many unmappable bins"

biozzq commented 3 years ago

Maybe you could try using the normalized hic files generated by juicer. I also found the cool files sometimes can fail. Please refer to https://github.com/vaquerizaslab/chess/issues/16, however, I also confused about the normalization procedures in the chess publication.

Best wishes, Zheng Zhuqing

nickmachnik commented 3 years ago

Hi @rikrdo89 , to answer your first question, chess should only quit with this message if all regions you submitted in the bedpe span less than 20 bins. Is that the case?

About the second: do I understand right that you get numeric output for some of the regions (0.01%) ? That is, not all rows in the ouput file are nan?

rikrdo89 commented 3 years ago

Hi Nick. Yes out of the 5000 regions i used (that span more than 20x bins) only one didn't have nan for the first two fields. The last field, however, also have a nan.

nickmachnik commented 3 years ago

could you try to update fanc to 0.9.9 (pip install fanc --upgrade)? @kaukrise just pushed a great fix for an issue with cooler file handling, which fixes the all NaN issue (your first question, possibly) for me.

rikrdo89 commented 3 years ago

I updated fanc to 0.9.9 and I still have the same issue.

$ chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv
2020-12-01 19:23:19,079 INFO Running 'user/miniconda3/envs/py3/bin/chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv'
2020-12-01 19:23:21,001 INFO CHESS version: 0.3.5
2020-12-01 19:23:21,001 INFO FAN-C version: 0.9.9
2020-12-01 19:23:21,003 INFO Loading reference contact data
2020-12-01 19:33:30,297 INFO Loading query contact data
2020-12-01 19:42:26,874 INFO Loading region pairs
2020-12-01 19:42:26,913 INFO Launching workers
2020-12-01 19:42:27,236 INFO Submitting pairs for comparison
2020-12-01 19:42:28,780 INFO Could not compute similarity for 4046 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins
2020-12-01 19:42:50,681 INFO Finished '/user/miniconda3/envs/py3/bin/chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv'
rikrdo89 commented 3 years ago

Do the regions in the bedpe need to be formatted in any particular way? I wonder if I need to do some sorting or maybe binning of the regions ? Or is this a different issue?

nickmachnik commented 3 years ago

The bedpe should look like the output of chess pairs. I take from this that you have generated the bedpe yourself, right? Could you generate a pairs file with that command and run it on your data, just to see if you still get so many nans?

rikrdo89 commented 3 years ago

Correct. I generated the bedpe from running a loop caller, and then formatting it in a way chess would not immediately quit ( removing chr, adding unique identifier, etc) I will run the chess pair command and let you know if it works for me. Just a clarification question, since my goal is to compare changes between two hic samples around already called loops and domains, was using 'chess sim' the way I was initially doing correct ?

nickmachnik commented 3 years ago

I think this is a reasonable application of chess, and the command you posted looks good to me.

rikrdo89 commented 3 years ago

I generated a bedpe using chess pairs --chromosome chr19 mm10 500000 100000 mm10_chr19_500kb_win_100kb_step.bedpe . The resulting bedpe has the string chr in the chromosomes (chr1, chr2...) and thus breaks chess sim (After waiting for a while to load the reference and query hic). I removed all instances of the chr string from the bedpe, fed it into:

chess sim ../../hic_links/dmso.allValidPairs.hic@5000 ../../hic_links/tx4h.allValidPairs.hic@5000 ../mm10_19_500kb_win_100kb_step.bedpe dmso_vs_tx4.tsv

and it now works. I get a result table as expected. However there were 37 out of 610 regions that failed, and this regions are mainly at the beginning and at the end of chromosome 19. So this makes me think that there may be a total read cutoff and thus why it may be failing in my loop calls? or maybe it is not just the distance of the loop anchor (20X bins) but also the actual size of the patch (submatrix) enclosed by the anchors that needs to be greater than a certain number of bins?

I will troubleshoot a little more. In the mean time, could you please recommend some optimal parameter for chess pair for doing a genome wide scan on a 5kb matrix? This time I used 500kb window-size and 100kb step-size, but if you could offer any guidance for setting this parameters, I would appreciate it a lot.

rikrdo89 commented 3 years ago

Also could you please clarify if the comparisons are being done with the raw values of the hic or normalized values? or could I specify which set of values to use by inputting something like control.hic@5000@KR ?

nickmachnik commented 3 years ago

The bedpe generated by chess sim has the "chr" prefixes because this is the format at USCS, from where the chromosome size data is retrieved by default.

chess requires all matrices to have at least 20 x 20 "pixels". If the distance between your loop anchors (i.e. the difference between columns 2 and 3 in the bedpe file) is at least 20 * your bin size, the region is large enough. However, chess is more suitable for larger regions, we recommend spans >= 100 bins.

Chess also imposes a maximum fraction of unmappable / masked bins in a matrix, by default this is 0.1 (see --mappability-cutoff in chess sim --help). The masking value is 0. So, if you have small regions of 20 bins span, 2 unmappable bins in that region are enough for the region to be filtered out, which you see as nans in the results file. This is common in repetitive regions like the ends of chromosomes. If you really need a comparison in a part of the genome which is difficult to map reads to you have to increase the region size, and/or lower the mappability-cutoff, but in general I would recommend to avoid such comparisons.

You have to provide normalized matrices, e.g. Knight-Ruiz balanced. The comparisons are done on observed/expected transformed matrices. If your data is not in OE format, it is automatically transformed. This all goes through FAN-C. I believe that the normalized values are used automatically if present in the provided data, but I am not 100 percent sure here: @kaukrise might now better.

I cannot really tell you what parameters to use for chess pairs, it depends what you are looking for, what the size of typical features in the species is etc. If this is a human or mouse genome, and you expect a median TAD size of 1mb, a good first run could be with 2-3mb window size. 250 kb step may be small enough. If you have 5kb resolution and want to focus on differences within smaller linear genomic spans, 500kb may be a good choice too. For now, chess really requires the user to play around a little, I hope that we might find some good rules for these parameters with more experience.

rikrdo89 commented 3 years ago

This is quite helpful... I see, so the problem must be the 20x20 pixels, i.e. the 20X min is actually the size of the side sub-matrix, not the min distance of the loop. I was filtering my loop coordinates based on the distance between column 5 and column 2. So maybe running chess sim for small areas of local enrichment (loops or corner peaks) is not ideal, since a 20x20 region will make a really large square... unless you think it may be worth expanding the loop regions in the bedpe file to encompass more bins, and then re-running chess sim

For the balanced matrices, I can use a cool file with all the normalization and balance I previously applied to it, but chess takes a long time. So I switched to the .hic files, which have different resolutions, balancing, and O/E values. So I guess if I use a hic file, I can specify the resolution ( say 10Kb using myFile.hic@10000 ), and chess will use the O/E stored in this file... But it is my understanding that the .hic file stores different O/E values for each of the corrected values (KR, Vanilla coverage, etc)... At least that is what I see when I explore the .hic file in juicer... I think in FAN-C you can specify the correction by adding a @ such as myFile.hic@10000@KR ... I guess I can try it in chess... and wait.... to see if it changes anything.

This is extremely helpful. Thanks Nick!

nickmachnik commented 3 years ago

One more thing, it just occurred to me that I might be misunderstanding what you are trying to do: if you say that you want to compare loops, does that mean that you are trying to somehow feed off-diagonal regions to chess, for instance a small submatrix around a TAD corner peak? If yes, then I am sorry: this is not what chess was designed for, and it will not work. The software was made to compare regions along the diagonal of HiC matrices between different datasets, which are specified just with the start bin and end coordinates (columns 2, 3 for the reference and 5 and 6 for the query).

rikrdo89 commented 3 years ago

Yeah that's what I initially thought, I was hoping it could make some comparisons of these loops as well... and I think it does to some extend as long as they are large regions, but I understand that chess was not made to make this comparisons. Thanks again Nick. Now I am doing a whole genome scan as you did in your paper. I am inputting myFile.hic@5000@KR and so far chess has not complained about the extra @KR, but it is still reading the file ... ... ...