vaquerizaslab / fanc

FAN-C: Framework for the ANalysis of C-like data
GNU General Public License v3.0

Clarification on matrix weights, scaling, and visualization in the context of `fanc compare` #171

Closed kalavattam closed 1 year ago

kalavattam commented 1 year ago

Hi @kaukrise,

Thanks again for this great program. I’ve been using fanc compare to generate log2-transformed comparison values with the following command:

fanc compare \
    --comparison "fold-change" \
    --log \
    --ignore-zero \
    --ignore-infinite \
    ${f_Q}@${res} ${f_2}@${res} \
    ${d_out}/${m_Q2}

I have a few questions I hope you can clarify:

  1. Applying weights: When using .cool/.mcool files, does fanc compare automatically apply weights?
  2. Scaling matrices: For matrices that have undergone KR balancing (where the sum of any given row equals the sum of any given column, and this constant is uniform across all rows and columns), is it a prerequisite to scale matrices to the same number of pairs before using fanc compare? Does this requirement also apply to ICE-balanced matrices?
  3. Visualizing zero-pair regions: When visualizing the resulting matrix with fancplot, is there a way to exclude regions with zero pairs from being colored by the colormap? In the attached PDF, it appears that regions without pairs are assigned a value of 0.
    fancplot \
    -o "pdfs/2023-0929/${m_Q2}_XII-1-800000.pdf" \
    "XII:1-800000" \
    -p square \
    --title "${res} bp, XII:1-800000" \
    -s 0 \
    -c "coolwarm" \
    "${d_out}/${m_Q2}"

log2_Q-over-G2_6400_XII-1-800000.scale.pdf

I appreciate your time and assistance. Thanks, Kris

kaukrise commented 1 year ago

Hi, thank you for your message.

  1. I presume you mean the bias vector - in that case, yes, it is automatically applied. To turn it off, you can use the -u parameter.
  2. ICE and KR balancing should give highly similar results, so this applies to both methods. If the matrices are already scaled to the same number of contacts per chromosome (this includes a transformation into contact probabilities), there is no need for additional scaling. In FAN-C, KR and ICE balancing do this by default (-S would be recommended). Last time I checked, Juicer restored the original number of contacts, so scaling would be necessary. I don't know what Cooler does, however; you would need to consult their docs.
  3. "Unmappable" regions (with 0 contacts) are filtered out in FAN-C using the mappability vector, and coloured grey in the matrix. In Juicer we rely on NaN entries in the bias vector to do the same, but we have not implemented anything similar for Cooler files. What does the bias vector look like in those regions? Is it NaN or 0? Or does it contain a positive, finite value?
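
As a rough illustration of point 2, the NumPy sketch below shows why two matrices should be on the same scale before a log2 fold-change is taken. It is not FAN-C's implementation: the helper names `scale_to_total` and `log2_fold_change` and the toy matrices are assumptions made for this example, and zero or non-finite entries are simply set to NaN.

```python
import numpy as np


def scale_to_total(matrix, total=1.0):
    """Rescale a contact matrix so that its entries sum to `total` (hypothetical helper)."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix * (total / np.nansum(matrix))


def log2_fold_change(a, b):
    """Return log2(a / b), with NaN wherever either entry is zero or non-finite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    valid = np.isfinite(a) & np.isfinite(b) & (a > 0) & (b > 0)
    out = np.full(a.shape, np.nan)
    out[valid] = np.log2(a[valid] / b[valid])
    return out


# Toy matrices: `deep` has the same contact pattern as `shallow`,
# but four times as many contacts overall.
shallow = np.array([[4.0, 2.0, 0.0],
                    [2.0, 6.0, 1.0],
                    [0.0, 1.0, 8.0]])
deep = 4.0 * shallow

# Without scaling, the depth difference shows up as a constant offset of 2.
unscaled = log2_fold_change(deep, shallow)

# After scaling both to the same total, only pattern differences remain (none here).
scaled = log2_fold_change(scale_to_total(deep), scale_to_total(shallow))
```

In this toy case the unscaled comparison reports a uniform log2 value of 2 that reflects sequencing depth only, while the scaled comparison is flat at 0. Entries that are zero in either matrix stay NaN in both results.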

kalavattam commented 1 year ago

Thank you for taking the time to address my questions. In light of your explanations and the program's detailed documentation, I opted to convert my raw .cool files to FAN-C .hic files. Working with .hic files, in this case, seems to offer a more straightforward path to achieving my objectives.

For point 3 above, I have sought clarification here; for point 2 above, here. Thank you again—closing the issue now.

kalavattam commented 1 year ago

Response from Nezar to point 2:

By default, cooler rescales the target matrix (whole genome by default, or each chromosome for cis-only balancing) to make the marginal sums = 1.

This can be turned off with the rescale_marginals option in balance_cooler; however, we also store the original scaling factor in the metadata attributes of the weight vector:

with clr.open("r") as f:
    scale = f["bins/weight"].attrs["scale"]

This scaling factor corresponds to the marginal sum of the target matrix at the end of balancing, which roughly corresponds to its average read coverage. If desired, you can restore this scale factor by multiplying a balanced matrix by scale or, equivalently, by multiplying the balancing weight vector by sqrt(scale). For a log2-ratio though, you don't want your contact frequencies to be proportional to coverage.
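
The equivalence between multiplying the balanced matrix by scale and multiplying the weights by sqrt(scale) can be checked numerically. The sketch below uses made-up numbers: the `weights` vector and the `scale` value are hypothetical and are not the result of actually balancing `raw`. The identity holds for any weight vector, because each matrix entry is multiplied by two weights.

```python
import numpy as np

# Hypothetical raw counts and weights, for illustration only.
raw = np.array([[10.0, 4.0, 2.0],
                [4.0, 8.0, 6.0],
                [2.0, 6.0, 12.0]])
weights = np.array([0.2, 0.25, 0.3])
scale = 16.0

# Balanced matrix: B_ij = w_i * w_j * A_ij
balanced = np.outer(weights, weights) * raw

# Option 1: multiply the balanced matrix by scale.
restored_via_matrix = scale * balanced

# Option 2: multiply the weight vector by sqrt(scale), then apply it.
rescaled_weights = weights * np.sqrt(scale)
restored_via_weights = np.outer(rescaled_weights, rescaled_weights) * raw
```

Both routes give the same matrix, since sqrt(scale) * sqrt(scale) = scale.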

Response from Nezar to point 3:

Regions with 0 contacts are normally masked out for matrix balancing (the algorithm would never converge if they were kept). Masked/filtered bins are normally encoded as NaN in the weight vector, which will "NaN-out" the corresponding row/column of the matrix when the weights are applied (B_ij = w_i w_j A_ij).
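
A small NumPy sketch of this "NaN-out" behaviour follows. The matrix and weights are invented for illustration, with the middle bin standing in for a masked region.

```python
import numpy as np

# Toy raw matrix in which the middle bin has no contacts at all.
raw = np.array([[5.0, 0.0, 3.0],
                [0.0, 0.0, 0.0],
                [3.0, 0.0, 7.0]])

# Hypothetical weight vector; the masked bin is encoded as NaN.
weights = np.array([0.5, np.nan, 0.4])

# B_ij = w_i * w_j * A_ij: any product involving NaN is NaN,
# so the whole row and column of the masked bin become NaN.
balanced = np.outer(weights, weights) * raw

# A masked array keeps the NaN cells apart from genuine values.
masked = np.ma.masked_invalid(balanced)
```

Because the masked row and column are NaN rather than 0, they can be told apart from bins with a genuine value of 0. In matplotlib, for example, masked cells are drawn in the colormap's "bad" colour (set with `set_bad`) instead of being mapped onto the colour scale.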