koszullab / chromosight

Computer vision based program for pattern recognition in chromosome (Hi-C) contact maps
https://chromosight.readthedocs.io
Other
61 stars 9 forks source link

Features to add and things to improve #2

Closed cmdoret closed 4 years ago

cmdoret commented 5 years ago

The sparse version of the pipeline is functional as of 03fa26a30573ecedcd60519caf81582454f92d4d but a few things could be improved:

cmdoret commented 5 years ago

Refactoring started in refactoring branch.

cmdoret commented 5 years ago

Also maybe replace partial functions and global variables for pattern-specific configurations JSON config files for each pattern. E.g.: loops.json:

{ "name" : "loops",
  "kernels" : [
        "kernels/loops1.txt",
        "kernels/loops2.txt",
    ],
  "max_dist" : 2000000,
  "max_iterations" : 3,
  "max_perc_undetected" : 30,
}

which can then be loaded as :

import json
with open("loops.json", "r") as config:
    kernel_config = json.load(config)

And accessed as a dictionary:

>>>kernel_config["max_iterations"]
3

This would allow to completely separate kernels from code and to add new kernels without modifying the source code

cmdoret commented 5 years ago

Support for kernel config files added in 7d611b60f655312bf7c4c9cb704308e851a17840. I also added a resolution field in the config that we could eventually use to rescale the kernel to the input matrix.

axelcournac commented 5 years ago

To improve the labelling of poor interacting bins and their subsequent filtering, we use mads:

mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0)
threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads 
poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)

This should remove < 10% of the bins and remove potential false positive. Le 2.0 pourrait aussi être un paramètre.

axelcournac commented 5 years ago

I tested it and on the data yeast at 1 kb it removes lot of false positive.

cmdoret commented 5 years ago

To improve the labelling of poor interacting bins and their subsequent filtering, we use mads:

mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0)
threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads 
poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)

This should remove < 10% of the bins and remove potential false positive. Le 2.0 pourrait aussi être un paramètre.

Thanks Axel ! added this in a46979a81ed6529607c17a51f7add61fa9011c88

axelcournac commented 5 years ago
def detrend(matrix):

    mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0)
    threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads  # Removal of poor interacting bins
    poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)
    print("shape of the matrice")
    print(shape(matrix))
    print( len(poor_indices[0] ) )
    print( len(poor_indices[0] )  / matrix.shape[0] * 100.0 )
    matscn = scn_func(matrix, 0)
    _, matscn, _, _ = despeckles(matscn, 10.0)

    y, y_dist_median, y_dist_mads = distance_law(matscn)
    y[np.isnan(y)] = 0.0
    y_savgol = savgol_filter(y, window_length=17, polyorder=5)

    # here filtering of area with too much noise: 
    snr = y_dist_median /  y_dist_mads
    threshold_noise  = 1.0 
    snr[np.isnan(snr)] = 0.0
    y_savgol[ snr < threshold_noise ] = 0.0

    n = matrix.shape[0]

    # Computation of genomic distance law matrice:
    distance_law_matrix = np.zeros((n, n))
    for i in range(0, n):
        for j in range(0, n):
            distance_law_matrix[i, j] = y_savgol[abs(j - i)]
    detrended = matscn / distance_law_matrix

    detrended[np.isinf(detrended)] = 0.0
    detrended[np.isnan(detrended)] = 0.0
#    detrended[detrended < 0] = 1.0
    # refilling of empty bins with 1.0 (neutral):
    detrended[poor_indices[0], :] = np.ones((len(poor_indices[0]), n))
    detrended[:, poor_indices[0]] = np.ones((n, len(poor_indices[0])))
    return detrended, threshold_vector
cmdoret commented 4 years ago

All points have been addressed in #6 closing now