Closed cmdoret closed 4 years ago
Refactoring started in refactoring
branch.
Also maybe replace partial functions and global variables for pattern-specific configurations JSON config files for each pattern. E.g.: loops.json:
{ "name" : "loops",
"kernels" : [
"kernels/loops1.txt",
"kernels/loops2.txt",
],
"max_dist" : 2000000,
"max_iterations" : 3,
"max_perc_undetected" : 30,
}
which can then be loaded as :
import json
with open("loops.json", "r") as config:
kernel_config = json.load(config)
And accessed as a dictionary:
>>>kernel_config["max_iterations"]
3
This would allow to completely separate kernels from code and to add new kernels without modifying the source code
Support for kernel config files added in 7d611b60f655312bf7c4c9cb704308e851a17840. I also added a resolution field in the config that we could eventually use to rescale the kernel to the input matrix.
To improve the labelling of poor interacting bins and their subsequent filtering, we use mads:
mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0)
threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads
poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)
This should remove < 10% of the bins and remove potential false positive. Le 2.0 pourrait aussi être un paramètre.
I tested it and on the data yeast at 1 kb it removes lot of false positive.
To improve the labelling of poor interacting bins and their subsequent filtering, we use mads:
mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0) threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)
This should remove < 10% of the bins and remove potential false positive. Le 2.0 pourrait aussi être un paramètre.
Thanks Axel ! added this in a46979a81ed6529607c17a51f7add61fa9011c88
def detrend(matrix):
mads = np.nanmedian( abs(matrix.sum(axis=0)-np.median(matrix.sum(axis=0))), 0)
threshold_vector = np.median(matrix.sum(axis=0)) - 2.0 * mads # Removal of poor interacting bins
poor_indices = np.where(matrix.sum(axis=0) <= threshold_vector)
print("shape of the matrice")
print(shape(matrix))
print( len(poor_indices[0] ) )
print( len(poor_indices[0] ) / matrix.shape[0] * 100.0 )
matscn = scn_func(matrix, 0)
_, matscn, _, _ = despeckles(matscn, 10.0)
y, y_dist_median, y_dist_mads = distance_law(matscn)
y[np.isnan(y)] = 0.0
y_savgol = savgol_filter(y, window_length=17, polyorder=5)
# here filtering of area with too much noise:
snr = y_dist_median / y_dist_mads
threshold_noise = 1.0
snr[np.isnan(snr)] = 0.0
y_savgol[ snr < threshold_noise ] = 0.0
n = matrix.shape[0]
# Computation of genomic distance law matrice:
distance_law_matrix = np.zeros((n, n))
for i in range(0, n):
for j in range(0, n):
distance_law_matrix[i, j] = y_savgol[abs(j - i)]
detrended = matscn / distance_law_matrix
detrended[np.isinf(detrended)] = 0.0
detrended[np.isnan(detrended)] = 0.0
# detrended[detrended < 0] = 1.0
# refilling of empty bins with 1.0 (neutral):
detrended[poor_indices[0], :] = np.ones((len(poor_indices[0]), n))
detrended[:, poor_indices[0]] = np.ones((n, len(poor_indices[0])))
return detrended, threshold_vector
All points have been addressed in #6 closing now
The sparse version of the pipeline is functional as of 03fa26a30573ecedcd60519caf81582454f92d4d but a few things could be improved:
chromovision create-kernel
) to generate template kernel config to help users making new configs.