Who is going to publish such data without repeating the measurements? How much computational brute force, and how many other kinds of models, are necessary to obtain a good fit of C1? Does the late behaviour of the curve have any biological meaning? Would it not be better to use a state-of-the-art dataset for this part?
I suggest this dataset as a benchmark: adk_strains.csv and adk_over_expression.csv
from: https://www.nature.com/articles/s41559-017-0149
I mean, there is no interesting biology in this dataset, but the data are very high quality.
Other examples from LG110:
The only point of this exercise is to:
We don't need to use the worst-case scenario. As discussed previously, let's show:
Then we can run the code on Alberto's datasets to analyse the experimental data.
You should also run it on the Lowess-smoothed data.
Question: which method did you use in the fits above? (Those data have lots of issues; I think the fits are good.)
The plot above is Lowess-smoothed (with parameter = 0.05) and it is an NL segmentation. The command to obtain it is a new one:
```julia
using DifferentialEquations, Optimization, Plots, Random
using CSV, DataFrames
using Statistics
using Optim
using OptimizationBBO
using NaNMath
using StatsBase
using Tables
using Distributions
using Interpolations
using Peaks
using ChangePointDetection
using Lowess
using LsqFit
using Combinatorics

# upper/lower parameter bounds for each NL model
nl_ub_1 = [2.0001, 10.00000001, 5000.00]
nl_lb_1 = [0.0001, 0.00000001, 0.00]
nl_ub_2 = [2.0001, 10.00000001, 5.00, 5.0]
nl_lb_2 = [0.0001, 0.00000001, 0.00, 0.0]
nl_ub_3 = [2.0001, 10.00000001]
nl_lb_3 = [0.0001, -1.10000001]
nl_ub_4 = [2.0001, 10.00000001, 500.00]
nl_lb_4 = [0.0001, 0.00000001, 0.00]

list_models_f = ["NL_Gompertz", "NL_Bertalanffy", "NL_exponential", "NL_logistic"]
list_lb = [nl_lb_1, nl_lb_2, nl_lb_3, nl_lb_4]
list_ub = [nl_ub_1, nl_ub_2, nl_ub_3, nl_ub_4]

K = fit_NL_segmentation_file(
    "", # label of the experiment
    path_to_data, # path to the folder to analyze
    path_to_annotation, # path to the annotation of the wells
    list_models_f, # NL models to use
    list_lb, # lower bounds of the parameters
    list_ub, # upper bounds of the parameters
    4; # number of change points
    method_of_fitting="MCMC",
    nrep=50,
    list_u0=list_lb .+ (list_ub .- list_lb) ./ 2, # initial parameter guess
    optmizator=BBO_adaptive_de_rand_1_bin_radiuslimited(), # optimization method
    path_to_results="NA", # path where to save results
    path_to_plot="NA", # path where to save plots
    loss_type="RE", # type of loss used
    smoothing=true, # smooth the data
    type_of_smoothing="lowess",
    display_plots=true, # display plots in Julia or not
    save_plots=false,
    write_res=false, # write results
    pt_avg=1, # number of points for the rolling-average smoothing
    pt_smooth_derivative=0, # number of points for the smoothed derivative
    do_blank_subtraction="avg_blank", # how to use blanks ("NO", "avg_subtraction", "time_avg")
    avg_replicate=false, # if true, the average of the replicates is fitted; if false, all replicates are fitted independently
    correct_negative="thr_correction", # "thr_correction" applies a threshold to the minimum value of the blank-subtracted data; "blank_correction" uses the blank distribution to impute negative values
    thr_negative=0.01, # used only if correct_negative == "thr_correction"
    multiple_scattering_correction=false, # if true, uses the given calibration curve to correct the data
    method_multiple_scattering_correction="interpolation",
    calibration_OD_curve="NA", # path to the calibration curve used to correct the data
    PopulationSize=100,
    maxiters=2000000,
    abstol=0.00001,
    thr_lowess=0.05,
    dectect_number_cdp=false,
    fixed_cpd=true,
    penality_CI=4.0,
    beta_smoothing_ms=2.0,
    verbose=false,
    win_size=14, # number of points of the change-point-detection sliding window
)
```
What I am saying is that I do not know whether it is worth focusing so much on dataset LG110 for the required tests; I am not saying we should skip the tests. This kind of behaviour is very far from any model. To run stability and inference tests across the different methods, would it not be better to choose a plate with less "strange" behaviour? Otherwise we will encounter more bad performances or unstable situations. BTW, I prepared all the scripts for the tests and we can change dataset very easily.
E.g., LG119 is better.
In general, what I mean is: I have to create a benchmark/comparison of all the methods on synthetic data and real data. Would it not be better to present a good benchmark? Using data like LG110 brings more difficulty.
Sure, there is no need to benchmark the whole thing on all datasets (it is a validation of the code). Use a dataset you like, the one from the Shaknovich paper or LG119. LG110 is good for understanding very nonstandard things (we might even show an example of how the code does in these cases, because no other code that fits microbial growth covers this). But to benchmark, go with one dataset you like.
I like LG119. But how is the continuity being enforced? It looks like it's fitting segment-wise and plotting without continuity (from the pics above).
In a while I will add the segmentation tests required on dataset LG119 to the other repo.
The continuity for the NL fit is enforced backward with a multi-objective loss: loss = loss_fit + penalty * loss_boundary.
We discussed it; it is the easiest way to implement it, but I think it is sub-optimal. Also, for the moment I have no idea how to improve it in general for any possible function.
Here are my comments from while I was developing it; there are also various examples of the correct initialization of the NL fit given an ODE simulation (https://github.com/pinheiroGroup/JMAKi.jl/issues/4#issuecomment-1945806394).
By using this loss = loss_fit + penalty * loss_boundary, are you adding something like an infinite penalty if the values of the fitted left and right functions (around a segment boundary) are different?
It is not infinite, otherwise convergence is practically impossible (that case corresponds to penalty = ∞), because loss_boundary is never exactly 0.0.
I add a fake point on only one boundary of each segment (for segments in the middle): the last point of the previous fit. The loss is:

loss = loss_fit + loss_boundary, with loss_boundary = error_fake_point * n_data / input_of_user

In this way, input_of_user = 2 means that the error on this single fake point weighs half of the total dataset/segment: its residual is multiplied by n_data / 2, so it counts as much as half of the segment's points. A minimal sketch of this scheme is given below.
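This is not JMAKi's actual internals, just a minimal sketch of the scheme described above; `model`, `t_seg`, `y_seg`, `t_fake`, `y_fake`, and `input_of_user` are hypothetical names for the segment data, the single fake boundary point, and the user weight:

```julia
# Hypothetical sketch of the boundary-penalized loss described above.
# `model(t, p)` is any NL growth model; `t_seg`/`y_seg` are the segment data;
# (`t_fake`, `y_fake`) is the single fake point taken from the last point of
# the previously fitted segment; `input_of_user` is the weighting knob.
function segment_loss(model, p, t_seg, y_seg, t_fake, y_fake, input_of_user)
    n_data = length(t_seg)
    # ordinary fit loss on the segment (a plain sum of squared residuals here)
    loss_fit = sum(abs2, model.(t_seg, Ref(p)) .- y_seg)
    # squared error on the single fake boundary point
    error_fake_point = abs2(model(t_fake, p) - y_fake)
    # scaled so that input_of_user = 2 makes the fake point weigh
    # as much as half of the segment's data
    loss_boundary = error_fake_point * n_data / input_of_user
    return loss_fit + loss_boundary
end

# e.g., with a logistic model:
logistic(t, p) = p[1] / (1 + exp(-p[2] * (t - p[3])))
t_seg = collect(0.0:0.5:10.0)
y_seg = logistic.(t_seg, Ref([1.0, 1.2, 5.0]))
segment_loss(logistic, [0.9, 1.0, 4.5], t_seg, y_seg, 10.5, 1.0, 2.0)
```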
In the example before I was testing the overfitting; if you tune input_of_user you can get good continuity. In the other issue there are some examples with a very high penalty, where you get a practically flat line, and also with a "good" penalty.
This is sub-optimal
I do not know better solutions. E.g., is it possible to write this constraint as an inequality and use optimization methods that can take such input?
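For what it's worth, one possible direction (a sketch under assumptions, not tested against JMAKi's internals): Optimization.jl accepts general constraints through the `cons` keyword of `OptimizationFunction` together with `lcons`/`ucons` bounds on `OptimizationProblem`, and a constrained solver such as Ipopt (via OptimizationMOI) can handle them. Here the boundary mismatch is written as the inequality `-tol <= model(t_b, p) - y_b <= tol` instead of a penalty; `logistic`, `t_b`, `y_b`, and `tol` are hypothetical placeholders:

```julia
using Optimization, OptimizationMOI, Ipopt, ForwardDiff

# Hypothetical example: fit a logistic segment while constraining its value
# at the boundary t_b to stay within ±tol of the neighbouring segment's endpoint y_b.
logistic(t, p) = p[1] / (1 + exp(-p[2] * (t - p[3])))
t_seg = collect(0.0:0.5:10.0)
y_seg = logistic.(t_seg, Ref([1.0, 1.2, 5.0]))
t_b, y_b = 10.5, 1.0   # boundary point taken from the neighbouring fit
tol = 0.01             # allowed mismatch at the boundary

obj(p, _) = sum(abs2, logistic.(t_seg, Ref(p)) .- y_seg)  # plain fit loss
cons!(res, p, _) = (res[1] = logistic(t_b, p) - y_b)      # boundary residual

optf = OptimizationFunction(obj, Optimization.AutoForwardDiff(); cons = cons!)
prob = OptimizationProblem(optf, [0.9, 1.0, 4.5]; lcons = [-tol], ucons = [tol])
sol = solve(prob, Ipopt.Optimizer())
```

Whether this behaves better than the penalty for arbitrary NL models is exactly the open question, but it would at least remove the need to tune `input_of_user`.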
Examples: ODE (fitting forward in segments; the boundary is naturally enforced)
NL (fitting backward in segments), using initialization from the ODE solution
Low penalty for the boundary
High penalty for the boundary
Closing; this discussion will continue in the dedicated repository.
For the moment I am using experiment LG110.
I understand that we want JMAKi working in the worst-case scenario, but it really is a lot of work to tune the algorithm to handle such data. Would it not be better to use another dataset?