Who is going to publish such data without repeating the measurements? How much computational brute force, and how many other kinds of models, are necessary to obtain a good fit of C1? Does the late behaviour of the curve have any biological meaning? Would it not be better to use a state-of-the-art dataset for this part?
I suggest this dataset as a benchmark: adk_strains.csv and adk_over_expression.csv
from: https://www.nature.com/articles/s41559-017-0149
I mean, there is no interesting biology in this dataset, but the data are very high quality.
Other examples from LG110:
The only point of this exercise is to:
We don't need to use the worst-case scenario. As discussed previously, let's show:
Then we can run the code on Alberto's datasets to analyse the experimental data.
You should also run it on the Lowess-smoothed data.
Question: which method did you use in the fits above? (Those data have lots of issues; I think the fits are good.)
The plot above is Lowess-smoothed (with parameter = 0.05) and it is an NL segmentation. The command to obtain it is a new one:
```julia
using DifferentialEquations, Optimization, Plots, Random
using CSV, DataFrames
using Statistics
using Optim
using OptimizationBBO
using NaNMath
using StatsBase
using Tables
using Distributions
using Interpolations
using Peaks
using ChangePointDetection
using Lowess
using LsqFit
using Combinatorics

# upper/lower parameter bounds for each NL model
nl_ub_1 = [2.0001, 10.00000001, 5000.00]
nl_lb_1 = [0.0001, 0.00000001, 0.00]
nl_ub_2 = [2.0001, 10.00000001, 5.00, 5.0]
nl_lb_2 = [0.0001, 0.00000001, 0.00, 0.0]
nl_ub_3 = [2.0001, 10.00000001]
nl_lb_3 = [0.0001, -1.10000001]
nl_ub_4 = [2.0001, 10.00000001, 500.00]
nl_lb_4 = [0.0001, 0.00000001, 0.00]

list_models_f = ["NL_Gompertz", "NL_Bertalanffy", "NL_exponential", "NL_logistic"]
list_lb = [nl_lb_1, nl_lb_2, nl_lb_3, nl_lb_4]
list_ub = [nl_ub_1, nl_ub_2, nl_ub_3, nl_ub_4]

K = fit_NL_segmentation_file(
    "", # label of the experiment
    path_to_data, # path to the folder to analyze
    path_to_annotation, # path to the annotation of the wells
    list_models_f, # NL models to use
    list_lb, # lower bounds of the parameters
    list_ub, # upper bounds of the parameters
    4; # number of change points
    method_of_fitting="MCMC",
    nrep=50,
    list_u0=list_lb .+ (list_ub .- list_lb) ./ 2, # initial parameter guess
    optmizator=BBO_adaptive_de_rand_1_bin_radiuslimited(), # optimization method
    path_to_results="NA", # path where to save results
    path_to_plot="NA", # path where to save plots
    loss_type="RE", # type of loss used
    smoothing=true, # smooth the data
    type_of_smoothing="lowess",
    display_plots=true, # display plots in Julia or not
    save_plots=false,
    write_res=false, # write results
    pt_avg=1, # number of points for the rolling-average smoothing
    pt_smooth_derivative=0, # number of points for the smoothed derivative
    do_blank_subtraction="avg_blank", # how to use blanks ("NO", "avg_subtraction", "time_avg")
    avg_replicate=false, # if true, the average of the replicates is fitted; if false, all replicates are fitted independently
    correct_negative="thr_correction", # "thr_correction" applies a threshold to the minimum value of the blank-subtracted data; "blank_correction" uses the blank distribution to impute negative values
    thr_negative=0.01, # used only if correct_negative == "thr_correction"
    multiple_scattering_correction=false, # if true, uses the given calibration curve to correct the data
    method_multiple_scattering_correction="interpolation",
    calibration_OD_curve="NA", # path to the calibration curve used to correct the data
    PopulationSize=100,
    maxiters=2000000,
    abstol=0.00001,
    thr_lowess=0.05,
    dectect_number_cdp=false,
    fixed_cpd=true,
    penality_CI=4.0,
    beta_smoothing_ms=2.0,
    verbose=false,
    win_size=14, # number of points of the change-point-detection sliding window
)
```
What I am saying is that I do not know whether it is worth focusing so much on dataset LG110 for the required tests; I am not saying we should skip the tests. This kind of behaviour is very far from any model. To run stability and inference tests across the different methods, would it not be better to choose a plate with less "strange" behaviour? Otherwise we will encounter more bad performances or unstable situations. BTW, I prepared all the scripts for the tests and we can change dataset very easily.
E.g., LG119 is better.
In general, what I mean is: I have to create a benchmark/comparison of all the methods on synthetic data and real data. Would it not be better to present a good benchmark? Using data like LG110 brings more difficulty.
Sure, there is no need to benchmark the whole thing on all datasets (it is a validation of the code). Use a dataset you like, the one from the Shaknovich paper or LG119. LG110 is good for understanding very nonstandard things (we might even show an example of how the code does in these cases, because no other code that fits microbial growth covers this). But to benchmark, go with one dataset you like.
I like LG119. But how is the continuity being enforced? It looks like it's fitting segment-wise and plotting without continuity (from the pics above).
In a while I will add the segmentation tests required on dataset LG119 to the other repo.
The continuity for the NL fit is enforced backward with a multi-objective loss: loss = loss_fit + penalty * loss_boundary.
We discussed it; it is the easiest way to implement it, but I think it is sub-optimal. Also, for the moment I have no idea how to improve it in general for any possible function.
Here are my comments from while I was developing it; there are also various examples of the correct initialization of the NL fit given an ODE simulation (https://github.com/pinheiroGroup/JMAKi.jl/issues/4#issuecomment-1945806394).
By using this loss = loss_fit + penalty * loss_boundary, are you adding something like an infinite penalty if the values of the fitted left and right functions (around a segment boundary) are different?
It is not infinite, otherwise convergence is practically impossible (that case corresponds to penalty = ∞), because loss_boundary is never exactly 0.0.
I add a fake point on only one boundary of each segment (for segments in the middle): the last point of the previous fit. The loss is:

loss = loss_fit + loss_boundary, with loss_boundary = error_fake_point * n_data / input_of_user

In this way, input_of_user = 2 means that the error on this single fake point weighs half of the total dataset/segment: its residual is multiplied by n_data / 2, so it counts as much as half of the segment's points. A minimal sketch of this scheme is given below.
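This is not JMAKi's actual internals, just a minimal sketch of the scheme described above; `model`, `t_seg`, `y_seg`, `t_fake`, `y_fake`, and `input_of_user` are hypothetical names for the segment data, the single fake boundary point, and the user weight:

```julia
# Hypothetical sketch of the boundary-penalized loss described above.
# `model(t, p)` is any NL growth model; `t_seg`/`y_seg` are the segment data;
# (`t_fake`, `y_fake`) is the single fake point taken from the last point of
# the previously fitted segment; `input_of_user` is the weighting knob.
function segment_loss(model, p, t_seg, y_seg, t_fake, y_fake, input_of_user)
    n_data = length(t_seg)
    # ordinary fit loss on the segment (a plain sum of squared residuals here)
    loss_fit = sum(abs2, model.(t_seg, Ref(p)) .- y_seg)
    # squared error on the single fake boundary point
    error_fake_point = abs2(model(t_fake, p) - y_fake)
    # scaled so that input_of_user = 2 makes the fake point weigh
    # as much as half of the segment's data
    loss_boundary = error_fake_point * n_data / input_of_user
    return loss_fit + loss_boundary
end

# e.g., with a logistic model:
logistic(t, p) = p[1] / (1 + exp(-p[2] * (t - p[3])))
t_seg = collect(0.0:0.5:10.0)
y_seg = logistic.(t_seg, Ref([1.0, 1.2, 5.0]))
segment_loss(logistic, [0.9, 1.0, 4.5], t_seg, y_seg, 10.5, 1.0, 2.0)
```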
In the example before I was testing the overfitting; if you tune input_of_user you can get good continuity. In the other issue there are some examples with a very high penalty, where you get a practically flat line, and also with a "good" penalty.
This is sub-optimal
I do not know better solutions. E.g., is it possible to write this constraint as an inequality and use optimization methods that can take such input?
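For what it's worth, one possible direction (a sketch under assumptions, not tested against JMAKi's internals): Optimization.jl accepts general constraints through the `cons` keyword of `OptimizationFunction` together with `lcons`/`ucons` bounds on `OptimizationProblem`, and a constrained solver such as Ipopt (via OptimizationMOI) can handle them. Here the boundary mismatch is written as the inequality `-tol <= model(t_b, p) - y_b <= tol` instead of a penalty; `logistic`, `t_b`, `y_b`, and `tol` are hypothetical placeholders:

```julia
using Optimization, OptimizationMOI, Ipopt, ForwardDiff

# Hypothetical example: fit a logistic segment while constraining its value
# at the boundary t_b to stay within ±tol of the neighbouring segment's endpoint y_b.
logistic(t, p) = p[1] / (1 + exp(-p[2] * (t - p[3])))
t_seg = collect(0.0:0.5:10.0)
y_seg = logistic.(t_seg, Ref([1.0, 1.2, 5.0]))
t_b, y_b = 10.5, 1.0   # boundary point taken from the neighbouring fit
tol = 0.01             # allowed mismatch at the boundary

obj(p, _) = sum(abs2, logistic.(t_seg, Ref(p)) .- y_seg)  # plain fit loss
cons!(res, p, _) = (res[1] = logistic(t_b, p) - y_b)      # boundary residual

optf = OptimizationFunction(obj, Optimization.AutoForwardDiff(); cons = cons!)
prob = OptimizationProblem(optf, [0.9, 1.0, 4.5]; lcons = [-tol], ucons = [tol])
sol = solve(prob, Ipopt.Optimizer())
```

Whether this behaves better than the penalty for arbitrary NL models is exactly the open question, but it would at least remove the need to tune `input_of_user`.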
Examples: ODE (fitting forward in segments; the boundary is naturally enforced)
NL (fitting backward in segments), using initialization from the ODE solution
Low penalty for the boundary
High penalty for the boundary
Closing; this discussion will continue in the dedicated repository.
For the moment I am using experiment LG110.
I understand that we want JMAKi working in the worst-case scenario, but it really is a lot of work to tune the algorithm to handle such data. Would it not be better to use another dataset?