uncscode / particula

a simple, fast, and powerful particle simulator
https://uncscode.github.io/particula
MIT License

Refactor ML size distribution fitting #486

Open · Gorkowski opened 1 month ago

Gorkowski commented 1 month ago

issue (complexity): Consider simplifying the code structure and exploring alternative approaches to the neural network implementation.

While the implementation of a neural network for this curve-fitting problem is thorough, it introduces significant complexity. Consider the following suggestions to improve maintainability and potentially simplify the approach:

**Simplify the code structure:** Break down large functions like `generate_simulated_data` into smaller, focused functions, and use type hints consistently to improve readability and catch potential errors:

```python
import numpy as np
from numpy.typing import NDArray


def generate_mode_indices(
    total_number: int, num_modes: int, max_index: int
) -> NDArray[np.int64]:
    """Sample sorted mode indices for each simulated distribution."""
    mode_indices = np.random.randint(0, max_index - 1, [total_number, num_modes])
    return np.sort(mode_indices, axis=1)


def generate_gsds(
    total_number: int, num_modes: int, lower_bound: float, upper_bound: float
) -> NDArray[np.float64]:
    """Sample sorted geometric standard deviations for each distribution."""
    gsds = np.random.uniform(lower_bound, upper_bound, [total_number, num_modes])
    return np.sort(gsds, axis=1)
```
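For instance, a quick smoke test of the helpers above (the argument values are illustrative; run alongside the definitions):

```python
# Each helper returns a [total_number, num_modes] array, sorted along axis 1.
indices = generate_mode_indices(total_number=100, num_modes=2, max_index=250)
gsds = generate_gsds(total_number=100, num_modes=2, lower_bound=1.1, upper_bound=2.0)
assert indices.shape == gsds.shape == (100, 2)
```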

**Consider alternative approaches:** Evaluate whether a simpler statistical method or curve-fitting algorithm could achieve similar results. For example, you might explore scipy's `curve_fit` with a custom multi-modal lognormal function (skeleton below; a fleshed-out sketch follows it):

```python
from scipy.optimize import curve_fit


def bimodal_lognormal(x, *params):
    # Implement bimodal lognormal function
    pass


popt, _ = curve_fit(bimodal_lognormal, x_data, y_data, p0=initial_guess)
```
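As a minimal self-contained sketch of that idea (the mode/GSD/number parameterization, the synthetic `x_data`/`y_data`, and all numeric values here are illustrative assumptions, not particula's API):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import lognorm


def bimodal_lognormal(x, mode1, gsd1, n1, mode2, gsd2, n2):
    """Sum of two lognormal PDFs, each scaled by a number concentration."""

    def single_mode(x, mode, gsd, number):
        sigma = np.log(gsd)
        # For a lognormal, median = mode * exp(sigma**2).
        median = mode * np.exp(sigma**2)
        return number * lognorm.pdf(x, s=sigma, scale=median)

    return single_mode(x, mode1, gsd1, n1) + single_mode(x, mode2, gsd2, n2)


# Illustrative synthetic data on a log-spaced diameter grid (meters).
x_data = np.logspace(-8, -6, 100)
true_params = (5e-8, 1.4, 1e3, 2e-7, 1.6, 5e2)
rng = np.random.default_rng(42)
y_data = bimodal_lognormal(x_data, *true_params) * rng.normal(1.0, 0.05, x_data.size)

initial_guess = (4e-8, 1.3, 8e2, 3e-7, 1.5, 4e2)
popt, _ = curve_fit(bimodal_lognormal, x_data, y_data, p0=initial_guess, maxfev=10_000)
```

If a fit like this proves robust on the simulated data, it could replace the network outright, or at least supply it with a strong initial guess.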

**If ML is necessary, simplify the model:** Consider a simpler model architecture or a different ML approach such as Random Forests, which might be easier to interpret and maintain (see the sketch below):

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
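A rough end-to-end sketch of that route, assuming each training row pairs a sampled concentration PDF on a fixed size grid with its flattened mode/GSD/number parameters (all names, shapes, and data here are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder training data: rows of X are concentration PDFs sampled on a
# fixed 50-point size grid; rows of y are flattened two-mode parameters
# (mode1, mode2, gsd1, gsd2, n1, n2). Real data would come from the
# simulated-distribution generator.
rng = np.random.default_rng(42)
X = rng.random((1000, 50))
y = rng.random((1000, 6))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # scikit-learn handles multi-output regression natively
print("held-out R^2:", model.score(X_test, y_test))
```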

**Improve documentation:** Add more context about why this complex ML approach is needed, and document the expected input and output formats clearly:

```python
from typing import Tuple

import numpy as np
from numpy.typing import NDArray


def lognormal_2mode_ml_guess(
    logspace_x: NDArray[np.float64],
    concentration_pdf: NDArray[np.float64],
) -> Tuple[NDArray[np.float64], NDArray[np.float64], NDArray[np.float64]]:
    """Predict lognormal distribution parameters using a pre-trained ML model.

    This complex approach is necessary due to [explain reasons here].

    Args:
        logspace_x: Array of particle sizes in log space.
        concentration_pdf: Probability density function of particle
            concentrations.

    Returns:
        Tuple containing:
        - Predicted mode values
        - Predicted geometric standard deviations
        - Predicted number of particles
    """
    # Implementation...
```
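A short usage example alongside the docstring would make the expected shapes concrete; hypothetically (the input arrays below are placeholders):

```python
import numpy as np

# Hypothetical inputs matching the documented signature.
logspace_x = np.logspace(-9, -6, 250)  # particle sizes in log space
concentration_pdf = np.exp(-np.linspace(-2.0, 2.0, 250) ** 2)  # placeholder PDF

modes, gsds, n_particles = lognormal_2mode_ml_guess(logspace_x, concentration_pdf)
```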

These changes would make the code more maintainable and easier to understand while preserving the ML-based approach if it's truly necessary.

Gorkowski commented 1 month ago

https://github.com/uncscode/particula/pull/474#discussion_r1787068047