tildekara / Raman-spectroscopy---data-analysis-in-Python-and-autonatic-export-to-txt-files

Python script for analysing Raman spectra: it fits a Lorentz curve to a chosen mode, using a genetic algorithm to find the initial parameters, and automatically exports the results to a txt file. This example was made to analyse single-point, multiple-acquisition measurement data from carbon nanotubes.

Initial parameters hard-coded, stochastic search alternatives #1

Closed zunzun closed 7 years ago

zunzun commented 7 years ago

I see in your code that you have hard-coded lists of initial parameters for the scipy.optimize non-linear solver used in curve_fit. I have had good success using a genetic algorithm named Differential Evolution to find initial parameter values, and have seen others use Markov Chain Monte Carlo (MCMC) for the same purpose. Note that not all data points need to be used for this purpose - only max(X), max(Y), min(X), min(Y) and several other evenly spaced data points are needed, where the total number of data points used to find initial parameter values equals the total number of equation parameters plus a few extra points. Reducing the size of the data set used for stochastic initial parameter search in this way greatly improves performance by reducing the required number of calculations.
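The data-reduction idea above could be sketched roughly as follows. This is only an illustration: the function name `reduce_for_search` and the `n_extra` default are hypothetical and do not come from either repository.

```python
import numpy as np


def reduce_for_search(x, y, n_params, n_extra=3):
    """Pick a small subset of points for a stochastic initial-parameter search.

    Keeps the points at min(X), max(X), min(Y) and max(Y), then adds roughly
    evenly spaced points, aiming for n_params + n_extra points in total.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = n_params + n_extra

    # Indices of the four extreme points (some of them may coincide).
    extremes = {int(np.argmin(x)), int(np.argmax(x)),
                int(np.argmin(y)), int(np.argmax(y))}

    # Evenly spaced interior indices to make up the remainder.
    n_even = max(target - len(extremes), 0)
    even = np.linspace(0, len(x) - 1, n_even + 2)[1:-1].round().astype(int)

    # np.unique sorts and removes duplicates, so the subset can end up slightly
    # smaller than the target if an even point coincides with an extreme.
    idx = np.unique(np.concatenate([np.array(sorted(extremes), dtype=int), even]))
    return x[idx], y[idx]


if __name__ == "__main__":
    x = np.linspace(0.0, 10.0, 101)
    y = 1.0 / (1.0 + (x - 4.0) ** 2)
    x_small, y_small = reduce_for_search(x, y, n_params=8)
    print(len(x), "points reduced to", len(x_small))
```

The reduced arrays would be passed to the stochastic search only; the final curve_fit refinement would still use the full data set.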

tildekara commented 7 years ago

Hello! First of all, let me thank you for taking the time to suggest improvements to my code. It's the very first issue raised on my very first "functional" program, which I use whenever I need to perform data analysis after Raman spectroscopy measurements. This script was created in a rush as I needed to work on the data for my thesis, and I'm still fairly new to Python, so I worked with the simplest solutions I could find. I did encounter a problem with choosing the right initial parameters, and for other materials I needed to change them during the run of the script, which was very inconvenient (it crashes whenever it can't find optimal parameters). So I completely agree that it could use some improvements, and your suggestion to use this genetic algorithm could solve some of the problems. I'll try to work on it. However, I'm still a student, so I have to prepare for exams and will have too little time to implement it anytime soon, I guess. So if you have any more suggestions or examples of similar code fixed this way, I'd be glad to see them; they would make it easier for me to work on it and save me some time. In the meantime I'll mark this issue as "help wanted" until I'm able to work on it myself after the exams. :-) Thanks again! Greetings, Karolina

zunzun commented 7 years ago

If you would add to your code repository an example data file to be fit, plus the expected final parameters for testing and verification, I can cut-and-paste from my existing code and knock this out in a few days.

tildekara commented 7 years ago

I added 3 random data files.

zunzun commented 7 years ago

Excellent. Work is in progress; I will keep you updated.

zunzun commented 7 years ago

I just read and ran your code with the example data. This looks like it will be easy as pie, very similar to other work I have done. I should be done by tomorrow.

tildekara commented 7 years ago

Good to know! I'm looking forward to seeing the results :)

zunzun commented 7 years ago

The work was easy to perform, the results disappointing.

To verify my code, I first ran a test with the same initial parameters that you used. I received the same results you did, which means my coding of the equation was functionally correct.

I then ran with my standard genetic algorithm code alone. The results were very poor.

My next test used a very, very large genetic algorithm population size. This took a very long time to run. Again, the results were poor.

My conclusion is that this equation is extremely sensitive to initial parameter values, and that an enormous amount of computation time would be required for stochastic methods for this equation on these data sets.

My only idea is to split the data into two separate peaks, fit individual Lorentzians to each peak, and then combine those results as the initial parameter values for the Double Lorentzian. My experience is that fitting individual Lorentzians to single peaks is not so sensitive to initial parameter values.
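A rough sketch of that split-and-combine idea is shown below. Everything here is an assumption made for illustration: the original script's equation is not shown in this thread, so the sketch uses a 7-parameter double Lorentzian (height, center and half-width per peak plus one shared offset), and the split position `split_x` is supplied by hand.

```python
import numpy as np
from scipy.optimize import curve_fit


def lorentzian(x, height, center, width, offset):
    """Single Lorentzian peak (height, half-width) on a constant offset."""
    return height / (1.0 + ((x - center) / width) ** 2) + offset


def double_lorentzian(x, h1, c1, w1, h2, c2, w2, offset):
    """Sum of two Lorentzian peaks sharing one constant offset (assumed form)."""
    return lorentzian(x, h1, c1, w1, 0.0) + lorentzian(x, h2, c2, w2, offset)


def rough_guess(x, y):
    """Crude data-driven starting point for one isolated peak (x ascending)."""
    offset = y.min()
    height = y.max() - offset
    center = x[np.argmax(y)]
    above_half = x[y >= offset + height / 2.0]
    width = max((above_half.max() - above_half.min()) / 2.0, x[1] - x[0])
    return [height, center, width, offset]


def initial_from_split(x, y, split_x):
    """Fit one Lorentzian on each side of split_x and merge into a double-peak p0."""
    left = x < split_x
    p_left, _ = curve_fit(lorentzian, x[left], y[left],
                          p0=rough_guess(x[left], y[left]))
    p_right, _ = curve_fit(lorentzian, x[~left], y[~left],
                           p0=rough_guess(x[~left], y[~left]))
    offset = min(p_left[3], p_right[3])
    return [p_left[0], p_left[1], abs(p_left[2]),
            p_right[0], p_right[1], abs(p_right[2]), offset]


if __name__ == "__main__":
    # Synthetic two-peak spectrum, invented for this demo only.
    x = np.linspace(1200.0, 1700.0, 251)
    y = double_lorentzian(x, 800.0, 1350.0, 25.0, 1000.0, 1590.0, 20.0, 50.0)
    p0 = initial_from_split(x, y, split_x=1470.0)
    fitted, _ = curve_fit(double_lorentzian, x, y, p0=p0)
    print(fitted)
```

The single-peak fits only need a crude starting point, which is why a simple data-driven guess is used for them here.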

zunzun commented 7 years ago

I was wrong in closing this. I had performed unbounded search, which means most of the genetic algorithm's calculations were wasted. For example: in the unbounded search that I performed, any equation parameter had an equal random chance of being positive or negative.

If any individual parameter value could only be positive, half of the genetic algorithm attempts for that parameter will fail. Your equation has 8 parameters. If all must be positive, then only 1/2^8, or 1 of every 256 attempts, can possibly succeed.

Even worse, the "center" parameter for any single peak (x_0 and x_01 in your equation) must be within the maximum and minimum values in the data set. In my unbounded searches, these could take any value and would most often be outside the range of possible values - in fact, in an unbounded search they will almost never be within the data range!

So if a search is only performed within the possible values, such as:
a) within the range of the data
b) only values greater than zero
then a stochastic search would be very likely to succeed. I have reopened this GitHub issue and will try again.
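To illustrate why bounds matter, the sketch below builds search limits from the data and compares how many random candidates are even feasible with and without them. It is not the code from this issue: the 7-parameter layout (height, center, half-width per peak plus one offset) is an assumed stand-in for the original 8-parameter equation, and the "unbounded" search is approximated by a wide symmetric range.

```python
import numpy as np


def bounds_from_data(x, y):
    """(low, high) limits for an assumed 7-parameter double Lorentzian.

    Parameter order: height, center, half-width for each peak, then one offset.
    """
    x_min, x_max = float(np.min(x)), float(np.max(x))
    y_max = float(np.max(y))
    min_width = (x_max - x_min) / len(x)  # roughly one sample spacing, keeps width > 0
    peak = [(0.0, y_max), (x_min, x_max), (min_width, x_max - x_min)]
    return peak + peak + [(0.0, y_max)]


def feasible_fraction(candidates, bounds):
    """Fraction of candidate parameter vectors that lie inside every bound."""
    low = np.array([b[0] for b in bounds])
    high = np.array([b[1] for b in bounds])
    inside = np.all((candidates >= low) & (candidates <= high), axis=1)
    return float(np.mean(inside))


def compare_sampling(x, y, n=100_000, seed=0):
    """Return (feasible fraction unbounded, feasible fraction bounded)."""
    rng = np.random.default_rng(seed)
    bounds = bounds_from_data(x, y)
    low = np.array([b[0] for b in bounds])
    high = np.array([b[1] for b in bounds])
    # Stand-in for an unbounded search: a wide range symmetric around zero.
    span = 10.0 * max(np.abs(x).max(), np.abs(y).max())
    unbounded = rng.uniform(-span, span, size=(n, len(bounds)))
    bounded = rng.uniform(low, high, size=(n, len(bounds)))
    return feasible_fraction(unbounded, bounds), feasible_fraction(bounded, bounds)


if __name__ == "__main__":
    # Synthetic spectrum, invented for this demo only.
    x = np.linspace(1200.0, 1700.0, 251)
    y = 50.0 + 1000.0 / (1.0 + ((x - 1590.0) / 20.0) ** 2)
    print(compare_sampling(x, y))
```

Every bounded candidate is feasible by construction, while almost none of the wide-range candidates are, which matches the 1-in-256 argument made earlier in this comment.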

tildekara commented 7 years ago

I'm glad to hear that there's still something that can be done about it. Good luck!

zunzun commented 7 years ago

I decided to use the University of Quebec's DEAP: Distributed Evolutionary Algorithms in Python, from https://github.com/DEAP, because it:
A) is academic open source code
B) is free of charge
C) can run in parallel for performance on multiple CPU computers
D) comes with Jupyter (ipython) notebook examples
E) comes with several different genetic algorithms already built in

zunzun commented 7 years ago

DEAP has a regression example here: http://deap.readthedocs.io/en/master/examples/gp_symbreg.html which is very easily adapted to this Raman spectroscopy problem.

In your Python code, I will detect the first spectroscopy data set - and for that first data set only I will use the bounded genetic algorithm to find initial parameter estimates to fit the data, and then use those values as the initial parameter estimates for all subsequent curve fitting.
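The control flow of that plan (search once on the first data set, then reuse the result everywhere) might look roughly like this. The names `fit_series` and `search_initial` are hypothetical; `search_initial` stands for whatever expensive search is used, such as a bounded genetic algorithm, and the single-peak `lorentzian` model is only there to keep the demo short.

```python
import numpy as np
from scipy.optimize import curve_fit


def fit_series(datasets, model, search_initial):
    """Run the expensive search once, then reuse its result as p0 for every fit.

    datasets       -- list of (x, y) array pairs
    model          -- callable model(x, *params) as expected by curve_fit
    search_initial -- callable (x, y) -> initial parameter estimates
    """
    first_x, first_y = datasets[0]
    p0 = search_initial(first_x, first_y)  # only the first data set is searched
    results = []
    for x, y in datasets:
        params, _ = curve_fit(model, x, y, p0=p0)
        results.append(params)
    return results


def lorentzian(x, height, center, width, offset):
    """Single Lorentzian peak, used here only as a small demo model."""
    return height / (1.0 + ((x - center) / width) ** 2) + offset


if __name__ == "__main__":
    def cheap_search(x, y):
        # Stand-in for the genetic algorithm: a crude data-driven guess.
        return [y.max() - y.min(), x[np.argmax(y)], 10.0, y.min()]

    x = np.linspace(1500.0, 1680.0, 181)
    spectra = [(x, lorentzian(x, 1000.0, c, 15.0, 50.0))
               for c in (1588.0, 1590.0, 1592.0)]
    for params in fit_series(spectra, lorentzian, cheap_search):
        print(params)
```

This only works well when the acquisitions are similar enough that one set of initial estimates is a reasonable start for all of them, as with repeated measurements at a single point.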

Let's see how well it works!

zunzun commented 7 years ago

Development setup is complete and a GitHub repository has been created to hold the development files. The repository URL is https://github.com/zunzun/RamanSpectroscopyFit so that we can exchange files. If you install DEAP using the (Python 3) command:

pip3 install deap

then you can clone the repository and then run the test file RamanSpectroscopyFit.py from a command line. This file currently curve fits one spectroscopy data set using your original initial parameters and plots the results. I have not started using DEAP yet, but all is ready now for coding and testing to begin. You should be able to write to this repository, if I have set it up correctly.

tildekara commented 7 years ago

I just checked it and it works. Thank you very much! Now I'll be able to get it working with my previous version of my program. It's great because these initial parameters were causing me trouble and it was hard to find them. I guess the issue can be closed now as the only thing left for me is to mix two codes together :) Thank you again!

tildekara commented 7 years ago

I just checked it with different parameters, not that close to the expected ones, and it's great! Seriously, I really appreciate it!

zunzun commented 7 years ago

No, no - this is only the development set up - it is using YOUR original parameters! This is the basis for the future work, and does not yet do anything.

Please reopen the issue, it is not finished yet.

zunzun commented 7 years ago

OK!

zunzun commented 7 years ago

After working with DEAP, I found that it is not useful; the code and documentation are of poor quality.

Recent versions of Python's scipy module contain the Differential Evolution genetic algorithm, which is perfect. Please let me know what version of scipy you use. From a command prompt, here is my laptop's version:

>>> import scipy
>>> scipy.version.version
'0.17.0'

tildekara commented 7 years ago

It's the same for me - 0.17.0.

zunzun commented 7 years ago

Most excellent. Thank you

zunzun commented 7 years ago

Official technical reference from the scipy documentation:

https://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.optimize.differential_evolution.html

zunzun commented 7 years ago

Success - initial code working - I love scipy! Please look through the most recent version of the file RamanSpectroscopyFit.py and verify that it does NOT contain any initial parameters, then run the file from a command line to verify correct operation of the genetic algorithm. It takes 1 minute 5 seconds for the genetic algorithm to run, on my laptop.

Notes:
1) If the initial result looks OK to you, the next step is to make the development code easier for you to use in your work.
2) Please review the parameter bounds. I tried to guess good bounds, but you know this problem domain better than I do. Current bounds are based on the data min and max values.
3) This code has only been tested with a single data set.
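For readers without access to RamanSpectroscopyFit.py, the general approach (bounded Differential Evolution to find starting values, then curve_fit to refine them) can be sketched as below. This is not the code from that repository. The model is an assumed 7-parameter double Lorentzian, the bounds are one plausible reading of "based on the data min and max values", and the synthetic spectrum and function names are invented for the demo.

```python
import numpy as np
from scipy.optimize import curve_fit, differential_evolution


def double_lorentzian(x, h1, c1, w1, h2, c2, w2, offset):
    """Two Lorentzian peaks (height, center, half-width) on a constant offset."""
    peak1 = h1 / (1.0 + ((x - c1) / w1) ** 2)
    peak2 = h2 / (1.0 + ((x - c2) / w2) ** 2)
    return peak1 + peak2 + offset


def sum_of_squared_errors(params, x, y):
    """Objective minimised by the genetic algorithm."""
    return float(np.sum((y - double_lorentzian(x, *params)) ** 2))


def bounds_from_data(x, y):
    """Search limits derived only from the data range (assumed choice)."""
    x_min, x_max = float(np.min(x)), float(np.max(x))
    y_max = float(np.max(y))
    min_width = (x_max - x_min) / len(x)  # keeps the half-width strictly positive
    peak = [(0.0, y_max), (x_min, x_max), (min_width, x_max - x_min)]
    return peak + peak + [(0.0, y_max)]


def fit_without_initial_guess(x, y, seed=3):
    """Bounded Differential Evolution for p0, then curve_fit to refine it."""
    bounds = bounds_from_data(x, y)
    search = differential_evolution(sum_of_squared_errors, bounds,
                                    args=(x, y), seed=seed)
    fitted, _ = curve_fit(double_lorentzian, x, y, p0=search.x)
    return search.x, fitted


def synthetic_spectrum(seed=0):
    """Noisy two-peak test data, invented for this demo only."""
    rng = np.random.default_rng(seed)
    x = np.linspace(1200.0, 1700.0, 251)
    y = double_lorentzian(x, 800.0, 1350.0, 25.0, 1000.0, 1590.0, 20.0, 50.0)
    return x, y + rng.normal(scale=5.0, size=x.size)


if __name__ == "__main__":
    x, y = synthetic_spectrum()
    initial, fitted = fit_without_initial_guess(x, y)
    print("from genetic algorithm:", initial)
    print("after curve_fit:       ", fitted)
```

Note that the two peaks are interchangeable in this model, so the genetic algorithm may return them in either order. The fixed `seed` only makes runs repeatable.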

zunzun commented 7 years ago

When the genetic algorithm runs, some population members can generate warnings when evaluated - these can be safely ignored, so I turned Python warnings off for this part of the code.
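One way to silence warnings for just that part of the code, without turning them off globally, is the standard library's `warnings.catch_warnings` context manager. The sketch below is an assumption about how this could be done, not a copy of the repository code; the helper name `run_quietly` is hypothetical.

```python
import warnings

import numpy as np


def run_quietly(func, *args, **kwargs):
    """Call func with all Python warnings suppressed, then restore the filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)


def overflowing_objective(params):
    """Toy objective that triggers a NumPy overflow RuntimeWarning."""
    return float(np.sum(np.exp(np.asarray(params, dtype=float))))


if __name__ == "__main__":
    # Without run_quietly this call would print "overflow encountered in exp".
    print(run_quietly(overflowing_objective, [1000.0]))
    # A search could be wrapped the same way, for example:
    # run_quietly(differential_evolution, objective, bounds, seed=3)
```

Because the filter change is confined to the `with` block, warnings raised by the later curve_fit refinement remain visible.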

tildekara commented 7 years ago

Thank you! Works like a charm! I'll apply it to my code after I finish my exams in a few days. It'll be really helpful in my work analysing spectroscopic data.

zunzun commented 7 years ago

Excellent news. At some future time you might use my other Python curve fitting and surface fitting open source projects (on GitHub) in your work, they are most useful and the 3D surface plots look cool, too! I have a great deal of multi-national experience in curve fitting and surface fitting, having been a nuclear engineer, radiation physics engineer, and a software engineer. I'm always glad to help, so don't forget your open-source professional colleague across the Atlantic when you have a difficult problem. Remember, if you did not open source your code this would not have happened.

tildekara commented 7 years ago

Thank you! I'll surely check out your other curve fitting projects. I'm really glad to have received your help and to see another physicist engineer interested in such data analysis (although I specialize in nanotechnology, not nuclear physics, so it's also interesting to see what different fields have in common). Thanks again and good luck with your other projects!