marchtaylor / sinkr

A collection of functions with emphasis on multivariate methods and handling of geographic datasets
GNU General Public License v3.0

DINEOF to Satellite-Derived Chlorophyll-a #11

Closed martinrivarosa closed 1 year ago

martinrivarosa commented 1 year ago

Hello,

First of all, congratulations on the package; it is very useful for my work. I am a biologist and I work with chlorophyll concentration.

My question: I use the dineof function to fill the gaps in a time series of satellite images of chlorophyll concentration. The process works fine. However, I noticed that when I run it again, it returns a different number of EOFs than before (as if it were chosen at random). I think this happens when the function is called without specifying any arguments (specifically "ref.pos", which is then chosen at random). I don't know how to build the "ref.pos" vector of non-gap reference positions. How should I choose "n.max" (the number of EOFs that the DINEOF method should use) and "ref.pos"?

On the other hand, I understand that the lower the value of "delta.rms" the better the resolution of the interpolation: am I correct? What is the minimum value for "delta.rms" that can be chosen?

Thank you very much for your help, regards, Martin.

marchtaylor commented 1 year ago

Hello Martin, Thanks for your kind comments regarding the package.

Yes, unless you set a random seed (e.g. set.seed(1111)) before calling dineof, you will indeed get different results each time, because a different ref.pos vector is randomly generated on each call. I don't have any good recommendations for choosing this manually. I suppose there is an argument for making sure the reference positions are well distributed throughout the observed values of your dataset (analogous to cross-validation folds). If you have a large dataset, I don't imagine this will have much influence though. An alternative approach that you could consider is to run dineof several times and use an averaged value for the interpolated gaps.

Regarding the setting for n.max - I would leave this alone (i.e. search the full set of patterns) unless you are running up against computation time and don't want to deal with the later, less important PCs.

delta.rms is somewhat dependent on the units of your dataset, and thus it's a bit of a subjective choice. It is a convergence threshold: the algorithm will simply stop earlier if you set it higher and iterate longer if you set it lower, so the choice depends on how precisely you want the patterns to fit the observed data.
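A minimal sketch of the two ideas above (fixing the seed, and averaging several runs), in R. The toy matrix is synthetic and `average_fills()` is a hypothetical helper, not part of sinkr. The calls assume that `sinkr::dineof()` returns a list with the filled matrix in `Xa` and the chosen number of EOFs in `n.eof`; please check `?dineof` for your installed version. The sinkr calls are skipped if the package is not installed.

```r
# Toy matrix with gaps (rows = time steps, columns = pixels); values are synthetic.
set.seed(1)
tt <- seq(0, 2 * pi, length.out = 40)
X  <- outer(sin(tt), seq(1, 2, length.out = 25)) +
  matrix(rnorm(40 * 25, sd = 0.05), nrow = 40)
Xo <- X
Xo[sample(length(Xo), 200)] <- NA   # knock out 20% of the values

# Hypothetical helper: run a gap-filling function once per seed and
# average the resulting matrices element-wise.
average_fills <- function(Xo, fill_fun, seeds) {
  fills <- lapply(seeds, function(s) {
    set.seed(s)        # fixes any random draw made inside fill_fun
    fill_fun(Xo)
  })
  Reduce(`+`, fills) / length(fills)
}

if (requireNamespace("sinkr", quietly = TRUE)) {
  # Same seed before each call, so the same random ref.pos is drawn.
  set.seed(1111)
  fit1 <- sinkr::dineof(Xo)
  set.seed(1111)
  fit2 <- sinkr::dineof(Xo)
  print(identical(fit1$n.eof, fit2$n.eof))

  # Average of five reconstructions, each with its own seed (assumes $Xa).
  Xa_mean <- average_fills(Xo, function(m) sinkr::dineof(m)$Xa, seeds = 1:5)
}
```

Averaging trades a single arbitrary choice of reference positions for several, at the cost of running the interpolation once per seed.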

One more important aspect when using dineof (or EOF/PCA) on chlorophyll data - your data is likely to be quite skewed due to lots of low values and the lack of negative concentrations, and thus you may want to log-transform the concentrations beforehand (and back-transform afterwards).
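As a rough illustration of the transform and back-transform, here is a sketch in R using synthetic, right-skewed values (not real chlorophyll data). The choice of base 10 is arbitrary, and the `$Xa` element of the `sinkr::dineof()` result is an assumption to verify against `?dineof`. Note that exact zeros would become `-Inf` under `log10()`, so they would need separate handling.

```r
# Synthetic, right-skewed "chlorophyll" values with gaps; not real data.
set.seed(42)
chl <- matrix(rlnorm(30 * 20, meanlog = -1, sdlog = 1), nrow = 30)
chl[sample(length(chl), 100)] <- NA

# Log-transform before the analysis; NA stays NA, so gaps remain identifiable.
chl_log <- log10(chl)

# Gap filling on the log scale (only runs if sinkr is installed).
if (requireNamespace("sinkr", quietly = TRUE)) {
  filled_log <- sinkr::dineof(chl_log)$Xa
} else {
  filled_log <- chl_log   # placeholder so the back-transform below still runs
}

# Back-transform to concentration units; 10^x is always positive.
chl_filled <- 10^filled_log
```

Because the interpolation happens on the log scale, the back-transformed values cannot be negative.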

Hope that helps, Marc

martinrivarosa commented 1 year ago

Marc,

Thank you very much for the quick answer. It was a great help to me.

Regards, Martin.



martinrivarosa commented 1 year ago

Hello Marc,

Thank you very much for your reply. It cleared up one of my doubts. I wanted to ask you if my interpretation of the steps to run dineof is correct:

1) Transform the satellite images of chlorophyll that make up the time series into a matrix.
2) The temporal and spatial average should then be subtracted from the chlorophyll values. This is the step that I do not understand. Is it necessary to calculate the mean of each pixel and subtract it from all the values of that pixel in the time series?
3) The value 0 must be assigned to the pixels with missing (NA) data. If I do this, how does the script identify the missing values to fill them in?
4) The values must be log-transformed (for example with a base 10 logarithm).
5) Finally, run dineof from the sinkr package.
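For steps 1 and 4, a minimal base R sketch, assuming the images are held in a hypothetical 3-D array indexed `[x, y, time]` and that the matrix is laid out with one row per date and one column per pixel. The data are synthetic. The gaps are left as `NA` here on the assumption that dineof locates missing values by their `NA` status; the centering in step 2 is not shown because the thread does not settle it.

```r
# Hypothetical stack of 12 images, each 4 x 5 pixels, stored as [x, y, time].
set.seed(7)
nx <- 4; ny <- 5; nt <- 12
img <- array(rlnorm(nx * ny * nt), dim = c(nx, ny, nt))
img[sample(length(img), 40)] <- NA   # cloud gaps stay NA

# Step 1: one row per date, one column per pixel.
Xo <- t(matrix(img, nrow = nx * ny, ncol = nt))
dim(Xo)

# Positions of the gaps, found from their NA status.
gaps <- which(is.na(Xo))

# Step 4: log10 transform (NA is preserved).
Xo_log <- log10(Xo)

# A matrix of this shape maps back to the image stack like this:
img_back <- array(t(Xo), dim = c(nx, ny, nt))
```

The same reshaping applied to a gap-filled matrix (after back-transforming) would return the filled images in their original layout.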

Thank you very much, best regards, Martin.

