pgf-tikz / pgfplots

pgfplots - A TeX package to draw normal and/or logarithmic plots directly in TeX in two and three dimensions with a user-friendly interface and pgfplotstable - a TeX package to round and format numerical tables. Examples in manuals and/or on web site.
http://pgfplots.sourceforge.net/
200 stars 35 forks source link

[Feature Request] More kinds of regression / best-fit curve than just linear regression #351

Open JasonGross opened 4 years ago

JasonGross commented 4 years ago

Brief outline of the proposed feature

It seems nice to be able to fit data to something other than just a line. (Note that the log mode for linear regression isn't really satisfactory, because it uses the wrong weighting, resulting in under-weighing the larger points.) While this may be out-of-scope for doing inline in TeX, it seems relatively easy to add support for quadratic and exponential fits using the gnuplot feature of PGF.

Usage example

Here is some code that demonstrates how I'm currently achieving this:

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage{tikz}
\usepackage{pgfplots}
\pgfplotsset{compat=1.17}
\usepackage{gnuplottex}
\usepackage{pgfkeys}
\usepackage{pgfplotstable}
\usepackage{xparse}
\usepackage{xstring}
\makeatletter
% https://tex.stackexchange.com/a/50113/2066
\newcommand*{\IsInteger}[3]{%
    \IfStrEq{#1}{ }{%
        #3% is a blank string
    }{%
    \IfInteger{#1}{#2}{#3}%
}%
}%
\newcommand{\pgftognucolumnset}[2]{%
    \IsInteger{\pgfkeysvalueof{#1}}{%
        % pgf 0-indexes columns, while gnuplot 1-indexes columns, so we add 1 to adjust
        \edef#2{\the\numexpr\pgfkeysvalueof{#1}+1\relax}%
    }{%
    \edef#2{(column("\pgfkeysvalueof{#1}"))}%
}%
}
\makeatletter
% \addplotexponentialregression[params for \addplot][default settings for a and b, also for x and y columns]{table file}
\NewDocumentCommand{\addplotexponentialregression}{ O{no markers} o m}{%
    \pgfkeyssetvalue{/addplotexponentialregression/x}{0}
    \pgfkeyssetvalue{/addplotexponentialregression/y}{1}
    \pgfkeyssetvalue{/addplotexponentialregression/a}{1}
    \pgfkeyssetvalue{/addplotexponentialregression/b}{1}
    \pgfkeys{/addplotexponentialregression/.cd,#2}
    \pgftognucolumnset{/addplotexponentialregression/x}{\@addplotexponentialregression@colx}%
    \pgftognucolumnset{/addplotexponentialregression/y}{\@addplotexponentialregression@coly}%
    \edef\@addplotexponentialregression@inita{\pgfkeysvalueof{/addplotexponentialregression/a}}%
    \edef\@addplotexponentialregression@initb{\pgfkeysvalueof{/addplotexponentialregression/b}}%
    \addplot [#1] gnuplot [raw gnuplot] { % allows arbitrary gnuplot commands
        f(x) = a*exp(b*x);     % Define the function to fit
        % Set reasonable starting values here
        a=\@addplotexponentialregression@inita;
        b=\@addplotexponentialregression@initb;
        fit f(x) '#3' u \@addplotexponentialregression@colx:\@addplotexponentialregression@coly\space via a,b; % Select the file
        stats '#3' u \@addplotexponentialregression@colx;
        plot [x=STATS_min:STATS_max] f(x);    % Specify the range to plot
        set print "#3-parameters.dat";  % Open a file to save the parameters
        print a, b;                  % Write the parameters to file
    };
    \pgfplotstableread{#3-parameters.dat}\parameters % Open the file Gnuplot wrote
    \pgfplotstablegetelem{0}{0}\of\parameters \xdef\pgfplotstableregressiona{\pgfplotsretval} % Get first element, save into \pgfplotstableregressiona
    \pgfplotstablegetelem{0}{1}\of\parameters \xdef\pgfplotstableregressionb{\pgfplotsretval}
}
% \addplotquadraticregression[params for \addplot][default settings for a and b and c, also for x and y columns]{table file}
\NewDocumentCommand{\addplotquadraticregression}{ O{no markers} o m}{%
    \pgfkeyssetvalue{/addplotquadraticregression/x}{0}
    \pgfkeyssetvalue{/addplotquadraticregression/y}{1}
    \pgfkeyssetvalue{/addplotquadraticregression/a}{1}
    \pgfkeyssetvalue{/addplotquadraticregression/b}{1}
    \pgfkeyssetvalue{/addplotquadraticregression/c}{1}
    \pgfkeys{/addplotquadraticregression/.cd,#2}
    \pgftognucolumnset{/addplotquadraticregression/x}{\@addplotquadraticregression@colx}%
    \pgftognucolumnset{/addplotquadraticregression/y}{\@addplotquadraticregression@coly}%
    \edef\@addplotquadraticregression@inita{\pgfkeysvalueof{/addplotquadraticregression/a}}%
    \edef\@addplotquadraticregression@initb{\pgfkeysvalueof{/addplotquadraticregression/b}}%
    \edef\@addplotquadraticregression@initc{\pgfkeysvalueof{/addplotquadraticregression/c}}%
    \addplot [#1] gnuplot [raw gnuplot] { % allows arbitrary gnuplot commands
        f(x) = a*x**2+b*x+c;     % Define the function to fit
        % Set reasonable starting values here
        a=\@addplotquadraticregression@inita;
        b=\@addplotquadraticregression@initb;
        c=\@addplotquadraticregression@initc;
        fit f(x) '#3' u \@addplotquadraticregression@colx:\@addplotquadraticregression@coly\space via a,b,c; % Select the file
        stats '#3' u \@addplotquadraticregression@colx;
        plot [x=STATS_min:STATS_max] f(x);    % Specify the range to plot
        set print "#3-parameters.dat";  % Open a file to save the parameters
        print a, b, c;                  % Write the parameters to file
    };
    \pgfplotstableread{#3-parameters.dat}\parameters % Open the file Gnuplot wrote
    \pgfplotstablegetelem{0}{0}\of\parameters \xdef\pgfplotstableregressiona{\pgfplotsretval} % Get first element, save into \pgfplotstableregressiona
    \pgfplotstablegetelem{0}{1}\of\parameters \xdef\pgfplotstableregressionb{\pgfplotsretval}
    \pgfplotstablegetelem{0}{2}\of\parameters \xdef\pgfplotstableregressionc{\pgfplotsretval}
}
% regress on y = a (x!)
% \addplotfactorialregression[params for \addplot][default settings for a, also for x and y columns]{table file}
\NewDocumentCommand{\addplotfactorialregression}{ O{no markers} o m}{%
    \pgfkeyssetvalue{/addplotquadraticregression/x}{0}
    \pgfkeyssetvalue{/addplotquadraticregression/y}{1}
    \pgfkeyssetvalue{/addplotquadraticregression/a}{1}
    \pgfkeys{/addplotquadraticregression/.cd,#2}
    \pgftognucolumnset{/addplotquadraticregression/x}{\@addplotquadraticregression@colx}%
    \pgftognucolumnset{/addplotquadraticregression/y}{\@addplotquadraticregression@coly}%
    \edef\@addplotquadraticregression@inita{\pgfkeysvalueof{/addplotquadraticregression/a}}%
    \addplot [#1] gnuplot [raw gnuplot] { % allows arbitrary gnuplot commands
        f(x) = a*gamma(x+1);     % Define the function to fit
        % Set reasonable starting values here
        a=\@addplotquadraticregression@inita;
        fit f(x) '#3' u \@addplotquadraticregression@colx:\@addplotquadraticregression@coly\space via a; % Select the file
        stats '#3' u \@addplotquadraticregression@colx;
        plot [x=STATS_min:STATS_max] f(x);    % Specify the range to plot
        set print "#3-parameters.dat";  % Open a file to save the parameters
        print a;                  % Write the parameters to file
    };
    \pgfplotstableread{#3-parameters.dat}\parameters % Open the file Gnuplot wrote
    \pgfplotstablegetelem{0}{0}\of\parameters \xdef\pgfplotstableregressiona{\pgfplotsretval} % Get first element, save into \pgfplotstableregressiona
}
\makeatother
\begin{document}
%python3 -c 'import math, random; fact0 = (lambda fact0, n: n * fact0(fact0, n - 1) if n > 0 else 1); fact = (lambda n: fact0(fact0, n)); print("\n".join(f"{n:<2} {3*n*n+n+1:<10} {2*math.exp(0.5*n):<20} {2*fact(n):<15} {math.exp(n)+int(random.uniform(0,100)):<15}" for n in range(1, 12)))'
\begin{filecontents*}{data.dat}
n  q          e                    f               eapprox
1  5          3.2974425414002564   2               97.71828182845904
2  15         5.43656365691809     4               101.38905609893065
3  31         8.963378140676129    12              87.08553692318768
4  53         14.7781121978613     48              92.59815003314424
5  81         24.364987921406946   240             194.4131591025766
6  115        40.171073846375336   1440            410.4287934927351
7  155        66.23090391738462    10080           1174.6331584284585
8  201        109.19630006628847   80640           3032.9579870417283
9  253        180.03426260104362   725760          8198.083927575384
10 311        296.8263182051532    7257600         22124.465794806718
11 375        489.38386452844077   79833600        59973.14171519782
\end{filecontents*}
\begin{tikzpicture}
    \begin{axis}[xlabel=$x$,ylabel=$y$,legend pos=north west,axis lines=left]
        \addplot[only marks,color=black] table[x=n,y=q]{data.dat};
        \addplotquadraticregression[no markers,black][x=n,y=q]{data.dat};
        \addlegendentry{data}
        \addlegendentry{
            $\pgfmathprintnumber{\pgfplotstableregressiona}x^2
            \pgfmathprintnumber[print sign]{\pgfplotstableregressionb}x
            \pgfmathprintnumber[print sign]{\pgfplotstableregressionc}$
        }
    \end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[xlabel=$x$,ylabel=$y$,legend pos=north west,axis lines=left]
\addplot[only marks,color=black] table[x=n,y=e]{data.dat};
\addplotexponentialregression[no markers,black][x=n,y=e]{data.dat};
\addlegendentry{data}
\addlegendentry{
    $\pgfmathprintnumber{\pgfplotstableregressiona}e^{\pgfmathprintnumber{\pgfplotstableregressionb}x}$
}
\end{axis}
\end{tikzpicture}
\\
\begin{tikzpicture}
\begin{axis}[xlabel=$x$,ylabel=$y$,legend pos=north west,axis lines=left]
\addplot[only marks,color=black] table[x=n,y=f]{data.dat};
\addplotfactorialregression[no markers,black][x=n,y=f]{data.dat};
\addlegendentry{data}
\addlegendentry{
    $\pgfmathprintnumber{\pgfplotstableregressiona}x!$
}
\end{axis}
\end{tikzpicture}
\\
\begin{tikzpicture}
\begin{axis}[xlabel=$x$,ylabel=$y$,legend pos=north west,axis lines=left]
\addplot[only marks,color=black] table[x=n,y=eapprox]{data.dat};
\addplotexponentialregression[no markers,black][x=n,y=eapprox]{data.dat};
\edef\gnua{\pgfplotstableregressiona}
\edef\gnub{\pgfplotstableregressionb}
\addlegendentry{data}
\addlegendentry{
    $\pgfmathprintnumber{\gnua}e^{\pgfmathprintnumber{\gnub}x}$
}
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[xlabel=$x$,ylabel=$y$,legend pos=north west,axis lines=left,ymode=log]
\addplot[only marks,color=black] table[x=n,y=eapprox]{data.dat};
\addplotexponentialregression[no markers,black][x=n,y=eapprox]{data.dat};
\edef\gnua{\pgfplotstableregressiona}
\edef\gnub{\pgfplotstableregressionb}
\addplot[no markers,red,thick] table [x=n,y={create col/linear regression={y=eapprox,ymode=log}}]
{data.dat};
\edef\ra{\pgfplotstableregressiona}
\edef\rb{\pgfplotstableregressionb}
\addlegendentry{data}
\addlegendentry{
    $\pgfmathprintnumber{\gnua}e^{\pgfmathprintnumber{\gnub}x}$
}
\addlegendentry{
    $\pgfmathprintnumber{\ra}e^{\pgfmathprintnumber{\rb}x}$
}
\end{axis}
\end{tikzpicture}
\end{document}

This creates the following plots: test2-1

(The final log plot shows the inadequacy of the ymode=log argument to linear regressions.)

I'm not sure what the best generalization of this is (should the user be able to specify any gnuplot function and which variables to fit?), but at least the quadratic and exponential regressions would be quite useful to have as syntax via create col/exponential regression and create col/quadratic regression.

(Presumably the code I wrote would need to be cleaned up somewhat to use more standard file names for the parameters (maybe based on the id key?), to emit the appropriate warning/error when gnuplot isn't available, and to be accessible via create col rather than as their own commands.

Mo-Gul commented 4 years ago

My guess is that this won't be implemented, because it relies on gnuplot -- in contrast to the linear regression. If at least there would be solution using Lua than there might be a chance, because Lua(La)TeX should be available on most of the installations today.

(Maybe there is also a possibility to use xfp package to do the job?)

hmenke commented 4 years ago

My guess is that this won't be implemented, because nowadays Christian only reacts when I directly send him a ready-to-merge patch and remind him several times. Remember the circle disaster? My PR #347 is also open without any comment for over a month now.

JasonGross commented 4 years ago

@Mo-Gul There should be a way to do the regression in lua/xfp. I just figured that since there's some support for gnuplot integrated into pgfplots, it'd be reasonable to add this support conditional on gnuplots being around. In any case, I think the log-mode issue with linear regression is a bug, and can probably be fixed by adjusting the default variances to log-scale when in log-mode. Should I open a separate report for this?

@hmenke Ah, that's unfortunate. Maybe I should just make my own package that builds on top of gnuplottex and pgfplots? (In any case, I'd appreciate help in figuring out how to extend create col so that the syntax for these is more in line with syntax for linear regression.)

Mo-Gul commented 4 years ago

If there isn't already an issue filed for this (I didn't check it) then you should add a new/separate for this, yes please.

I would be a fan of adding additional regression functions as well. But I am just a user of LaTeX and can't program anything myself. But if someone would do it, I'll be happy to test and proofe-read the documentation. If possible, I suggest that you try to implement this -- as already stated in my previous comment -- in Lua and/or xfp because this is available "by default" in newer TeX installations and natively supported. Gnuplot on the other hand needs a LaTeX run with --shell-escape enabled and other "problems"/prerequisites.

Of course you also create your own package where you can do whatever you want, but I think "visibility" for these features will be much higher, if they are shipped (as library) with PGFPlots directly instead of an additional package.

JasonGross commented 4 years ago

Separate issue reported at #352

JasonGross commented 4 years ago

I almost have code that is sufficiently working that I could make my own package (reimplementation of the fitting algorithm with xfp or lua will have to wait for when I have more free time). Currently I'm just blocked on some styling issues of \addplot: https://tex.stackexchange.com/questions/568875/how-do-i-get-all-options-to-addplot-to-propagate-to-an-inner-addplot-executed