josephwright / etoolbox

Tool-box for LaTeX programmers using e-TeX
LaTeX Project Public License v1.3c
41 stars 7 forks source link

Parsers declared with `\DeclareListParser` appear to have quadratic behavior #18

Closed stevecheckoway closed 6 years ago

stevecheckoway commented 6 years ago

I was looking at the implementation of \docsvlist and \forcsvlist and trying to understand the purpose of the \space macros (possibly to prevent stripping off too many layers of braces?) and I noticed that the parsers have running time quadratic in the length of the list.

The issue is the check if there are any more elements in the list starting on lines 1508 and 1524. The \ifblank is going to \detokenize the rest of the list, and then assuming it isn't blank, it has to put the tokens back.

It's slightly difficult to talk about these macros in general, so I'll focus on \docsvlist. If I understand this correctly, \docsvlist{a,b,c} expands to \etb@lst@docsvlist a,b,c,& (where the aforementioned \space is what's causing the extra space after \etb@lst@docsvlist).

Rather than appending ,&, \docsvlist should be able to append something like ,\etb@lst@sentinel,&. Then, \etb@lst@docsvlist can be defined like

\long\def\etb@lst@docsvlist#1,#2,{%
  \etb@listitem\do{#1}%
  \ifx\etb@lst@sentinel#2\empty
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi
    {\listbreak}%
    {\expandafter\etb@lst@docsvlist\space#2,}%
}

This implementation appears to be expandable just like the original, but simpler. The only change is looking for the sentinel rather than checking if the rest of the list is blank.

f.txt is a simple version demonstrating the code and demonstrating the expandable nature of the list.

t.txt computes how long it takes to iterate through a 500-element comma-separated list using both the etoolbox code and mine. It takes the mean of 1000 runs which is probably over kill, but on my (several year old) machine, the mean time for the original code is 2558/2^16 seconds and 70/2^16 seconds for my code.

If you change the number of elements in the list, the quadratic run-time becomes apparent.

I realize that by TeX's very nature, lists are going to take quadratic time for a bunch of operations that should be linear time (or amortized constant time) in a sane programming language. But I don't think iterating over a list is one of them.

josephwright commented 6 years ago

Your analysis is right, but that doesn't necessarily mean that a change is likely.

As you may know, I'm looking after etoolbox as the original author (Philipp Lehman) 'dropped off the radar' some years ago. There were a few changes that were essential, so someone had to pick up the package. What I'm therefore mindful of is that changes need to be in line with Philipp's original ideas. Here, he chose a particular approach, and I'm wary of change.

At the same time, I'm on the LaTeX3 Project, and we have a 'bigger' set of programming tools (expl3). I have to be careful that etoolbox doesn't end up too much as simply 'a second version of the same stuff'. If you look at the (for example) comma list mapping implementation in expl3 is uses exactly the approach you suggest, with a marker, though there are some subtle differences as brace stripping issues are not the same.

I'll need to think about this: the question is not technical but rather one of policy.

stevecheckoway commented 6 years ago

Makes sense.

I glanced quickly at etextools and pgffor to see if I could figure out what their loops' behaviors are but both are inscrutable (for entirely different reasons).

josephwright commented 6 years ago

I'd stay away from etextools: it's not maintained and there are some long-standing issues.

stevecheckoway commented 6 years ago

That's good to know and also too bad since it seems to add some missing functionality around comma-separated lists. It's probably time to bite the bullet and learn expl3.

josephwright commented 6 years ago

@stevecheckoway I'm happy to consider new functionality in etoolbox if it's within the scope of the package. Ideally, the guy behind etextools would adjust his code such that I can move 'core' ideas to etoolbox, but that has not happened to date.

I think on balance I will look at the performance here: I doubt Philipp intended to pick a slow method.

stevecheckoway commented 6 years ago

\etb@defparser@arg has the same quadratic behavior and I think \etb@forlistloop@i does as well, but I'm running out the door and I merely glanced at it just now.

stevecheckoway commented 6 years ago

My implementation idea doesn't work exactly as specified. The problem is that looking for the sentinel causes braces to be stripped. \docsvlist{a,{b,c}} and \docsvlist{a, {b,c}} now have different behavior. The first expands to (essentially) \do{a}\do{b}\do{c} with the new definition and the second to \do{a}\do{b,c}.

The list macros have other brace issues, namely \docsvlist{{a}b,c} expands to \do{ab}\do{c} which isn't ideal. But that behavior doesn't justify the stripping braces around {b,c} which drastically changes the meaning.

I think I figured out an approach that works. I implemented it for the internal lists first and then adapted it to the comma-separated lists. More effort is involved in the comma-separated list case.

The internal list macros strip one layer of braces already, but I think with some care, the sentinel approach should work here as long as braces are added around the part that's being put back in the input. Something like this.

\long\def\etb@forlistloop#1#2{\etb@forlistloop@i{#2}#1|\etb@lst@sentinel|&}%
\long\def\etb@forlistloop@i#1#2|#3|{%
  \ifblank{#2}%
    {}%
    {#1{#2}}%
  \ifx\etb@lst@sentinel#3\@empty
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi
    {\listbreak}%
    {\etb@forlistloop@i{#1}{#3}|}%
}%

In the second to last line, the braces around #3 will be stripped by \etb@forlistloop@i but they're needed to prevent two layers of braces from being stripped. I.e., given

\def\foo{}
\listadd\foo{{{foo}}}
\listadd\foo{{{bar}}}

\dolistloop\foo expands to \do{{foo}}\do{{bar}} Using either \listadd\foo{foo} or \listadd\foo{{foo}} instead expands to \do{foo}. I believe my implementation preserves that behavior.

Fixing the comma-separated list versions take more work. But not a whole lot more. The key idea is to change \etb@lst@docsvlist and \etb@lst@forcsvlist such that the first argument contains the leading-space-stripped current item in the list. Here's my code.

\def\if@etb@sentinel#1{%
  \expandafter\ifx\expandafter\etb@lst@sentinel\@firstofone#1\@empty
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi
}

\def\etb@defparser@do#1#2{%
  \begingroup
  \edef\@tempa{%
    \endgroup
    \long\def\noexpand#1####1{%
      % Start with an empty (not just blank) first argument.
      \expandafter\noexpand
      \csname etb@lst@\expandafter\@gobble\string#1\endcsname
      {}\space####1\noexpand#2\noexpand\etb@lst@sentinel\noexpand#2&%
    }%
    \long\csdef{etb@lst@\expandafter\@gobble\string#1}####1####2\noexpand#2{%
      % If the argument is empty, there's nothing to do. Otherwise `\do{#1}`
      \noexpand\ifstrempty{####1}%
        {}%
        {\noexpand\do{####1}}%
      % If it's blank, it isn't the sentinel so pass an empty first argument to
      % this macro. Otherwise, check for the sentinel and break if
      % found, otherwise, expand the macro again, but this time strip leading
      % spaces from ####2 using the `\@firstofone`.
      \noexpand\ifblank{####2}%
        {%
          \expandafter\noexpand
          \csname etb@lst@\expandafter\@gobble\string#1\endcsname{}\space
        }%
        {%
          \noexpand\if@etb@sentinel{####2}%
            {\noexpand\listbreak}%
            {%
              \noexpand\expandafter
              \expandafter\noexpand
              \csname etb@lst@\expandafter\@gobble\string#1\endcsname
              \noexpand\expandafter{\noexpand\@firstofone####2}\space
            }%
        }%
    }%
  }%
  \@tempa
}

\def\etb@defparser@arg#1#2{%
  \begingroup
  \edef\@tempa{%
    \endgroup
    \long\def\noexpand#1####1####2{%
      % Start with an empty (not just blank) first argument.
      \expandafter\noexpand
      \csname etb@lst@\expandafter\@gobble\string#1\endcsname
      {}{####1}\space####2\noexpand#2\noexpand\etb@lst@sentinel\noexpand#2&%
    }%
    \long\csdef{etb@lst@\expandafter\@gobble\string#1}####1####2####3\noexpand#2{%
      % If the first argument is empty, there's nothing to do for this item.
      % Otherwise `#2{#1}`
      \noexpand\ifstrempty{####1}%
        {}%
        {####2{####1}}%
      % If it's blank, it isn't the sentinel so pass an empty first argument to
      % this macro. Otherwise, check for the sentinel and break if
      % found, otherwise, expand the macro again, but this time strip leading
      % spaces from ####3 using `\@firstofone`.
      \noexpand\ifblank{####3}%
        {%
          \expandafter\noexpand
          \csname etb@lst@\expandafter\@gobble\string#1\endcsname{}{####2}\space
        }%
        {%
          \noexpand\if@etb@sentinel{####3}%
            {\noexpand\listbreak}%
            {%
              \noexpand\expandafter
              \expandafter\noexpand
              \csname etb@lst@\expandafter\@gobble\string#1\endcsname
              \noexpand\expandafter{\noexpand\@firstofone\space####3}{####2}\space
            }%
        }%
    }%
  }%
  \@tempa
}

Note that these don't need \etb@listitem and \etb@listitem@i any longer because leading spaces (and braces) have already been stripped \@firstofone

The \if@etb@sentinel can also be used in \etb@forlistloop@i (that's why the \@empty is there, otherwise it's not necessary):

\long\def\etb@forlistloop@i#1#2|#3|{%
  \ifblank{#2}%
    {}%
    {#1{#2}}%
  \if@etb@sentinel{#3}
    {\listbreak}%
    {\etb@forlistloop@i{#1}{#3}|}%
}

Regardless of whether you want to use this new code for either the comma-separated lists or the internal list, I think you need to revert https://github.com/josephwright/etoolbox/commit/365a80ab84981130b84fb660cc929d2a5e249c7b because stripping braces is definitely the wrong behavior.

josephwright commented 6 years ago

@stevecheckoway Well brace/space behaviour is already not quite what I'd aim for in a new implementation: Try something like {a},b, {c},{d} , {e} with the older code ...

josephwright commented 6 years ago

I'm of course in the position with etoolbox that I don't have any tests set up: I guess I'll need to set some up here.

stevecheckoway commented 6 years ago

Looks like \do{a}\do{b}\do{c}\do{d }\do{e} (and a quick test confirms). The existing behavior may not be ideal, but we definitely shouldn't change it and https://github.com/josephwright/etoolbox/commit/365a80ab84981130b84fb660cc929d2a5e249c7b changed it.

josephwright commented 6 years ago

I'll set up some tests using l3build in the morning, along with reverting the commit if required and addressing the issue(s).

stevecheckoway commented 6 years ago

Tests are a great idea! I recommend a,{b,c} be one of them.

Running

\def\do#1{(#1)}
\def\expected{(a)(b,c)}
\edef\actual{\docsvlist{a,{b,c}}}
\ifx\expected\actual
  \typeout{Match^^J}
\else
  \typeout{Expected: \meaning\expected^^JActual: \meaning\actual^^J}
\fi

with the new code prints

Expected: macro:->(a)(b,c)
Actual: macro:->(a)(b)(c)

and prints Match with the old.

josephwright commented 6 years ago

@stevecheckoway Hopefully things are now working for the list parser case. I'll work on the other performance issues, and I'll give it a day or so before a release unless I hear back that it looks good to you.

stevecheckoway commented 6 years ago

These all look good to me and pass my minimal test cases. Using \@nil instead of \space is clever. Thanks for taking a look at these!