The phylip conversion step too slow

karel-brinda commented 4 years ago

When Mashtree is applied to a large number of genomes (e.g., 10k), the phylip conversion step seems to be the bottleneck. I'm no expert on Perl, but it seems to me that it might be due to the way how the phylip string ($str) is created – Mashtree uses the concatenation assignment operator .= for extending the string (see e.g. https://github.com/lskatz/mashtree/blob/4e688856bc82fd4a06104ffcd26593326c6ea226/lib/Mashtree/Db.pm#L371). I'm unsure about how this is handled by Perl, but in Python this would be very slow and creating a list of strings that are joined in the end is the recommended practice. Google suggests that the same may apply to Perl too: https://stackoverflow.com/a/3104548/4641846.

lskatz commented 4 years ago

Interesting. So it is. I'll earmark this for when I can add this in, or I can accept a pull request if anyone is interested.

#!/usr/bin/env perl
# concatenate long strings
use strict;
use warnings;
use Benchmark ':all';
use Test::More 'no_plan';
use List::Util qw/shuffle/;

my $base= join("", shuffle("A".."Z", "a".."z")); # 52 letters in random order
my $numRepeating = 1e1;
my $expected = $base x $numRepeating;

sub incrementalConcat{
  my $str="";
  for(1..$numRepeating){
    $str .= $base;
  }
  return $str;
}

sub arrayToString{
  my @arr = ($base) x $numRepeating;
  my $str = join("", @arr);
  return $str;
}

is(incrementalConcat(), $expected, "incrementalConcat");
is(arrayToString()    , $expected, "arrayToString");

printf("Running with Perl %s on %s\n%s\n", $^V, $^O, '-' x 80);
cmpthese(6e6, {
        'incrementalConcat' => sub { incrementalConcat() },
        'arrayToString'     => sub { arrayToString() },
    }
);

$ perl concatenate-strings.pl
ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.16.3 on linux
--------------------------------------------------------------------------------
                      Rate     arrayToString incrementalConcat
arrayToString     561272/s                --              -38%
incrementalConcat 909091/s               62%                --
1..2

lskatz commented 4 years ago

62% faster!

karel-brinda commented 4 years ago

Thank you very much for checking this! I have unfortunately very limited experience with debugging and installing Perl packages (I use Bioconda for Mashtree). However, the update should be relatively simple:

creating an array for rows of the matrix
for every row (https://github.com/lskatz/mashtree/blob/4e688856bc82fd4a06104ffcd26593326c6ea226/lib/Mashtree/Db.pm#L367), creating a temporary array, filling it with individual strings (https://github.com/lskatz/mashtree/blob/4e688856bc82fd4a06104ffcd26593326c6ea226/lib/Mashtree/Db.pm#L371), calling join, calling the regex – https://github.com/lskatz/mashtree/blob/4e688856bc82fd4a06104ffcd26593326c6ea226/lib/Mashtree/Db.pm#L373 (it's currently called on the whole string many times which probably causes additional slowdown)
calling join on the main array

lskatz commented 3 years ago

I have added an option --sigfigs 10 in v1.2.2. Let me know if it helps, if you reduce the sigfigs to something like 5 instead of the default 10? I am hesitating to join a large array since it might slow down 62% on that step.

karel-brinda commented 3 years ago

Thanks for the update. Agree that reducing significant digits should help as the total string length will go down, but this seems to be a two-edged sword given that it will probably also affect the resulting trees.

Now when I'm looking at the timing comparison, I'm not sure I understand it. Why is my $numRepeating = 1e1; ? Given our matrices our huge, with millions of elements, shouldn't the test use my $numRepeating = 1e6; or something similar?

lskatz commented 3 years ago

It would have just taken too long to run it with 6e6 comparisons. I lowered the comparisons to 6e2 and then changed $numRepeating = 1e6 to get this result, further emphasizing that incremental concatenation instead of joining is faster. This test took at least a few minutes to finish.

ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.16.3 on linux
--------------------------------------------------------------------------------
                    Rate     arrayToString incrementalConcat
arrayToString     6.99/s                --              -59%
incrementalConcat 17.1/s              145%                --
1..2

lskatz commented 3 years ago

It was a good point to test something closer to the size of our newick files and so I am glad that you indicated that test would be more informative. Thank you.

karel-brinda commented 3 years ago

I moved the array creation line out of the arrayToString function as this shouldn't be included in the time measurement, and reran the code on my computer.

#!/usr/bin/env perl
# concatenate long strings
use strict;
use warnings;
use Benchmark ':all';
use Test::More 'no_plan';
use List::Util qw/shuffle/;

my $base= join("", shuffle("A".."Z", "a".."z")); # 52 letters in random order
my $numRepeating = 1e2;
my $expected = $base x $numRepeating;

my @arr = ($base) x $numRepeating;

sub incrementalConcat{
  my $str="";
  for(1..$numRepeating){
    $str .= $base;
  }
  return $str;
}

sub arrayToString{
  my $str = join("", @arr);
  return $str;
}

is(incrementalConcat(), $expected, "incrementalConcat");
is(arrayToString()    , $expected, "arrayToString");

printf("Running with Perl %s on %s\n%s\n", $^V, $^O, '-' x 80);
cmpthese(6e4, {
        'incrementalConcat' => sub { incrementalConcat() },
        'arrayToString'     => sub { arrayToString() },
    }
);

Here's the result from my computer:

$ ./comp.pl 
ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.26.2 on darwin
--------------------------------------------------------------------------------
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                       Rate incrementalConcat     arrayToString
incrementalConcat  300000/s                --              -70%
arrayToString     1000000/s              233%                --
1..2

lskatz commented 3 years ago

I see what you did, but the array creation is part of the timing for creating the large string.

lskatz / mashtree

The phylip conversion step too slow #55