Closed karel-brinda closed 2 years ago
Interesting. So it is. I'll earmark this for when I can add this in, or I can accept a pull request if anyone is interested.
#!/usr/bin/env perl
# concatenate long strings
use strict;
use warnings;
use Benchmark ':all';
use Test::More 'no_plan';
use List::Util qw/shuffle/;
my $base= join("", shuffle("A".."Z", "a".."z")); # 52 letters in random order
my $numRepeating = 1e1;
my $expected = $base x $numRepeating;
sub incrementalConcat{
my $str="";
for(1..$numRepeating){
$str .= $base;
}
return $str;
}
sub arrayToString{
my @arr = ($base) x $numRepeating;
my $str = join("", @arr);
return $str;
}
is(incrementalConcat(), $expected, "incrementalConcat");
is(arrayToString() , $expected, "arrayToString");
printf("Running with Perl %s on %s\n%s\n", $^V, $^O, '-' x 80);
cmpthese(6e6, {
'incrementalConcat' => sub { incrementalConcat() },
'arrayToString' => sub { arrayToString() },
}
);
$ perl concatenate-strings.pl
ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.16.3 on linux
--------------------------------------------------------------------------------
Rate arrayToString incrementalConcat
arrayToString 561272/s -- -38%
incrementalConcat 909091/s 62% --
1..2
62% faster!
Thank you very much for checking this! I have unfortunately very limited experience with debugging and installing Perl packages (I use Bioconda for Mashtree). However, the update should be relatively simple:
I have added an option --sigfigs 10
in v1.2.2. Let me know if it helps, if you reduce the sigfigs to something like 5 instead of the default 10? I am hesitating to join a large array since it might slow down 62% on that step.
Thanks for the update. Agree that reducing significant digits should help as the total string length will go down, but this seems to be a two-edged sword given that it will probably also affect the resulting trees.
Now when I'm looking at the timing comparison, I'm not sure I understand it. Why is my $numRepeating = 1e1;
? Given our matrices our huge, with millions of elements, shouldn't the test use my $numRepeating = 1e6;
or something similar?
It would have just taken too long to run it with 6e6
comparisons. I lowered the comparisons to 6e2
and then changed $numRepeating = 1e6
to get this result, further emphasizing that incremental concatenation instead of joining is faster. This test took at least a few minutes to finish.
ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.16.3 on linux
--------------------------------------------------------------------------------
Rate arrayToString incrementalConcat
arrayToString 6.99/s -- -59%
incrementalConcat 17.1/s 145% --
1..2
It was a good point to test something closer to the size of our newick files and so I am glad that you indicated that test would be more informative. Thank you.
I moved the array creation line out of the arrayToString
function as this shouldn't be included in the time measurement, and reran the code on my computer.
#!/usr/bin/env perl
# concatenate long strings
use strict;
use warnings;
use Benchmark ':all';
use Test::More 'no_plan';
use List::Util qw/shuffle/;
my $base= join("", shuffle("A".."Z", "a".."z")); # 52 letters in random order
my $numRepeating = 1e2;
my $expected = $base x $numRepeating;
my @arr = ($base) x $numRepeating;
sub incrementalConcat{
my $str="";
for(1..$numRepeating){
$str .= $base;
}
return $str;
}
sub arrayToString{
my $str = join("", @arr);
return $str;
}
is(incrementalConcat(), $expected, "incrementalConcat");
is(arrayToString() , $expected, "arrayToString");
printf("Running with Perl %s on %s\n%s\n", $^V, $^O, '-' x 80);
cmpthese(6e4, {
'incrementalConcat' => sub { incrementalConcat() },
'arrayToString' => sub { arrayToString() },
}
);
Here's the result from my computer:
$ ./comp.pl
ok 1 - incrementalConcat
ok 2 - arrayToString
Running with Perl v5.26.2 on darwin
--------------------------------------------------------------------------------
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate incrementalConcat arrayToString
incrementalConcat 300000/s -- -70%
arrayToString 1000000/s 233% --
1..2
I see what you did, but the array creation is part of the timing for creating the large string.
When Mashtree is applied to a large number of genomes (e.g., 10k), the phylip conversion step seems to be the bottleneck. I'm no expert on Perl, but it seems to me that it might be due to the way how the phylip string (
$str
) is created – Mashtree uses the concatenation assignment operator.=
for extending the string (see e.g. https://github.com/lskatz/mashtree/blob/4e688856bc82fd4a06104ffcd26593326c6ea226/lib/Mashtree/Db.pm#L371). I'm unsure about how this is handled by Perl, but in Python this would be very slow and creating a list of strings that are joined in the end is the recommended practice. Google suggests that the same may apply to Perl too: https://stackoverflow.com/a/3104548/4641846.