PERF: closed-ref SQL improvements

thermokarst commented 4 years ago

This PR addresses a few issues related to closed-reference clustering. We offload a bunch of data manipulation to sqlite, however there were a few problems:

table input was a bottleneck in places
table output was a bottleneck in place
the query for identifying the representative sequence of a cluster was prone to exploding (due to a cartesian product).

After reviewing the sqlite docs, I discovered that we can rely on some dialect-specific behavior with respect to the aggregate function MAX --- basically when finding the max value of a group, sqlite will allow you to bind the bare columns of that query to other aspects of the statement, which allows us to skip the LEFT OUTER JOIN completely.

I did some psuedo benchmarks, using the ECAM dataset. I gave up at the ~4 hr mark when benchmarking the "before" case (I was producing millions and millions of virtual rows, all destined for deletion, since the query didn't need them). The "after": ~0.5hrs.

thermokarst commented 4 years ago

cc @gregcaporaso

thermokarst commented 4 years ago

I want to run some tests against some known datasets, just to make sure the adjustments proposed here don't change the results. The unit tests pass as-is, but, we all know unit tests are just one part of the story...

thermokarst commented 4 years ago

[x] https://docs.qiime2.org/2020.2/tutorials/otu-clustering/
[x] moving pictures, gg @85%

Okay, I'm reasonably well-convinced. If anyone has ideas for additional unit tests, I'm all ears.

thermokarst commented 4 years ago

One last check - I'll run ECAM through this (before and after changeset), to a) make sure results are the same and b) get detailed "real-world" runtime metrics.

thermokarst commented 4 years ago

Okay, just got the results back in ~200 minutes to cr cluster ecam dataset using 2020.2, while it was ~30 minutes using the changeset proposed in this PR.

Verified with:

In [1]: import qiime2                                                                                                                                                                                                                                                                                                                                

In [2]: import pandas as pd                                                                                                                                                                                                                                                                                                                          

In [3]: rs_old = qiime2.Artifact.load('2020.2-results/clustered_sequences.qza').view(pd.Series)                                                                                                                                                                                                                                                      

In [4]: rs_new = qiime2.Artifact.load('2020.5-results/clustered_sequences.qza').view(pd.Series)                                                                                                                                                                                                                                                      

In [5]: rs_old_sorted = rs_old.sort_index()                                                                                                                                                                                                                                                                                                          

In [6]: rs_new_sorted = rs_new.sort_index()                                                                                                                                                                                                                                                                                                          

In [7]: rs_old_sorted.equals(rs_new_sorted)                                                                                                                                                                                                                                                                                                          
Out[7]: True

In [8]: rs_new_sorted.equals(rs_old_sorted)                                                                                                                                                                                                                                                                                                          
Out[8]: True

In [9]: rs_new_sorted.equals(rs_new)                                                                                                                                                                                                                                                                                                                 
Out[9]: True

In [10]: rs_old_sorted.equals(rs_old)                                                                                                                                                                                                                                                                                                                
Out[10]: False

The reason there is some sorting of the index in there is because this PR does change the order of the clustered seqs, making it so that they are sorted by default.

gregcaporaso commented 4 years ago

@thermokarst, these checks look reasonable to me - seems convincing that you're getting the same results from the two different implementations.

thermokarst commented 4 years ago

Thanks @gregcaporaso! I'm pretty well convinced, too.

qiime2 / q2-vsearch

PERF: closed-ref SQL improvements #67