Closed thermokarst closed 4 years ago
cc @gregcaporaso
I want to run some tests against some known datasets, just to make sure the adjustments proposed here don't change the results. The unit tests pass as-is, but, we all know unit tests are just one part of the story...
@85%
Okay, I'm reasonably well-convinced. If anyone has ideas for additional unit tests, I'm all ears.
One last check - I'll run ECAM through this (before and after changeset), to a) make sure results are the same and b) get detailed "real-world" runtime metrics.
Okay, just got the results back in ~200 minutes to cr cluster ecam dataset using 2020.2, while it was ~30 minutes using the changeset proposed in this PR.
Verified with:
In [1]: import qiime2
In [2]: import pandas as pd
In [3]: rs_old = qiime2.Artifact.load('2020.2-results/clustered_sequences.qza').view(pd.Series)
In [4]: rs_new = qiime2.Artifact.load('2020.5-results/clustered_sequences.qza').view(pd.Series)
In [5]: rs_old_sorted = rs_old.sort_index()
In [6]: rs_new_sorted = rs_new.sort_index()
In [7]: rs_old_sorted.equals(rs_new_sorted)
Out[7]: True
In [8]: rs_new_sorted.equals(rs_old_sorted)
Out[8]: True
In [9]: rs_new_sorted.equals(rs_new)
Out[9]: True
In [10]: rs_old_sorted.equals(rs_old)
Out[10]: False
The reason there is some sorting of the index in there is because this PR does change the order of the clustered seqs, making it so that they are sorted by default.
@thermokarst, these checks look reasonable to me - seems convincing that you're getting the same results from the two different implementations.
Thanks @gregcaporaso! I'm pretty well convinced, too.
This PR addresses a few issues related to closed-reference clustering. We offload a bunch of data manipulation to sqlite, however there were a few problems:
After reviewing the sqlite docs, I discovered that we can rely on some dialect-specific behavior with respect to the aggregate function
MAX
--- basically when finding the max value of a group, sqlite will allow you to bind the bare columns of that query to other aspects of the statement, which allows us to skip theLEFT OUTER JOIN
completely.I did some psuedo benchmarks, using the ECAM dataset. I gave up at the ~4 hr mark when benchmarking the "before" case (I was producing millions and millions of virtual rows, all destined for deletion, since the query didn't need them). The "after": ~0.5hrs.