laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0
129 stars 12 forks source link

🚸 Improve search #2135

Closed sunnyosun closed 1 week ago

sunnyosun commented 2 weeks ago

Key improvements:

Note: centrocyte appears in the "b cell" search because there's a perfect match of "B cell" in the description, same for "t cell" results.

LaminDB Before LaminDB After Hub
Screenshot 2024-11-07 at 12 24 25 Screenshot 2024-11-07 at 12 42 48 Screenshot 2024-11-07 at 12 21 23
Screenshot 2024-11-07 at 12 24 31 Screenshot 2024-11-07 at 12 43 10 Screenshot 2024-11-07 at 12 22 12
Screenshot 2024-11-07 at 12 24 17 Screenshot 2024-11-07 at 12 43 18 Screenshot 2024-11-07 at 12 22 34
Screenshot 2024-11-07 at 13 10 44 Screenshot 2024-11-07 at 13 10 15 Screenshot 2024-11-07 at 13 11 40
Screenshot 2024-11-07 at 16 06 53 Screenshot 2024-11-07 at 21 43 05 Screenshot 2024-11-07 at 16 11 32
Screenshot 2024-11-07 at 16 09 37 Screenshot 2024-11-07 at 21 43 25 Screenshot 2024-11-07 at 16 11 58
codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.63%. Comparing base (57fbd29) to head (2ee90cb). Report is 29 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2135 +/- ## ========================================== + Coverage 92.53% 92.63% +0.10% ========================================== Files 55 55 Lines 6467 6508 +41 ========================================== + Hits 5984 6029 +45 + Misses 483 479 -4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

github-actions[bot] commented 2 weeks ago

🚀 Deployed on https://672d2c122dca6b5b514d2906--lamindb-qnwk.netlify.app

falexwolf commented 2 weeks ago

This single example looks fantastic!

I'm just worried that it will deteriorate other cases.

Can you run this against @Koncopd's benchmarking framework?

And then we have a before after comparison that includes a wider array of search cases and a report that underlies the decision published to LaminHub.

sunnyosun commented 2 weeks ago

Updated screenshots with the examples @Koncopd had, let me know if there's anything else I can test.

Koncopd commented 2 weeks ago

@sunnyosun @falexwolf here is the benchmark for this PR https://github.com/laminlabs/lamindb/pull/2141 https://672cb784f7316d7437f2f69d--lamindb-qnwk.netlify.app/faq/benchmark-search

falexwolf commented 2 weeks ago

These are great improvements!

I wish there was good way in laminhub to document such changes but I guess there isn't for now.

falexwolf commented 2 weeks ago

@fredericenard, please also take a look here.

And then we'll discuss in the benchmarking PR how we proceed with organizing code across lamindb, bionty and laminhub.

Koncopd commented 2 weeks ago

Added more searches to the benchmark. test_search_synonyms fails now, commented out in the benchmark.

sunnyosun commented 2 weeks ago

Fixed synonyms. The only thing is that it's getting a bit slower now 0.7-0.8s per search (before 0.2-0.3s) because of all the different layers. (us-west-2 instance)

Zethson commented 2 weeks ago

Ohh and can we get rid of this now? https://docs.lamin.ai/query-search image 🚀

falexwolf commented 2 weeks ago

Yes, removing will be the goal.

image

@Koncopd is making a push to consolidate all 3 search algorithms into one clearly documented and benchmarked solution.

Sunny's PR here has good ideas but it's not suitable for the hub due to performance. So, the hope is Sergei can replicate the UX with server-side code (essentially correctly using postgres plugins). We can still use Sunny's code for dataframes and sqlite; it'll give the same results but just run slower which is OK in that context.