ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.33k stars, 599 forks

docs(blog): classification metrics on the backend #10501

Open IndexSeek opened 6 days ago

IndexSeek commented 6 days ago

Description of changes

Adding a blog post breaking down how to compute binary classification metrics with Ibis. I included a fair amount of background on these models and metrics because many Ibis users may not be familiar with these topics, but we can scale that back if needed and get to the point faster.

deepyaman commented 6 days ago

Seems like these would also be useful additions to IbisML! ibis_ml.metrics?

ibis-docs-bot[bot] commented 6 days ago

Docs preview: https://pr-10501-264061a498a2eea5081db287669bbe2c04f1b02a--ibis-quarto.netlify.app

IndexSeek commented 5 days ago

Seems like these would also be useful additions to IbisML! ibis_ml.metrics?

I think so! I have given that a good bit of thought and I think it would be worth adding that capability with IbisML. I opened feat: ibis_ml.metrics #174 over there, so hopefully, we can discuss further and plan the approach.

IndexSeek commented 5 days ago

Thanks for the review and the feedback! I agree. The way you demonstrated calculating the true positives, false positives, etc., does seem much more efficient. It also demonstrates how we can break apart calculations and use them in other expressions with Ibis.

Since you do explicitly make a point about performance, maybe it makes sense to show the more efficient method after going through the illustrative labeling approach?

This is a great idea! The illustrative approach helps cement the concepts, and the more efficient method would then demonstrate assigning expressions to variables and using them in other expressions, something that is far less convenient to do with pure SQL. I'm happy to incorporate this!

Edit: An alternative would be to just show the illustrative approach, add the efficient approach to IbisML, and call the IbisML function to demo the "efficient" path.

What if we add the efficient approach above to the article as it is now, and I follow up with another blog post on regression metrics? Then a third blog post could close out the series by calling back to the first two (e.g., "we've previously reviewed and demonstrated how to calculate classification and regression metrics with Ibis; in this post, we'll demonstrate how to perform these calculations out of the box with IbisML"), tying it all together into a nice mini-series of blog posts.

deepyaman commented 5 days ago

Sounds good to me! From my perspective, part of seeing your posts is also an indicator of what, if anything, somebody may actually want to use Ibis for in the ML space. Happy to use the blogs as a leading indicator. :)

IndexSeek commented 5 days ago

I just updated it to incorporate this approach. Thank you for sharing those snippets! Hopefully it flows well - I'm happy to adjust as necessary.

ibis-docs-bot[bot] commented 5 days ago

Docs preview: https://pr-10501-66dce135710001b077d7ae067124023f9a4282a3--ibis-quarto.netlify.app

IndexSeek commented 3 days ago

I'm ready to go with this one if we're good with it! (pending the date edit).

Thanks for your help and the thorough review @deepyaman, I think it greatly improves the post!

IndexSeek commented 13 hours ago

Hey @IndexSeek -- this looks good to me! Do you have a particular date you'd like to release it on?

I feel like @lostmygithubaccount would tell us to not publish it on a Friday.

Sweet! Thank you for the review and approval.

I think this upcoming Monday would work out well, given that later in the week many potential US readers would rather be consuming turkey than consuming information on classification metrics.

I edited my suggestion above so it is easier to tweak when we are ready to go if that date is okay.

ibis-docs-bot[bot] commented 12 hours ago

Docs preview: https://pr-10501-746edcb9a5f5ad004cab4de949c8ce5ba67d01d9--ibis-quarto.netlify.app

lostmygithubaccount commented 9 hours ago

I feel like @lostmygithubaccount would tell us to not publish it on a Friday.

generally wouldn't recommend publishing on Friday + a lot of people will be out all of next week for Thanksgiving. but idk, maybe people want something to read still

great blog! not necessary, but could be cool to demonstrate a plot of the confusion matrix with one of the visualization libraries

also this reminded me of what could be a cool follow up blog for using binary classification to detect data drift over time (described as two-sample tests here: https://arxiv.org/abs/1610.06545 and various other articles since). it's a really cool application and in theory Ibis + XGBoost or LightGBM makes it trivial to implement on a ton of backends