move-coop / parsons

A python library of connectors for the progressive community.
Other
255 stars 125 forks source link

Add cautionary note about performance in get_row_count method #998

Closed matthewkrausse closed 1 month ago

matthewkrausse commented 4 months ago

This pull request adds a cautionary note about the performance of the get_row_count method in the BigQuery materialization. The note warns that using SELECT COUNT(*) can be expensive for large tables, especially those with many columns, as BigQuery scans all table data to perform the count. The note aims to inform developers about the potential performance impact and encourage them to consider alternative approaches when dealing with large tables.

Jason94 commented 4 months ago

@matthewkrausse Do you think it might be worth changing this method to use the INFORMATION_SCHEMA if it detects it's a table? Apparently that can be out of date sometimes, so maybe we could add a parameter to force it to do a COUNT(*) query?

https://www.metaplane.dev/blog/four-efficient-techniques-to-retrieve-row-counts-in-bigquery-tables-and-views

matthewkrausse commented 4 months ago

@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?

Jason94 commented 4 months ago

@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?

That's a good point. I'm sure how much of a concern the computational resources of running a big COUNT() would be. Maybe make COUNT() a default, and add an extra boolean flag for something like fast_mode=False that would switch it to use INFORMATION_SCHEMA?

Maybe that's not something that would actually be helpful to the way people are using this method in production code? Maybe we could keep it simple and just always use COUNT.

matthewkrausse commented 4 months ago

Yeah, I think it's useful when someone is just looking to see if a table has 0 records, 10 records or 10 million records.But, I don't feel like anyone is doing that programatically. They are likely just using the BQ Studio.

shaunagm commented 4 months ago

So I'm happy to merge this as is, but another approach would be to create another method like get_row_count_estimate that uses the information_schema.tables mentioned in https://github.com/move-coop/parsons/issues/995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, use get_row_count_estimate" (and the docstring for get_row_count_estimate could point people to get_row_count for when accuracy is important or they don't care about performance.

matthewkrausse commented 1 month ago

So I'm happy to merge this as is, but another approach would be to create another method like get_row_count_estimate that uses the information_schema.tables mentioned in https://github.com/move-coop/parsons/issues/995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, use get_row_count_estimate" (and the docstring for get_row_count_estimate could point people to get_row_count for when accuracy is important or they don't care about performance.

Let's go ahead and merge this as is.