Closed matthewkrausse closed 1 month ago
@matthewkrausse Do you think it might be worth changing this method to use the INFORMATION_SCHEMA
if it detects it's a table? Apparently that can be out of date sometimes, so maybe we could add a parameter to force it to do a COUNT(*)
query?
@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?
@Jason94 Yeah, that was what prompted this PR, I made an issue #995. I think that could be a good option but not sure how useful it is given the use case. Usually when I want the row count of a table it is to make a comparison. Thoughts?
That's a good point. I'm sure how much of a concern the computational resources of running a big COUNT() would be. Maybe make COUNT() a default, and add an extra boolean flag for something like fast_mode=False
that would switch it to use INFORMATION_SCHEMA?
Maybe that's not something that would actually be helpful to the way people are using this method in production code? Maybe we could keep it simple and just always use COUNT.
Yeah, I think it's useful when someone is just looking to see if a table has 0 records, 10 records or 10 million records.But, I don't feel like anyone is doing that programatically. They are likely just using the BQ Studio.
So I'm happy to merge this as is, but another approach would be to create another method like get_row_count_estimate
that uses the information_schema.tables
mentioned in https://github.com/move-coop/parsons/issues/995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, use get_row_count_estimate
" (and the docstring for get_row_count_estimate
could point people to get_row_count
for when accuracy is important or they don't care about performance.
So I'm happy to merge this as is, but another approach would be to create another method like
get_row_count_estimate
that uses theinformation_schema.tables
mentioned in https://github.com/move-coop/parsons/issues/995. Then in this doc warning, we could say "if an estimate is fine and you don't want the performance hit, useget_row_count_estimate
" (and the docstring forget_row_count_estimate
could point people toget_row_count
for when accuracy is important or they don't care about performance.
Let's go ahead and merge this as is.
This pull request adds a cautionary note about the performance of the get_row_count method in the BigQuery materialization. The note warns that using SELECT COUNT(*) can be expensive for large tables, especially those with many columns, as BigQuery scans all table data to perform the count. The note aims to inform developers about the potential performance impact and encourage them to consider alternative approaches when dealing with large tables.