[Bug] GoogleBigQuery get_row_count() method is inefficient and should be changed

matthewkrausse commented 4 months ago

The current code runs this SQL to get the row count:

sql = f"SELECT COUNT(*) AS row_count FROM {schema}.{table_name}"

But we should change it to this:

SELECT table_name, row_count FROM project.dataset.INFORMATION_SCHEMA.TABLES WHERE table_name = 'your_table_name'

Detailed Description

SELECT COUNT(*):

Scans all rows in the table to count them individually.
Reads the entire data size of the table.
Can be computationally expensive, especially for large tables.

INFORMATION_SCHEMA.TABLES:

Retrieves metadata about the table from the INFORMATION_SCHEMA.TABLES view.
This view stores pre-aggregated information about tables, including row count.
Only reads a small amount of data from the INFORMATION_SCHEMA.TABLES view.
Is significantly faster and cheaper than scanning the entire table.

Therefore, by using the INFORMATION_SCHEMA.TABLES method, you only need to read a small amount of metadata instead of the entire table data, resulting in less data usage and improved performance.
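One way to check the cost claim before changing anything is to ask BigQuery for a dry-run estimate of each candidate query. The sketch below is illustrative only: `compare_dry_run_bytes` is a hypothetical helper, not part of Parsons, and it assumes the caller passes a `google.cloud.bigquery.Client` plus a `bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)`, so that no query actually runs.

```python
from typing import Any, Dict


def compare_dry_run_bytes(
    client: Any, queries: Dict[str, str], dry_run_config: Any
) -> Dict[str, int]:
    """Map each label to the bytes BigQuery estimates its query would process.

    Assumptions (not taken from the Parsons code base):
      * `client` behaves like google.cloud.bigquery.Client
      * `dry_run_config` is bigquery.QueryJobConfig(dry_run=True,
        use_query_cache=False), so nothing is executed or billed
    """
    estimates = {}
    for label, sql in queries.items():
        # With dry_run=True the returned job is already finished and only
        # carries statistics such as total_bytes_processed.
        job = client.query(sql, job_config=dry_run_config)
        estimates[label] = job.total_bytes_processed
    return estimates


# Hypothetical usage (table names are placeholders):
#
#   from google.cloud import bigquery
#   config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
#   compare_dry_run_bytes(
#       bigquery.Client(),
#       {"count_star": "SELECT COUNT(*) FROM your_dataset.your_table_name"},
#       config,
#   )
```

Adding the metadata query under a second label makes it possible to compare both estimates for a given table rather than relying on the general description above.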

Priority

I believe this should be high priority, as the current method could be costly for people using it. I opened this issue mostly to discuss the change before writing a PR.

matthewkrausse commented 4 months ago

Actually, I'm now reading that the row_count obtained this way is an estimate, and that COUNT(*) is the best way to get an exact count. That's annoying.

matthewkrausse commented 4 months ago

I'm going to keep looking into this to see if there may be a better way.

austinweisgrau commented 4 months ago

Nice catch

matthewkrausse commented 4 months ago

@austinweisgrau I'm not sure if you saw my additional comments, but this turns out to be less straightforward than I thought. There is another approach I've found: reading the table metadata through the API instead of SQL.

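from google.cloud import bigquery

# Uses application default credentials and the client's default project.
client = bigquery.Client()
# Client.dataset() is deprecated; get_table() accepts a "dataset.table" string.
table = client.get_table("your_dataset.your_table_name")
# num_rows is read from the table metadata, so no query job is run.
row_count = table.num_rows

print(f"Row count: {row_count}")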

However, this only works on tables, not views, and the value can apparently lag behind the table's actual contents by a few seconds.

At this point in my research, it seems COUNT(*) may be the only up-to-date, accurate method for getting the row count.
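If both behaviours turn out to be useful, one possible compromise is to keep COUNT(*) as the default and let callers opt in to the metadata value when an approximate count is acceptable. This is a minimal sketch under stated assumptions, not the Parsons implementation: the standalone `get_row_count` function, its `exact` flag, and the fallback rule are all hypothetical, and `client` is assumed to behave like `google.cloud.bigquery.Client`.

```python
from typing import Any


def get_row_count(client: Any, table_id: str, exact: bool = True) -> int:
    """Return the row count for `table_id` ("dataset.table" or
    "project.dataset.table").

    Hypothetical helper: `client` is assumed to behave like
    google.cloud.bigquery.Client.
    """
    if not exact:
        # Metadata lookup: no query job, but the value may lag slightly
        # and is not meaningful for views.
        table = client.get_table(table_id)
        if table.table_type == "TABLE" and table.num_rows is not None:
            return int(table.num_rows)
    # Default path, and fallback for views or missing metadata:
    # an exact count computed by a query.
    sql = f"SELECT COUNT(*) AS row_count FROM `{table_id}`"
    rows = client.query(sql).result()
    return int(next(iter(rows))["row_count"])
```

Defaulting to `exact=True` keeps the existing behaviour for current callers, so the cheaper path is only used where a slightly stale number is explicitly acceptable.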

move-coop / parsons