move-coop / parsons

A python library of connectors for the progressive community.
Other
255 stars 125 forks source link

[Bug] GoogleBigQuery get_row_count() method is inefficient and should be changed #995

Open matthewkrausse opened 4 months ago

matthewkrausse commented 4 months ago

The current code runs this sql to get the row data.

sql = f"SELECT COUNT(*) AS row_count FROM {schema}.{table_name}"

But we should change it to this:

SELECT table_name, row_count FROM project.dataset.INFORMATION_SCHEMA.TABLES WHERE table_name = 'your_table_name'

Detailed Description

SELECT COUNT(*):

INFORMATION_SCHEMA.TABLES:

Therefore, by using the INFORMATION_SCHEMA.TABLES method, you only need to read a small amount of metadata instead of the entire table data, resulting in less data usage and improved performance.

To Reproduce

Your Environment

Additional Context

Add any other context about the problem here.

Priority

Please indicate whether fixing this bug is high, medium, or low priority for you. If the issue is time-sensitive for you, please let us know when you need it addressed by.

I believe this should be high priority to change as it could be costly for people using this method. I opened an issue mostly for discussion on this issue before writing the PR.

matthewkrausse commented 4 months ago

Actually I am reading that the row_count this way is an estimate and COUNT(*) is the best way to do this. That's annoying.

matthewkrausse commented 4 months ago

I'm going to keep looking into this to see if there may be a better way.

austinweisgrau commented 4 months ago

Nice catch

matthewkrausse commented 4 months ago

@austinweisgrau I'm not sure if you saw my additional comments but this seems to be not as straightforward. There is another method I see where we can query the metadata another way via the api.

from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("your_dataset").table("your_table_name")
table = client.get_table(table_ref)
row_count = table.num_rows

print(f"Row count: {row_count}")

However, this only works on tables, not views, and apparently is potentially delayed by a few seconds.

At this point in research it seems COUNT(*) may be the only up-to-date, accurate method for getting the row count.