Open matthewkrausse opened 4 months ago
Actually I am reading that the row_count this way is an estimate and COUNT(*) is the best way to do this. That's annoying.
I'm going to keep looking into this to see if there may be a better way.
Nice catch
@austinweisgrau I'm not sure if you saw my additional comments but this seems to be not as straightforward. There is another method I see where we can query the metadata another way via the api.
from google.cloud import bigquery
client = bigquery.Client()
table_ref = client.dataset("your_dataset").table("your_table_name")
table = client.get_table(table_ref)
row_count = table.num_rows
print(f"Row count: {row_count}")
However, this only works on tables, not views, and apparently is potentially delayed by a few seconds.
At this point in research it seems COUNT(*) may be the only up-to-date, accurate method for getting the row count.
The current code runs this sql to get the row data.
sql = f"SELECT COUNT(*) AS row_count FROM
{schema}.{table_name}
"But we should change it to this:
SELECT table_name, row_count FROM project.dataset.INFORMATION_SCHEMA.TABLES WHERE table_name = 'your_table_name'
Detailed Description
SELECT COUNT(*):
INFORMATION_SCHEMA.TABLES:
Therefore, by using the INFORMATION_SCHEMA.TABLES method, you only need to read a small amount of metadata instead of the entire table data, resulting in less data usage and improved performance.
To Reproduce
Your Environment
Additional Context
Add any other context about the problem here.
Priority
Please indicate whether fixing this bug is high, medium, or low priority for you. If the issue is time-sensitive for you, please let us know when you need it addressed by.
I believe this should be high priority to change as it could be costly for people using this method. I opened an issue mostly for discussion on this issue before writing the PR.