rviscomi / har.fyi

https://har.fyi

Example query 1 is estimated to cost 42TB and is thus blocked by the recommended 1TB cost control. #10

Closed JannisBush closed 1 month ago

JannisBush commented 1 month ago

Hi, I am currently doing the guided tour and have set up a cost control of 1TB per day as recommended. When running the initial query:

# This query will process 898 MB when run.
%%bigquery df_preview --project httparchive
SELECT *
FROM `httparchive.all.pages`
WHERE date = '2024-05-01'
    AND client='desktop'
    AND is_root_page
    AND rank = 1000
    AND page = 'https://www.google.com/'

I receive the following error:

    ERROR: 403 Custom quota exceeded: Your usage exceeded the custom quota for QueryUsagePerDay, which is set by your administrator. For more information, see https://cloud.google.com/bigquery/cost-controls; reason: quotaExceeded, message: Custom quota exceeded: Your usage exceeded the custom quota for QueryUsagePerDay, which is set by your administrator. For more information, see https://cloud.google.com/bigquery/cost-controls.

When I copy the same query into the BigQuery console directly, the estimate shows "This query will process 42.14 TB when run." and the query fails with the same error message.
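For anyone else hitting this, the gap between the quota and the estimate can be sanity-checked up front. A minimal sketch in plain Python (the byte figures are the ones from this thread; binary terabyte units are an assumption, and in practice the estimate would come from a BigQuery dry run rather than being hard-coded):

```python
# Sketch: compare a query's estimated scan size against a daily
# cost-control quota before running it. Figures are from this thread;
# a real pre-flight check would get the estimate from a BigQuery dry run.

TB = 1024 ** 4  # assuming binary terabytes (TiB)

def within_quota(estimated_bytes: int, quota_bytes: int = 1 * TB) -> bool:
    """Return True if the estimated scan fits under the daily quota."""
    return estimated_bytes <= quota_bytes

# Actual bytes scanned once clustering is applied (~898 MB)
print(within_quota(int(898e6)))       # -> True, fits the 1 TB quota
# The console's pessimistic estimate (~42.14 TB) trips the quota
print(within_quota(int(42.14 * TB)))  # -> False
```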

rviscomi commented 1 month ago

This is a bug in BigQuery, unfortunately. The query filters on clustered fields (client, is_root_page, rank) but the byte estimate doesn't always take that into consideration. For example, I see the correct estimate:

[screenshot: the correct byte estimate for the same query in the BigQuery console]

If you're running your analysis as part of the Web Almanac, you should be temporarily added as a member of HTTP Archive's GCP project, so that 100% of your costs are billed to us. Many Web Almanac queries routinely (and necessarily) exceed the 1 TB limit, so in that case it's OK to remove the cost controls, as long as you're sure you're running the queries against the HTTP Archive project. To verify this, look for project=httparchive in the URL and HTTP Archive selected in the project list:

[screenshot: the BigQuery console URL and project selector showing the httparchive project]

In any case, I agree that this is a poor UX and it would be much better if BigQuery used more accurate estimates. I'll see if I can nudge any internal bugs about this.

rviscomi commented 1 month ago

@max-ostapenko is it worth adding a note about this to the docs?

max-ostapenko commented 1 month ago

@JannisBush I see you're part of the Web Almanac analysts team. I strongly urge you to keep quotas in place for any personal or other projects, exactly as you're doing. Thanks to this error you can actually be aware that you're billing queries to your own project. Please switch to the httparchive project.

@rviscomi it seems it's time to bring an article on billing matters to har.fyi; the bytes-estimate issue will go there.

  1. Any notes on the article content?
  2. Will add more details to the WebAlmanac billing wiki.

rviscomi commented 1 month ago

How about adding it to https://har.fyi/guides/minimizing-costs/ for now? I was also thinking that the guided tour could have a note about it for anyone else who gets mixed messaging about the query estimates.

JannisBush commented 1 month ago

@max-ostapenko Yes, I am part of the WebAlmanac team but still need to be added to the httparchive project. For my initial tests to get to know the HTTP Archive, I used a personal project.

The guide will probably be used by people outside the WebAlmanac team in the future. I would suggest only using queries that fit within the free 1 TB tier, so that anyone can safely follow the full guide.

@rviscomi Could it be that the estimate is correct for you because you have already run the query once?

max-ostapenko commented 1 month ago

@JannisBush that's a good point! Most of the queries are covered by the free tier, but there is one 800+ GB query we could sample.
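The arithmetic for sampling that query can be sketched quickly. The 800 GB figure is from this comment; the 10% rate is an arbitrary illustration, and in BigQuery the sampling itself would be done with a `TABLESAMPLE SYSTEM (N PERCENT)` clause, which reads roughly N% of the table's blocks:

```python
# Sketch: approximate the scan size of a sampled query and check it
# against the free tier. The 800 GB figure is from this thread; the
# 10% sampling rate is an arbitrary illustration.
GB = 1024 ** 3
FREE_TIER = 1024 * GB  # BigQuery's 1 TB/month of free on-demand processing

def sampled_bytes(full_scan_bytes: int, percent: float) -> int:
    """Approximate bytes scanned when sampling `percent` of the table."""
    return int(full_scan_bytes * percent / 100)

full = 800 * GB
print(sampled_bytes(full, 10) < FREE_TIER)  # -> True; ~80 GB fits easily
```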

Another issue I see is that each query is defined to run under the 'httparchive' project, but readers may not have access to it.

Regarding the estimate difference: the query estimator benefits from additional access to the related table metadata.