runt18 / google-bigquery

Automatically exported from code.google.com/p/google-bigquery

providing most recent lastModifiedTime of all tables involved in a query in dryRun #302

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi,

We would like a response field that reports the most recent lastModifiedTime 
across all tables referenced by a query, for example:

{
 "kind": "bigquery#queryResponse",
 "jobReference": {
  "projectId": "owox-demo"
 },
 "totalBytesProcessed": "6769177",
 "jobComplete": true,
 "lastModifiedTime": "1439483276700",
 "cacheHit": false
}

It would be great to have it available in dryRun.
dryRun already sums the bytes across all tables involved, so it could similarly 
compute the latest lastModifiedTime of those tables.
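
A minimal sketch of the aggregation being asked for, done client-side. It assumes the caller has already fetched each referenced table's metadata and that every entry carries lastModifiedTime as a string of milliseconds since the epoch, as in the response above. The function name and the table IDs are hypothetical.

```python
def latest_last_modified_ms(tables):
    """Return the most recent lastModifiedTime (ms since epoch) among tables."""
    if not tables:
        raise ValueError("at least one table is required")
    # The value arrives as a string, so convert before comparing.
    return max(int(t["lastModifiedTime"]) for t in tables)


# Hypothetical metadata for two tables referenced by one query.
tables = [
    {"id": "owox-demo:dataset.table_a", "lastModifiedTime": "1439483276700"},
    {"id": "owox-demo:dataset.table_b", "lastModifiedTime": "1439400000000"},
]
print(latest_last_modified_ms(tables))
```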

Users could compare the value of this parameter with the timestamp of 
previously queried data and decide whether to run the query again.
This would save processing power for Google and query costs for users.
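
The comparison could look like the following sketch. It assumes the dryRun response carries the proposed lastModifiedTime field from this request and that the application stores the time of its previous run in milliseconds since the epoch. needs_rerun and last_run_ms are hypothetical names.

```python
import json


def needs_rerun(dry_run_response, last_run_ms):
    """True if any input table changed after the previous run."""
    # lastModifiedTime is the field proposed in this request, sent as a string.
    last_modified_ms = int(dry_run_response["lastModifiedTime"])
    return last_modified_ms > last_run_ms


response = json.loads("""
{
  "kind": "bigquery#queryResponse",
  "jobReference": {"projectId": "owox-demo"},
  "totalBytesProcessed": "6769177",
  "jobComplete": true,
  "lastModifiedTime": "1439483276700",
  "cacheHit": false
}
""")

# Previous run happened before the last table change, so re-run the query.
print(needs_rerun(response, last_run_ms=1439400000000))
```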

Original issue reported on code.google.com by bvz2001 on 13 Aug 2015 at 4:56

GoogleCodeExporter commented 8 years ago
Thanks for the suggestion.

You can already get some of this functionality via the existing query cache. 
See the "Ensuring cached query results" section here:

https://cloud.google.com/bigquery/querying-data#querycaching

Basically, you can run a query that will only succeed if it hits the cache. 
You are never charged for running this query, and if it fails, you can 
choose whether to pay to re-run it.
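
As a rough sketch of such a request: it assumes the job is submitted through jobs.insert and that the cache-only behaviour is obtained by setting createDisposition to CREATE_NEVER in the query configuration, as the linked section describes. The helper name and the sample SQL are hypothetical, and the snippet only builds the request body; sending it is left out.

```python
def cache_only_query_job(project_id, sql):
    """Build a jobs.insert body for a query that fails unless it is cached."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "query": {
                "query": sql,
                # No destination table may be created, so the job can only
                # succeed when cached results are available.
                "createDisposition": "CREATE_NEVER",
                "useQueryCache": True,
            }
        },
    }


body = cache_only_query_job("owox-demo", "SELECT 1")
print(body["configuration"]["query"]["createDisposition"])
```

If the job fails, the caller can submit the same query again without the createDisposition override and pay for a normal run.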

And of course, you can also run using the normal caching mechanism to get 
standard best-effort caching.

The downside compared to the feature you requested is that the caching window 
is only 24 hours, so you will always get a cache miss after that time, even if 
the query results haven't changed.

Original comment by jcon...@google.com on 13 Aug 2015 at 5:35

GoogleCodeExporter commented 8 years ago
Hi,
Unfortunately, the existing query cache will not help in this case.
The cache is shared across all users.
So if anybody else requests the query results and my application then requests 
the same query, it will hit the cache anyway.

So it can't provide consistent behaviour for an automated dataflow.

Please consider my FR or suggest a solution.

Original comment by bvz2001 on 14 Aug 2015 at 4:05

GoogleCodeExporter commented 8 years ago
Actually, the query cache is per user. But we will consider your feature 
request, since there are cases it covers that are not covered by the existing 
cache.

Original comment by jcon...@google.com on 14 Aug 2015 at 4:44

GoogleCodeExporter commented 8 years ago

Original comment by thomasp...@google.com on 24 Nov 2015 at 9:14

GoogleCodeExporter commented 8 years ago
I would also like to see this feature. According to the API JavaDoc, query 
caching is not available when destinationTable is set, but we set it in 
ETL-style pipelines, and it would be great to be able to skip certain steps.

Original comment by nevi...@spotify.com on 10 Dec 2015 at 7:24

GoogleCodeExporter commented 8 years ago
Hi Jeremy and Thomas,

Is there any hope that this FR will be released in 2016?
The complexity of our BQ pipelines is growing, and they require smarter query 
scheduling than "daily/weekly".

Original comment by m.ostape...@owox.com on 12 Apr 2016 at 10:47