googleapis / google-cloud-java

Google Cloud Client Library for Java
https://cloud.google.com/java/docs/reference
Apache License 2.0
1.9k stars 1.07k forks source link

Iterating BigQuery query results produces incorrect number of rows #1612

Closed polleyg closed 7 years ago

polleyg commented 7 years ago

This is in relation to post on Stack Overflow post here.

As per the sample code here, a user should be be able to paginate the results of a BigQuery query using the following code structure:

while (result != null) {
      Iterator<List<FieldValue>> iter = result.iterateAll();
      while (iter.hasNext()) {
        List<FieldValue> row = iter.next();
        //do something with row/data
        System.out.println(row);
      }
      result = result.getNextPage();
    }
  }

However, when the result set is large (in my tests >31,000 rows), more rows are returned/iterated even though calling result.getTotalRows() returns the correct/expected number of rows.

For example, I have a table with 85,250 rows (9.45 MB). When I query it via the Java API, and use the code from the example above, it actually iterates 160,296 times.

Even if I limit the result set in the query using limit 5000, and set setPageSize(1000L), then it iterates 15,000 times e.g:

QueryRequest queryRequest = QueryRequest
                .newBuilder("SELECT * FROM [<my-project-id>:<dataset>.<table_with_85250_rows>] limit 5000")
                .setUseLegacySql(true)
                .setPageSize(1000L)
                .build();
        QueryResponse response = bigQuery.query(queryRequest);
        QueryResult result = response.getResult();
        System.out.println("Total rows: " + result.getTotalRows());
        Integer rowNumber = 1;
        while(result != null){
            Iterator<List<FieldValue>> iter = result.iterateAll();
            while(iter.hasNext()){
                List<FieldValue> row = iter.next();
                System.out.println("Row: " + rowNumber + ", with number of columns: " + row.size());
                rowNumber++;
            }
            result = result.getNextPage();
        }

Output:

Total rows: 5000
Row: 1, with number of columns: 11
Row: 2, with number of columns: 11
Row: 3, with number of columns: 11
Row: 4, with number of columns: 11
Row: 5, with number of columns: 11
Row: 6, with number of columns: 11
Row: 7, with number of columns: 11
Row: 8, with number of columns: 11
Row: 9, with number of columns: 11
Row: 10, with number of columns: 11
Row: 11, with number of columns: 11
[.....]
Row: 14997, with number of columns: 11
Row: 14998, with number of columns: 11
Row: 14999, with number of columns: 11
Row: 15000, with number of columns: 11
kamakay commented 7 years ago

Any news? This is a kind of blocking situazion when using API.

polleyg commented 7 years ago

This is not a bug, but rather confusing documentation/examples. See updated SO question - http://stackoverflow.com/questions/42099842/bigquery-how-to-iterate-over-results-using-java-api

garrettjonesgoogle commented 7 years ago

Closing out - see the Stack Overflow question.