Closed · convers39 closed this 1 month ago
BTW, Google support responded with 'ask our sales or account team', so now I can only rely on GitHub :(
Hopefully someone can follow up on this question 🙏
I found that the mapping in the Python library looks the same: https://github.com/googleapis/python-bigquery-storage/blob/54f9d21db873db50b505b97f46019ba89e709b00/google/cloud/bigquery_storage_v1/types/storage.py#L827-L828
I wonder if this repo has a product owner or any dedicated team at all, because people seem to be entirely on their own in here. And all the APIs feel pretty "alpha" and WIP.
It's weird.
@convers39 Sorry for the late reply on this. I was out during the time this issue landed and ended up missing it, so I never followed up here; I'm very sorry.
To answer some of your questions:
I tried to decode the Buffer in error.details with toString, and indeed I found the ALREADY_EXISTS keyword. Now I don't have any idea what the error code will be for an OUT_OF_RANGE error, or where to find the correct error code list.
This is indeed confusing, but let me explain. The error code 6 that you got is the gRPC status code ALREADY_EXISTS, which is thrown if the user provides an offset that has already been written to; gRPC status 11 is OUT_OF_RANGE, which is thrown if an attempt is made to append at an offset beyond the current end of the stream. Alongside those gRPC status codes, you can get different Storage Error codes, which are the ones you can parse using this utility here, and those carry the extra error codes that you already found.
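For reference, a minimal sketch showing that these numeric codes line up with the standard gRPC status enum (assuming the @grpc/grpc-js package, which the Node client uses for transport):

import { status as GrpcStatus } from '@grpc/grpc-js';

// The numeric codes discussed above map to the standard gRPC status enum:
console.log(GrpcStatus.ALREADY_EXISTS); // 6  - offset was already written to
console.log(GrpcStatus.OUT_OF_RANGE);   // 11 - offset is beyond the end of the stream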
To clarify: if an offset is specified, it is checked against the end of the stream, and you can retry with an adjusted offset within the same StreamConnection. If no offset is specified, the append happens at the end of the stream.
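As a hedged illustration of that behavior (writer is a managedwriter.JSONWriter as in the full example further down; rows1, rows2 and rows3 are placeholder batches):

// Explicit offsets: each append states the row index where its batch must land.
const pw1 = writer.appendRows(rows1 as JSONList, 0); // rows 0..rows1.length-1
const pw2 = writer.appendRows(rows2 as JSONList, rows1.length); // next offset
// No offset: the append happens at the current end of the stream.
const pw3 = writer.appendRows(rows3 as JSONList);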
Will the error like OFFSET_ALREADY_EXISTS be thrown somewhere? Or do I have to check the result for each PendingWrite?
The OFFSET_ALREADY_EXISTS error is an AppendRowsResponse-level error, so it should come only as part of the response that you get from PendingWrite.getResult (which is, in the end, an AppendRowsResponse), and it will be on its error attribute.
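In other words, a rough sketch (pendingWrite here is a placeholder for any PendingWrite returned by the writer):

const response = await pendingWrite.getResult();
if (response.error != null) {
  // response.error is a google.rpc.Status-like object: { code, message, details }.
  // OFFSET_ALREADY_EXISTS surfaces here rather than as a thrown exception.
  console.error(response.error.code, response.error.message);
}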
Do I need to control the offset on my application side instead of passing an empty parameter? The doc says I should manage the stream offset to achieve exactly-once semantics. Meanwhile, I also tried with an empty offset parameter and could not find any difference in behavior; I am not sure if an empty offset will produce duplicate insertions.
This answer depends on your application's needs. Since you mentioned that you need exactly-once semantics, it's better to use offsets. This page contains a guide on which WriteStream type is best for your use case: https://cloud.google.com/bigquery/docs/write-api. But yes, if you use a Default stream, for example, it is not guaranteed that you won't get duplicate insertions.
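To make that trade-off concrete, here is a minimal sketch of both options inside an async context (destinationTable is a placeholder; managedwriter.DefaultStream is an assumption, following the pattern of the library's default-stream sample):

import { managedwriter } from '@google-cloud/bigquery-storage';

const writeClient = new managedwriter.WriterClient();

// Committed stream + client-managed offsets: exactly-once within the stream
// (this is what the full implementation below uses).
const writeStream = await writeClient.createWriteStreamFullResponse({
  streamType: managedwriter.CommittedStream,
  destinationTable,
});

// Default stream: at-least-once semantics, so duplicates are possible on retry.
const connection = await writeClient.createStreamConnection({
  streamId: managedwriter.DefaultStream,
  destinationTable,
});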
@larssn sorry that you feel that way. We do have a team, and I'm currently the owner of this repo. I have been trying to keep all issues answered and resolved here, and the open issue count is much lower than it was in the past. The open issues that we have are mostly internal feature requests/improvements which we can't work on right now due to other work streams.
We also have work happening on the BigQuery Storage Read API, which is going to make that API easier to use and let you fetch results faster and with less memory. It's similar to this Write veneer, which didn't even exist a year ago and now makes users' lives easier; we already have other customer success stories with it.
Again, sorry for the super late reply here.
@alvarowolfx thank you for your follow-up
I have proceeded with my implementation based on what I got from the pendingWrite.getResult response.
And thanks for clarifying the OUT_OF_RANGE and ALREADY_EXISTS error codes. I could reproduce OUT_OF_RANGE by passing a bigger offset than expected, which does respond with error code 11.
So, as you said, I need to manage the offset value on my client side so that the OUT_OF_RANGE error never happens, and just ignore ALREADY_EXISTS in case it occurs.
Currently, I have my implementation as below. I may adjust the error handling to key off the gRPC status code instead (see the sketch after the code).
import { adapt, managedwriter } from '@google-cloud/bigquery-storage';

// NOTE: Logger, LoggerProtocol, GrpcStatusError, JSONList, PendingWrite and
// BQ_INSERT_DATA_CHUNK_SIZE are imported/defined elsewhere in the app.
export const streamDataToBq = async <T extends Array<Record<string, unknown>>>({
  destinationTable,
  data,
  logger,
}: {
  destinationTable: string;
  data: T;
  logger: Logger;
}) => {
  const streamType = managedwriter.CommittedStream;
  const writeClient = new managedwriter.WriterClient();
  logger.info('preparing write stream');
  const writeStream = await writeClient.createWriteStreamFullResponse({
    streamType,
    destinationTable,
  });
  if (writeStream.tableSchema == null) {
    throw new Error(`tableSchema for table '${destinationTable}' is undefined`);
  }
  // Build a proto2 descriptor from the table schema for the JSONWriter.
  const protoDescriptor = adapt.convertStorageSchemaToProto2Descriptor(
    writeStream.tableSchema,
    'root',
  );
  if (writeStream.name == null) {
    throw new Error(`writeStream.name for table '${destinationTable}' is undefined`);
  }
  const streamId = writeStream.name;
  const connection = await writeClient.createStreamConnection({
    streamId,
  });
  logger.info(`Stream connection created: ${streamId}`);
  const writer = new managedwriter.JSONWriter({
    connection,
    protoDescriptor,
  });
  try {
    logger.info('appending data to write stream');
    // Manage offsets client-side: each chunk is appended at the row index
    // where the previous chunk ended.
    let currentOffset = 0;
    const pendingWrites = [];
    while (currentOffset < data.length) {
      const dataChunk = data.slice(
        currentOffset,
        currentOffset + BQ_INSERT_DATA_CHUNK_SIZE,
      );
      const pw = writer.appendRows(dataChunk as JSONList, currentOffset);
      pendingWrites.push(pw);
      currentOffset += dataChunk.length;
      logger.info('pending write pushed', {
        currentOffset,
      });
    }
    const results = await Promise.all(
      pendingWrites.map((pw) => handleAppendResult(pw, logger)),
    );
    logger.info('data inserted');
    return results;
  } catch (e: unknown) {
    // ...
  } finally {
    logger.info('close stream connection');
    await connection.finalize();
  }
};
const handleAppendResult = async (pw: PendingWrite, logger: LoggerProtocol) => {
  const { appendResult, error, rowErrors } = await pw.getResult();
  if (appendResult?.offset?.value == null) {
    logger.warn('No offset returned in appendResult');
  }
  if (error != null) {
    // error.details holds Any-encoded Buffers; decode them for inspection.
    const errorDetails =
      error.details?.map((b) => ({
        ...b,
        value: b.value?.toString(),
      })) ?? [];
    // NOTE: https://cloud.google.com/bigquery/docs/write-api-best-practices#manage_stream_offsets_to_achieve_exactly-once_semantics
    // ALREADY_EXISTS means the offset was already written, so it is safe to ignore.
    if (errorDetails.some((e) => e.value?.includes('ALREADY_EXISTS'))) {
      logger.warn('error "OFFSET_ALREADY_EXISTS" occurred, safely ignored');
      return appendResult;
    }
    logger.error('error occurred while writing data', {
      ...error,
      errorDetails,
    });
    throw new GrpcStatusError(error);
  }
  if (rowErrors != null && rowErrors.length > 0) {
    const errMessages = rowErrors.map((e) => e.message);
    logger.error('rowErrors occurred', { errMessages });
    throw new AggregateError(
      rowErrors,
      'RowErrors occurred while inserting data',
    );
  }
  return appendResult;
};
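Regarding keying the error handling off the gRPC status code instead of string matching, here is a hedged sketch of how the ALREADY_EXISTS branch in handleAppendResult could look (assuming the status enum from @grpc/grpc-js, which matches the numeric codes 6 and 11 discussed above):

import { status as GrpcStatus } from '@grpc/grpc-js';

// Inside handleAppendResult, instead of decoding error.details and
// string-matching 'ALREADY_EXISTS', compare the numeric code directly.
if (error.code === GrpcStatus.ALREADY_EXISTS) {
  logger.warn('offset already written, safely ignored');
  return appendResult;
}
if (error.code === GrpcStatus.OUT_OF_RANGE) {
  // Should not happen when offsets are managed client-side as above.
  logger.error('offset is beyond the current end of the stream');
}
throw new GrpcStatusError(error);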
Super glad to hear it, because in the past there was little to no response in this repo, and we've finally started using the Storage Write API now that proto files are no longer needed.
Anyway, I don't want to sidetrack the issue. Thanks!
What you're trying to do
I am migrating my code from the tabledata.insertAll API to the Storage Write API, as I found that tabledata.insertAll will occasionally insert the same data twice. I implemented it with CommittedStream as the documentation and code examples show; however, I have some trouble handling errors. What I am trying to do is ignore the errors the docs say can be ignored, while catching the errors that should not be ignored.
What code you've already tried
Here is the code of my implementation.
Any error messages you're getting
PendingWrite.getResult will contain rowErrors and error properties; the two documented errors come in the error prop. Here is the screenshot from when I produced the error intentionally with the same offset value. I tried to decode the Buffer in error.details with toString, and indeed I found the ALREADY_EXISTS keyword. However, the error.code is 6, which is different from what I found in the source code. Now I don't have any idea what the error code will be for an OUT_OF_RANGE error, or where to find the correct error code list.
Additional questions
Apart from the error code mismatch issue above, I am also not sure about the error handling implementation and the offset manipulation, due to the lack of sample code or documentation:
Will the error like OFFSET_ALREADY_EXISTS be thrown somewhere? Or do I have to check the result for each PendingWrite?
Do I need to control the offset on my application side instead of passing an empty parameter?