optakt / flow-dps

Flow Data Provisioning Service
Apache License 2.0

Ignore invalid utf8 chars in GCP decoding #539

Closed sideninja closed 1 year ago

sideninja commented 1 year ago

The GCP streamer failed to decode block data that contained invalid UTF-8 characters. The explanation of why that data was included is below (copied from Slack):

I’ve looked into this, and the issue seems to be caused by the transaction with ID f6c8e65646a3b140902aa7559ae2e740bbe92fbef65f414a441c141340a5756f, more specifically by the last argument of that transaction. If you look closely, the whitespace before the address is not actual whitespace but an invalid UTF-8 character; on further investigation I found it is a BOM character: https://en.wikipedia.org/wiki/Byte_order_mark The problem is then that CBOR encoding/decoding assumes UTF-8 validity, which breaks in this case. Because this is the first time (to my limited knowledge) that tx args are CBOR encoded/decoded, and since tx args are provided by user input, a user can make the RN node fail with the current settings. Setting the CBOR decoding flag that allows invalid UTF-8 characters would fix this, and now that I understand the issue I feel it would be an acceptable fix. So this is not a bug in the uploader producing malformed data; it is invalid input from the user.

This PR allows such characters during GCP decoding and thus avoids the failure.
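
To illustrate the difference between strict and lenient handling described above, here is a minimal sketch in Go. It is not the code from this PR and does not use the project's CBOR library: `decodeText` and its `allowInvalidUTF8` parameter are hypothetical names, and the decoder only handles a single short CBOR text string (major type 3, length below 24). The sample payload uses the byte `0xff`, which can never appear in valid UTF-8, as a stand-in for the offending character.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// decodeText is a hypothetical helper that decodes one short CBOR text string.
// When allowInvalidUTF8 is false it rejects payloads that are not valid UTF-8,
// which mirrors a strict decoder; when true it returns the bytes unchanged.
func decodeText(data []byte, allowInvalidUTF8 bool) (string, error) {
	if len(data) == 0 {
		return "", errors.New("empty input")
	}
	// The initial byte carries the major type (high 3 bits) and,
	// for short items, the payload length (low 5 bits).
	major, length := data[0]>>5, int(data[0]&0x1f)
	if major != 3 || length >= 24 {
		return "", errors.New("unsupported item: only short text strings are handled")
	}
	if len(data)-1 < length {
		return "", errors.New("truncated payload")
	}
	payload := data[1 : 1+length]
	if !allowInvalidUTF8 && !utf8.Valid(payload) {
		return "", errors.New("invalid UTF-8 in text string")
	}
	return string(payload), nil
}

func main() {
	// 0x63 announces a text string of length 3; the first payload byte is invalid UTF-8.
	item := []byte{0x63, 0xff, '0', 'x'}

	_, err := decodeText(item, false)
	fmt.Println("strict decode error:", err)

	s, err := decodeText(item, true)
	fmt.Printf("lenient decode: %q, error: %v\n", s, err)
}
```

The trade-off is that a lenient decoder passes the raw bytes through, so any downstream code that assumes valid UTF-8 strings has to tolerate them.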
