Open rusnackor opened 3 months ago
InfluxDB is a columnar, schema-less database. There is no way for InfluxDB to know when a point is fully written, because new fields can be added at any time. When processing a write operation, each field is written separately (which is how fields can be added to a point via multiple writes).
Writes to different fields for the same point can happen at different times. Here are two points being inserted. One has two values written in a single `INSERT` statement, then the other is written with two `INSERT` statements separated by a `SELECT`, and then the first point has a third field written.
As a user, if you add all fields of a point in one write operation, you can be assured that all fields are written when that operation finishes and returns a success code. So you could query by row ID only after you are sure that the write for that row ID has completed. In your filter you could then say something like
<omitted code>
|> filter(fn: (r) => r["row_id"] > {id_start} and r["row_id"] <= {id_last_written})
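A minimal sketch of that bookkeeping, assuming the writer can observe when each write call returns success (the names and structure here are illustrative, not an official client API): the writer advances a "confirmed" high-water mark only after a write succeeds, and the reader only ever queries ids at or below that mark.

```python
# Sketch: the reader only asks for row_ids up to the last id whose write has
# completed and returned success. "confirmed_id" stands in for whatever
# mechanism reports a successful write; nothing here is InfluxDB-specific.

confirmed_id = 0   # advanced only after a write returns success
last_read_id = 0   # reader's own high-water mark

def on_write_success(row_id: int) -> None:
    """Called by the writer once a write for row_id has returned success."""
    global confirmed_id
    confirmed_id = max(confirmed_id, row_id)

def next_read_window() -> tuple[int, int]:
    """Return the (id_start, id_last_written] window for the next query."""
    global last_read_id
    window = (last_read_id, confirmed_id)
    last_read_id = confirmed_id
    return window

on_write_success(10)
on_write_success(25)
assert next_read_window() == (0, 25)   # first poll reads ids 1..25
on_write_success(30)
assert next_read_window() == (25, 30)  # next poll reads only the new ids
```

The two ends of the window map directly onto `{id_start}` and `{id_last_written}` in the filter above.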
> INSERT foo,tagone=t1 v1=34,v2=35 1721684170301626470
> select * from foo
name: foo
time tagone v1 v2
---- ------ -- --
1721684170301626470 t1 34 35
>
> INSERT foo,tagone=t2 v1=13 1721684170301626480
> select * from foo
name: foo
time tagone v1 v2
---- ------ -- --
1721684170301626470 t1 34 35
1721684170301626480 t2 13
>
> INSERT foo,tagone=t2 v2=15 1721684170301626480
> select * from foo
name: foo
time tagone v1 v2
---- ------ -- --
1721684170301626470 t1 34 35
1721684170301626480 t2 13 15
>
> INSERT foo,tagone=t1 v3=67 1721684170301626470
> select * from foo
name: foo
time tagone v1 v2 v3
---- ------ -- -- --
1721684170301626470 t1 34 35 67
1721684170301626480 t2 13 15
>
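The session above can be mimicked with a tiny in-memory model of field-at-a-time writes (purely an illustration of the observable behaviour, not InfluxDB's actual storage engine): rows are keyed by (timestamp, tag) and each write merges its fields into the existing row, so a reader between two writes sees a partial row.

```python
# Toy model: rows keyed by (timestamp, tag); each insert merges its fields
# into the existing row, mirroring the CLI session above.

table: dict[tuple[int, str], dict[str, int]] = {}

def insert(tag: str, ts: int, **fields: int) -> None:
    table.setdefault((ts, tag), {}).update(fields)

insert("t1", 1721684170301626470, v1=34, v2=35)   # one write, two fields
insert("t2", 1721684170301626480, v1=13)          # first write for t2
# A reader at this moment sees t2 with v2 still missing (null):
assert "v2" not in table[(1721684170301626480, "t2")]
insert("t2", 1721684170301626480, v2=15)          # second write fills v2
insert("t1", 1721684170301626470, v3=67)          # third field added later
assert table[(1721684170301626470, "t1")] == {"v1": 34, "v2": 35, "v3": 67}
```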
Hello, I have already been through this issue on influxdb-client-python, but from the communication with Jakub and additional tests, it seems to be an issue with the DB itself, not the client. Here is the link to the related issue on the python-client page: https://github.com/influxdata/influxdb-client-python/issues/662
Steps to reproduce:

1. We have a measurement with 1 `tag` and 7 `field` "columns". One of the `field`s is a unique sequence integer. It is generated externally, since InfluxDB does not have such a capability, and added to the data points before sending the batch of data points to the DB (python-client).
2. There is no `stop` time given in the query, just a `start` time. We are basically querying "give me the data that were added since my last query". To receive every datum only once, I use `|> filter(fn: (r) => r["row_id"] > <last_received_id>)`.
3. Even though there are no `null` values in the DB, sometimes the "reader" receives `null` in a random `field` value of a random data point. Often more than one `null` column is present, but it is never the `tag`.

Full query:
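The full query itself is not shown here. As a hedged sketch, based only on the fragments that do appear in this issue (the `row_id` filter and the mention of `pivot`), a helper that assembles a query of roughly that shape might look like this; the bucket and measurement names are placeholders, not taken from the original setup:

```python
# Hypothetical reconstruction of the reader's query shape. Only the row_id
# filter and the use of pivot come from the issue itself; everything else
# (bucket, measurement, range start) is a placeholder.

def build_reader_query(bucket: str, measurement: str, last_received_id: int) -> str:
    return "\n".join([
        f'from(bucket: "{bucket}")',
        '  |> range(start: 0)',
        f'  |> filter(fn: (r) => r["_measurement"] == "{measurement}")',
        '  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")',
        f'  |> filter(fn: (r) => r["row_id"] > {last_received_id})',
    ])

print(build_reader_query("mybucket", "mymeasurement", 41))
```

The `row_id` filter is placed after `pivot` because the snippet in step 2 treats `row_id` as a pivoted column rather than as a raw `_field`/`_value` pair.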
What we have observed so far is that those nulls happen very close to the `now()` date, and Jakub suggested that it is caused by the `pivot` function. To test that, I added a `stop` date, i.e. a delay. Jakub suggested this: `|> range(start: _start, stop: -10ms)`. I ran a test with a 1-minute delay, where the `start` and `stop` datetimes were calculated in Python and passed to the query as parameters. This works: once I added this delay in reading, I haven't seen any `null` values extracted from the DB. (It was about 5 nulls per hour before the delay; no nulls were observed after 24 h of running the test with the delay.)

BUT, this is not a solution for us. The data we are receiving (and writing) are not chronologically ordered, and it is impossible to have them chronologically ordered (on input). We also need to extract each new data point exactly once, not more; otherwise the next layer above us will see duplicates, and that will cause errors. That is why we added the `row_id` for filtering. And because we also use `row_id` to filter data, it is possible that we will lose data if we add a `stop` time in the `range()` function. Basically, a chronologically "younger" data point that is excluded by the `stop` parameter can have a lower `row_id`, because it was received earlier than "older" data points. In that case, this data point will be excluded in the first query by the `stop` range parameter. Then, in the next query, it will be filtered out again by the `r["row_id"] > {id_start}` filter, and we will never extract it from the DB. (I don't want to go into too much detail about this here, since it is a different topic, but I can explain it in the comments if asked.)

Expected behaviour: Once data points are written, they are fully usable. So if I ask the DB "give me all data from 1 min ago until now", it will give me all data that are fully available. I know that the internal structure of the DB has each `field` assigned to its `tag` separately. But if I write a data point with 7 `field` values, none of those 7 `field` values should be available for reading until all of them are "processed" correctly. It should be one transaction.

Actual behaviour: If I write a data point with 7 `field` values, some of the values are available for reading before others. And if I read the data point "too fast", I can receive only partial information for this data point. This is unacceptable. It would be no issue if I did not receive this data point at all, since it is not yet fully "processed". But returning partial data is causing big issues in our implementation.

Environment info:
Config: no modifications to the config.
Logs: Unfortunately, I cannot share any details publicly, because of corporate cybersecurity rules. However, I have permission to share details needed to reproduce the issue privately (direct messages on Slack, for example).
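For completeness, the data-loss scenario described above can be reduced to a small, self-contained simulation (the timestamps and ids here are made up): a point that arrives early (low `row_id`) but carries a late timestamp is first excluded by the `stop` bound, and on the next poll it is excluded by the `row_id` filter, so it is never delivered.

```python
# Toy reproduction of the described data loss: points carry (row_id, ts);
# the reader filters by a time range AND by row_id > last_seen_id.

points = [
    {"row_id": 1, "ts": 100},
    {"row_id": 2, "ts": 300},  # received early (low id) but "young" timestamp
    {"row_id": 3, "ts": 150},
]

def poll(start_ts, stop_ts, last_id):
    """Return points in [start_ts, stop_ts) with row_id above last_id."""
    batch = [p for p in points
             if start_ts <= p["ts"] < stop_ts and p["row_id"] > last_id]
    new_last = max([p["row_id"] for p in batch], default=last_id)
    return batch, new_last

# First poll: the stop bound excludes ts=300, but row_id 3 IS returned,
# so last_id jumps past the excluded point's row_id 2.
batch1, last_id = poll(0, 200, 0)
assert {p["row_id"] for p in batch1} == {1, 3}

# Second poll covers ts=300, but row_id 2 <= last_id (3): never returned.
batch2, last_id = poll(200, 400, last_id)
assert batch2 == []
```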