influxdata / docs.influxdata.com-ARCHIVE

ARCHIVE - 1.x docs for InfluxData
https://archive.docs.influxdata.com/
MIT License
252 stars 292 forks source link

What exactly is the story on nulls? #717

Closed dandv closed 7 years ago

dandv commented 7 years ago

This PR suggests that you shouldn't be able to insert nulls in the database. https://www.influxdata.com/get-started/influxdb-command-line-basics/ also says "null values aren't stored". Bloggers have picked up on this, after questions on SO have remained unanswered.

However, it's very easy to insert nulls by omitting fields in HTTP requests,

curl -i -XPOST 'http://localhost/write?db=NOAA_water_database' --data-binary 'h2o_feet,location=github water_level=42'

or when importing data:

h2o_feet,location=coyote_creek water_level=1

Both insert nulls in InfluxDB 1.0.0.

https://docs.influxdata.com/influxdb/v1.0/tools/api/ also shows plenty of null values.

A sane feature request, Add boolean operator to check for null values (and its inverse, i.e. not null) was closed without explanation.

Given that DELETE FROM WHERE field=... doesn't work, the inability to handle nulls makes it super cumbersome to SELECT INTO in order to strip bad points from a series, if your points contain nulls.

dandv commented 7 years ago

FWIW, the competing Axibase does support NULLs.

beckettsean commented 7 years ago

@dandv those nulls are not stored in the system, they are simply filled in when the JSON is returned. There is literally no way to check the data on disk for nulls, as there just isn't anything at all stored in that event.

Both insert nulls in InfluxDB 1.0.0.

No, they don't. I'm not sure why you think that they do, but there are no nulls stored in the system.

Given that DELETE FROM WHERE field=... doesn't work, the inability to handle nulls makes it super cumbersome to SELECT INTO in order to strip bad points from a series, if your points contain nulls.

Perhaps you can describe what you are trying to do, with syntax, and we can help? SELECT INTO won't remove data.

desa commented 7 years ago

@dandv I think the point of confusion is coming from the changes in our storage engine.

In version 0.9, we used to support nullifying a result by writing to a different field. For example, if I inserted the point

serieskey,some=tag x=1,y=2 1

but then later decided that I wanted to redact the fields x and y, I could insert another point

serieskey,some=tag deleteme=true 1

If I then queried for this data, I would get the following.

> select * from serieskey where some = 'tag'
name: serieskey
---------------
time    deleteme    some    x   y
1   true        tag     

The reason why this worked was that data used to be store in rows, so reinserting a point in a row without specifying the other fields would nullify values.

When we changed our storage engine, we stopped storing rows and started storing columns. This meant that the tactic highlighted above no longer works. And querying for that data would result in all of the values written being stored.

> insert serieskey,some=tag deleteme=true 1
> select * from serieskey where some = 'tag'
name: serieskey
---------------
time    deleteme    some    x   y
1   true        tag 1   2

As @beckettsean notes, the database does not store null values. By simply looking at the values that are returned when querying, it does indeed look like nulls are stored.

> insert serieskey,some=tag x=1,y=2 2
> insert serieskey,some=tag x=1 2
> insert serieskey,some=tag y=2 2
> insert serieskey,some=tag y=2 4
> insert serieskey,some=tag y=2 5
> insert serieskey,some=tag x=2 5
> select * from serieskey where some = 'tag'
name: serieskey
---------------
time    deleteme    some    x   y
1   true        tag 1   2
2           tag 1   2
4           tag     2
5           tag 2   2
6           tag 2

However no null values are actually stored. All fields in InfluxDB must have some value. The reason why we report null, is because no value was found.

dandv commented 7 years ago

those nulls are not stored in the system, they are simply filled in when the JSON is returned.

By simply looking at the values that are returned when querying, it does indeed look like nulls are stored.

That makes sense; thanks for the explanations. That's why I believed there were actual nulls stored.

Anyway, from a user's perspective, this is less relevant than how to check for nulls. I've described an example use case in #718. There's an irregular time series and I'm trying to remove points where a field has a bogus value with SELECT * INTO cars2 FROM cars WHERE speed != -1. The problem with that statement is that it excludes null values for speed as well as -1 values, but points with null values for speed contain other non-null values (e.g. altitude), and those points should be kept (i.e. copied into).

I would want to SELECT * INTO cars2 FROM cars WHERE speed != -1 OR speed IS NULL.

I'm still new to InfluxDB and I might be getting nulls wrong. Some guidance in the documentation regarding nulls would be appreciated.

beckettsean commented 7 years ago

A point has one to many fields. Any point in a series may have one or more of the fields defined on that series. If a point does not have a value for a particular field, it simply doesn't have that field. It's not that there is a null for that field, it's that the field and its value don't exist on that point.

SELECT * INTO cars2 FROM cars WHERE speed != -1 The problem with that statement is that it excludes null values for speed as well as -1 values

That statement can only return points where the field key speed exists. Any other point will be ignored, because speed is undefined and therefore cannot be evaluated. Regular expressions are not valid for fields, so you also can't match SELECT * INTO cars2 FROM cars WHERE speed !~ /./, which would match all points with speed undefined.

I don't know that there is a way to search for points that don't have a field.

dandv commented 7 years ago

I don't know that there is a way to search for points that don't have a field.

How can undesired field values be filtered out of a measurement then? (I've also asked the question at https://github.com/influxdata/influxdb/issues/3210#issuecomment-249702807.)

Worse, I've found that even SELECT * INTO cars2 FROM cars doesn't produce a copy of cars, but about 30% fewer points.

desa commented 7 years ago

Whenever doing INTO queries you've got to GROUP BY * otherwise you'll end up getting collisions.

Just running SELECT * INTO cars2 FROM cars will turn any tags into field. If you reduce the number of tags that are in a series, points with similar timestamps will collide. If you do SELECT * INTO cars2 FROM cars GROUP BY * they'll remain tags and everything should work appropriately.

How can undersired field values be filtered out of a measurement then? (I've also asked the question at influxdata/influxdb#3210 (comment).)

By undesired field values, do you mean they have some value, but not the correct value and you'd like to delete them?

Like in the following example:

insert cpu,host=server1 value=10
insert cpu,host=server1 value=1000
insert cpu,host=server1 value=99

And I decided that I don't want to keep and of the values above 100?

If so then

> select * into mem from cpu where value < 100 group by *
name: result
------------
time            written
1970-01-01T00:00:00Z    2

Followed by

> drop series from cpu

Should give me what I want

> select * from mem
name: mem
---------
time                host    value
2016-09-26T21:47:47.902876316Z  server1 10
2016-09-26T21:47:48.617861161Z  server1 99
desa commented 7 years ago

That being said the drop series, operation can be expensive.