sparkfun / phant

the data logging engine behind data.sparkfun.com

Records should store an '_id' field #103

Closed dpjanes closed 6 years ago

dpjanes commented 10 years ago

Issue: When there are readers and writers on a stream at the same time, you cannot guarantee that readers using 'offset' to page through chunks of data will not receive duplicates.

Background: Consider a large data stream that has to be read in pages. Let's say 50 results are returned in the first page. While that chunk is being processed, another writer adds 10 records. When the reader goes to get the next chunk (starting at offset 50), the new records have shifted everything down, so the first 10 results in the next page will be duplicates from the previous page. The 'timestamp' field is not guaranteed to be unique, so there's no real way to avoid this.

Proposed solution: Add an ID field called '_id' whose value is a monotonically increasing identifier, assigned automatically. One possibility for this value is "%08x-%08x" % ( current time in seconds, server session incremented value ), though other schemes are reasonable.
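To make the proposal concrete, here is a minimal sketch of that scheme in Node.js; `nextId` and the session counter are illustrative names, not phant internals:

```js
// Seconds since the epoch plus a per-server-session counter,
// both zero-padded to 8 hex digits to match "%08x-%08x".
var sessionCounter = 0;

function pad8(hex) {
  return ('00000000' + hex).slice(-8);
}

function nextId() {
  var seconds = Math.floor(Date.now() / 1000).toString(16);
  var count = (sessionCounter++).toString(16);
  return pad8(seconds) + '-' + pad8(count);
}

console.log(nextId()); // e.g. "53a1b2c3-00000000"
```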

toddtreece commented 10 years ago

@dpjanes the timestamp is stored in milliseconds, so it's highly unlikely that the timestamp wouldn't be unique on a row, given the extra time it takes to process each request. The reason I didn't add a separate ID originally is that it takes up extra space for every row, and I wanted to keep the row size as small as possible. If you think it's still essential, then we can add it, but it seemed like overkill to me. Let me know what you think.

dpjanes commented 10 years ago

I noticed this because I was getting duplicate timestamps on my computer. You could, as an alternative, define that all timestamps have to be unique, but that is not only a hack, it could also lead to other issues.

dpjanes commented 10 years ago

It could be a configuration setting?

toddtreece commented 10 years ago

@dpjanes I wasn't talking about forcing uniqueness, I was just saying that under normal use it doesn't seem like it would be an issue. A config setting sounds like a good option, and I'm guessing a v1 GUID would work for the actual ID value.
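A v1 GUID is timestamp-based, so IDs generated on one server sort roughly by creation time. A minimal sketch, assuming the node-uuid package (the common choice at the time) and a hypothetical stream field:

```js
// npm install node-uuid
var uuid = require('node-uuid');

var row = {
  timestamp: Date.now(),
  temp: 21.4,        // hypothetical stream field
  _guid: uuid.v1()   // e.g. "6c84fb90-12c4-11e1-840d-7b25c5ee775a"
};

console.log(row._guid);
```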

dpjanes commented 10 years ago

There should be a way that you are guaranteed to download the data exactly as it is stored on the server. If GUIDs are used, you don't get an ordering when there's a timestamp collision.

toddtreece commented 10 years ago

I think the way to guarantee the order is to download the full file. I think an incrementing numeric ID is a good idea, but I'm not sure of the best way to ensure a solid incrementing ID across clustered instances of phant, or multiple phant servers behind a load balancer. It could be done with something like the INCR feature of Redis, but that would require a separate implementation for the most basic phant configurations (since we don't want any external dependencies).
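For illustration, a minimal sketch of that Redis approach, assuming the node redis client; the key name is hypothetical:

```js
// npm install redis
var redis = require('redis');
var client = redis.createClient();

// INCR is atomic, so every phant instance sharing this Redis gets a
// unique, monotonically increasing value per stream.
function nextRowId(streamId, callback) {
  client.incr('phant:id:' + streamId, callback);
}

nextRowId('my_stream', function (err, id) {
  if (err) throw err;
  console.log('next _id:', id); // 1, 2, 3, ... across all servers
});
```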

I'm not sure why GUID wouldn't work. You could just go "give me records 0-10", then grab the last GUID from that set and say "give me the next 10 records starting at this GUID". The server will always return them in order since it's just a stream of the stored data.
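A minimal sketch of that cursor-style loop from the client side; the `after` and `limit` query parameters are hypothetical here, not documented phant options:

```js
// npm install request
var request = require('request');

function fetchAll(baseUrl, after, rows, done) {
  var url = baseUrl + '?limit=10' + (after ? '&after=' + after : '');
  request({ url: url, json: true }, function (err, res, page) {
    if (err) return done(err);
    if (!page || page.length === 0) return done(null, rows); // no more data
    var last = page[page.length - 1];
    fetchAll(baseUrl, last._guid, rows.concat(page), done); // resume after last GUID
  });
}

fetchAll('https://data.sparkfun.com/output/PUBLIC_KEY.json', null, [], function (err, rows) {
  if (err) throw err;
  console.log('downloaded', rows.length, 'rows in order');
});
```

Because each request resumes after a specific record rather than at a numeric offset, writes that land between requests cannot cause duplicates.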

brennen commented 10 years ago

Yeah, and I think beyond that, if you need some other form of sequence guarantee within the same ms (like if you plan to later store stream data elsewhere in some non-ordered fashion, I guess), clients can increment some rotating counter and push that as a field for later use. I concur that providing an absolutely guaranteed monotonically increasing ID field is probably not worth the extra complexity and overhead.
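A minimal sketch of that client-side counter; the `seq` field name and wrap bound are hypothetical:

```js
// Rotating counter pushed as an extra field alongside each reading,
// so rows sharing a millisecond can still be ordered later.
var seq = 0;

function makeRow(reading) {
  seq = (seq + 1) % 1000; // wraps at an arbitrary bound
  return { temp: reading, seq: seq };
}

console.log(makeRow(21.4)); // { temp: 21.4, seq: 1 }
```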

dpjanes commented 10 years ago

I think the _guid + "get things after this _guid" option works. My main concern is being able to download the complete stream in order, and I think this basically satisfies that without adding any assumptions about what an ID would hold.

toddtreece commented 10 years ago

Sounds good. I'll add it to the next release.

liudr commented 10 years ago

I am pretty much a novice in this scene, so by "chunk" do you mean the page=number option with the data download? If I understand correctly, the pages are static once a page fills: say page 1 is being filled, then page 2 will always be the same whenever you download it, instead of being a rolling page that is one page older than the most recent upload. Also, when the data exceeds the limit, it is deleted one page at a time, correct? Is there any way to download more specific ranges than one page at a time, say, m pages at a time, or data uploaded between timestamps x and y? I don't know how to combine two pages with JavaScript, still learning. Never liked the loose-cannon format or syntax in and around 2000.

Thanks.
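For the page-combining question, a minimal sketch using phant's documented page parameter; the PUBLIC_KEY placeholder and the request package are assumptions:

```js
// npm install request
var request = require('request');

var base = 'https://data.sparkfun.com/output/PUBLIC_KEY.json';

request({ url: base + '?page=1', json: true }, function (err, res, page1) {
  if (err) throw err;
  request({ url: base + '?page=2', json: true }, function (err, res, page2) {
    if (err) throw err;
    var combined = page1.concat(page2); // page 1 holds the newest rows
    console.log(combined.length, 'rows combined');
  });
});
```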

bboyho commented 6 years ago

Phant is No Longer in Operation

Unfortunately, Phant, our data-streaming service, is no longer in operation and has been discontinued. The system has reached capacity and, like a less-adventurous Cassini, has plunged conclusively into a fiery and permanent retirement. There are several other maker-friendly data-streaming services and IoT platforms available as alternatives. The three we recommend are Blynk, ThingSpeak, and Cayenne. You can read our blog post (https://www.sparkfun.com/news/2413) on the topic for an overview and helpful links for each platform.