josephg / editing-traces

Real world text editing traces for benchmarking CRDT and Rope data structures

Timestamps on concurrent traces #7

Closed: josephg closed this issue 7 months ago

josephg commented 8 months ago

I want to add a simple timestamp field on concurrent editing traces because some systems store & forward this information.

josephg commented 7 months ago

@ept & others: Do you have an instinct for the resolution of timestamp data? I recorded another multi user editing trace yesterday and recorded timestamps for all edits at millisecond precision, but I’m worried about leaking personally identifiable typing signature information.

How does second precision sound? Obviously implementations can use a lower precision when importing events …
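As a rough illustration of the precision question, here is a minimal sketch of reducing millisecond timestamps to whole seconds, and to coarser buckets on import. It assumes timestamps are integer Unix times in milliseconds; the function name `coarsen_timestamp` and the sample values are hypothetical and not part of the trace format.

```python
def coarsen_timestamp(ms: int, resolution_s: int = 1) -> int:
    """Round a millisecond Unix timestamp down to a multiple of
    `resolution_s` seconds and return it in whole seconds.

    Assumption: timestamps are non-negative integers in milliseconds.
    """
    if resolution_s <= 0:
        raise ValueError("resolution_s must be a positive number of seconds")
    seconds = ms // 1000                      # drop sub-second typing rhythm
    return seconds - (seconds % resolution_s) # snap to the chosen bucket


# Hypothetical recorded values (milliseconds since the Unix epoch).
recorded_ms = [1_700_000_000_123, 1_700_000_000_870, 1_700_000_065_400]

per_second = [coarsen_timestamp(t) for t in recorded_ms]
per_minute = [coarsen_timestamp(t, 60) for t in recorded_ms]
```

The first two sample values land in the same one-second bucket, which is the point: the interval between those two keystrokes is no longer recoverable from the stored data.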

ept commented 7 months ago

Automerge stores timestamps with 1-second resolution by default. It's a fairly arbitrary choice, based on the fact that I couldn't imagine any legitimate use for higher-resolution timestamps (and indeed I think there has been research showing that high-resolution key timing information can identify who is typing). For history visualisation purposes, even 1-minute resolution would probably be fine, but I don't think 1-second resolution will do any harm (other than slightly increasing the file size).

How do you want to store the timestamp in your trace format? The automerge-paper trace, which you currently have listed as sequential, actually originally included some concurrency, and also timestamps (I just flattened out the concurrency and removed the timestamps to make it easier to work with). I thought I could also extract the concurrent+timestamped version of the same trace in your format from the original data files.

josephg commented 7 months ago

Cool, I’ll go with 1 second timestamps then.

The most straightforward idea might be to put timestamps on each patch. So instead of [pos, delete_len, ins_content] have [pos, delete_len, ins_content, timestamp]. It’s a lot of fields for a list like that - I inherited the format from your automerge-perf repo.

The other approach would add a timestamp on each transaction object (which internally stores a run of changes from an agent). But that’d make it so all the patches inside each of those objects would implicitly have the same timestamp. And I don’t think that really makes sense. We have the timestamps. I think it’s better to leave it up to the users of the data to decide what granularity they want to store / use when benchmarking.
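To make the per-patch option concrete, here is a minimal sketch of replaying patches of the form `[pos, delete_len, ins_content, timestamp]` and of discarding the timestamp for consumers that do not need it. It assumes positions count Unicode characters and timestamps are integer seconds; the helper names and sample patches are hypothetical.

```python
from typing import List, Tuple

# (pos, delete_len, ins_content, timestamp)
TimedPatch = Tuple[int, int, str, int]


def apply_patches(doc: str, patches: List[TimedPatch]) -> str:
    """Replay patches in order; the timestamp does not affect the result."""
    for pos, delete_len, ins_content, _timestamp in patches:
        doc = doc[:pos] + ins_content + doc[pos + delete_len:]
    return doc


def drop_timestamps(patches: List[TimedPatch]) -> List[Tuple[int, int, str]]:
    """Return the patches in the older three-field form."""
    return [(pos, delete_len, ins) for pos, delete_len, ins, _ in patches]


# Hypothetical sample data, not taken from any trace in the repository.
patches: List[TimedPatch] = [
    (0, 0, "hello", 1_700_000_000),
    (5, 0, " world", 1_700_000_001),
    (0, 5, "goodbye", 1_700_000_004),
]

final_doc = apply_patches("", patches)
untimed = drop_timestamps(patches)
```

Because the timestamp is a trailing field, a benchmark that ignores it can read the first three fields and replay the trace unchanged.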

ept commented 7 months ago

Yeah I was also wondering whether to put it on the transaction or the patch. Fine with putting it on the patch as you suggest.

josephg commented 7 months ago

Yeah I think that’s better given the transactions are broken up somewhat arbitrarily - at least for text objects. And it should work for both the concurrent editing trace format and the fully sequential traces, since they share a patch format.

People using these data sets can always drop some of the data. But we can’t add it back.

If that works for you I’ll add timestamps like that then. And add them for my new data set.

josephg commented 7 months ago

Alright; I've made that change. I've also retroactively changed the format of the sequential traces to also have a timestamp on each patch, and converted all the existing traces into this format.

@streamich Hope this isn't too disruptive of your code!

The new clownschool editing trace contains per-character timestamps now as well, which is quite nice.

streamich commented 7 months ago

Thanks for adding the "clownschool" trace. It does indeed look to have a lot of concurrency in it, which is nice to have.


Regarding adding the timestamp to the patch instead of the transaction, I'm not sure. I currently don't have a use case for timestamps from these traces, but, in general, the "transaction" has it right there in the name, at least to me. I would expect a "transaction" to be an atomic unit of change: it applies as a whole and instantly. At least, that is how I was thinking about these traces. And "patches" are like multiple "operations" within one transaction, for example, if you made a change using multiple simultaneous cursors.

But, again, I currently don't have a use case for timestamps. How do you use the timestamps?


I will have many concurrent editing traces soon, as there will be an easy way to create and store them on the json-joy site. But they will be in JSON CRDT Patch format. I am wondering whether it would be worthwhile for json-joy to also support the FRH trace format, though I'm not sure what the use case would be.

If json-joy were to support the FRH format, the format would need to be more complex than the concurrent trace format in this repo: (1) there should be a binary version; (2) it should also support JSON operations (in addition to text).

Is there any appetite for having a common FRH format? What are the benefits of storing data in it?

josephg commented 7 months ago

I’m not using the timestamps either; but @ept asked if we can add them because automerge stores timestamps.

Re: transactions, I hear what you’re saying and I find your argument pretty convincing. I don’t have a strong opinion about this because I don’t store transaction boundaries in diamond types at the moment anyway. I rely on synchronous editing locally and when peers send changes to one another, remote changes are merged “all at once” so there’s no possibility for an application to see nonatomic changes. Atomicity boundaries also don’t mean a lot for text editing. I might eventually add them for json editing.

But in the clown school trace I can figure out user editing transaction boundaries by looking at timestamps and put that information into the trace. That would also mean the sequential traces don’t need to change. The data format would be a bit bigger as a result (since we’d have more transaction objects) but that’s fine.
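One way to derive transaction boundaries from per-patch timestamps is to start a new transaction whenever the gap to the previous patch exceeds a threshold. The sketch below is only an illustration of that idea, not the method used to produce the published traces; the `time` and `patches` field names, the `max_gap_s` parameter, and the sample data are all assumptions. It also assumes the patches come from a single agent and that timestamps are non-decreasing integer seconds.

```python
from typing import Dict, List, Tuple

TimedPatch = Tuple[int, int, str, int]  # (pos, delete_len, ins_content, timestamp)


def group_into_transactions(
    patches: List[TimedPatch], max_gap_s: int = 0
) -> List[Dict]:
    """Group consecutive patches whose timestamps are at most `max_gap_s`
    seconds apart. Each group gets the timestamp of its first patch."""
    txns: List[Dict] = []
    last_ts = None
    for pos, delete_len, ins_content, ts in patches:
        if last_ts is None or ts - last_ts > max_gap_s:
            txns.append({"time": ts, "patches": []})  # boundary: open a new txn
        txns[-1]["patches"].append([pos, delete_len, ins_content])
        last_ts = ts
    return txns


# Hypothetical sample: two edits in the same second, then one two seconds later.
sample: List[TimedPatch] = [(0, 0, "a", 10), (1, 0, "b", 10), (2, 0, "c", 12)]

same_second = group_into_transactions(sample)            # max_gap_s = 0
short_pause = group_into_transactions(sample, max_gap_s=2)
```

With `max_gap_s=0` only patches sharing a timestamp are grouped, so the sample yields two transactions; widening the gap to two seconds merges them into one.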

What do you think? If that format works better for you, and assuming @ept doesn’t mind, I’ll switch to timestamps on every transaction. It doesn’t affect my code.

And re: having a common format, I think that’s a great idea. Wanna start a fresh issue to talk about it? I’m already using a different json format for exported json data in diamond types with an agent id. Might be lovely to standardise that!

Btw, sorry about the churn but after thinking about it a bit I decided to go with the name REG (replayable event graphs) instead of FRH. I think it’s a better name - it’s easier to say, and it’s more descriptive of what’s going on. I hope that works for you!

ept commented 7 months ago

I don't have a strong opinion on how it should be structured, but if the idea is to evolve the format to support more than plain text editing (JSON, rich text, etc), then it would probably make most sense to place the timestamp on the transaction. For the current text editing traces, each transaction would then only contain a single patch/operation, but once you generalise the model to other datatypes, it's likely that there will be transactions with many operations (for example, in a JSON document representing a vector graphics drawing, a user may select multiple objects and change all of their positions at once by dragging them). I would represent that update as a single transaction, with a single timestamp, containing multiple patches/operations. This would also closely match the way Automerge does things.
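The drawing example above can be sketched as a transaction carrying one timestamp and several operations that are applied together. This is a hypothetical model for illustration only: the `MoveOp` and `Transaction` types, their fields, and the sample values are assumptions, and do not describe the trace format in this repository or Automerge's internal representation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class MoveOp:
    object_id: str
    dx: int
    dy: int


@dataclass(frozen=True)
class Transaction:
    agent: str
    time: int                 # one timestamp for the whole transaction (seconds)
    ops: Tuple[MoveOp, ...]   # many operations may share that timestamp


def apply_transaction(
    positions: Dict[str, Point], txn: Transaction
) -> Dict[str, Point]:
    """Apply every op to a copy, so callers see all of the changes or,
    if an op fails (unknown object_id raises KeyError), none of them."""
    updated = dict(positions)
    for op in txn.ops:
        x, y = updated[op.object_id]
        updated[op.object_id] = (x + op.dx, y + op.dy)
    return updated


# Hypothetical drag of two selected objects by the same offset.
before = {"circle": (0, 0), "square": (10, 5)}
drag = Transaction(
    agent="alice",
    time=1_700_000_000,
    ops=(MoveOp("circle", 3, 4), MoveOp("square", 3, 4)),
)
after = apply_transaction(before, drag)
```

Working on a copy keeps the input untouched, which mirrors the expectation that a transaction applies as a whole rather than operation by operation.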

Regarding why we collect the timestamps: we want Automerge to support version-control-like use cases, where you use the editing history to visualise e.g. what your colleague changed during the week you were on vacation, or to recover an old version of the document if you change your mind about the way it was edited. Some of our vision for this can be found in our Upwelling paper. So far Automerge doesn't really have good APIs for exposing this kind of functionality, but Ink&Switch is about to spin up a project which will take a closer look at version control and the underlying CRDT functionality it requires. Our thinking is that it doesn't take too much extra space to store the timestamps, and it enables a whole bunch of interesting possibilities for edit history visualisation.

josephg commented 7 months ago

Cool, in that case I'll revert the change to the sequential editing traces and move timestamps onto each transaction for concurrent editing traces too. The JSON files will be a bit bigger, but that's not a major concern for these data sets.

josephg commented 7 months ago

Done! I've updated the concurrent editing traces such that each transaction contains a single timestamp and exactly 1 user facing editing event. The clownschool.json trace contains several transactions with multiple edited characters.

I've also re-exported the "flattened" versions of these concurrent traces in the same style.

For consistency, it'd be nice to add per-operation timestamps on the automerge-perf trace as well. Timestamps were captured in the repository - we just need to parse them back out.