trellis-ldp / trellis

Trellis is a platform for building scalable Linked Data applications
https://www.trellisldp.org
Apache License 2.0

Question: Parallelization and Memento (in particular) #289

Closed atz closed 5 years ago

atz commented 5 years ago

To what extent can Trellis be parallelized, i.e., with multiple systems running against the same persistence layer? Using trellis-ext-db seems to get part of the way there.

My outstanding question regards /opt/trellis/data/mementos. In the dockerized Trellis docs (with external DB), it seems to be the only persistence that needs both:

My understanding of Memento is limited. The RFC defines a number of interactions between client and server, but my question is "below" the server level, since the client should be unaware of whether the app is parallelized or not. My reading is that the RFC would certainly be violated if two systems write to the same resource/memento and, because their clocks differ, the second write is stamped as earlier than the first. There doesn't seem to be any flexibility on the "sameness" of the server systems' perception of time. Is that correct?

If so, then using a common DB and delegating the perception of NOW to the DB could help resolve the question. I read in another issue that the MementoService can be configured to reuse the same external relational DB as the app. Two questions:

I could be missing something more architectural (like designating a single instance as TimeGate), or 100 other things, so any feedback is appreciated.
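The clock-skew concern raised above can be illustrated with a small Java sketch. It is only an illustration under stated assumptions: the class name, the `stamp` helper, and the five-second skew are hypothetical, and nothing here reflects how Trellis itself assigns Memento datetimes.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

class ClockSkewDemo {
    // Hypothetical helper: a node stamps a write using its own local clock.
    static Instant stamp(Clock nodeClock) {
        return nodeClock.instant();
    }

    public static void main(String[] args) {
        // "True" time at the first write, fixed so the example is repeatable.
        Clock trueTimeAtFirstWrite =
                Clock.fixed(Instant.parse("2019-01-01T12:00:00Z"), ZoneOffset.UTC);
        // "True" time one second later, when the second write arrives.
        Clock trueTimeAtSecondWrite =
                Clock.offset(trueTimeAtFirstWrite, Duration.ofSeconds(1));

        // Node A's clock is accurate; node B's clock runs five seconds slow (assumed skew).
        Clock nodeA = trueTimeAtFirstWrite;
        Clock nodeB = Clock.offset(trueTimeAtSecondWrite, Duration.ofSeconds(-5));

        Instant first = stamp(nodeA);   // first write, handled by node A
        Instant second = stamp(nodeB);  // second write, handled by node B

        System.out.println("first write stamped:  " + first);
        System.out.println("second write stamped: " + second);
        System.out.println("second appears earlier: " + second.isBefore(first));
    }
}
```

If each node stamps writes from its own clock, the later write can end up with the earlier datetime, which is the ordering problem described in the question.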

ajs6f commented 5 years ago

@atz I know little about the RDB backend, but from a theoretical POV, I'm not sure your reading of Memento is correct. Keep in mind that the Memento RFC does not define write behavior, so this is an area that we (and several other projects) are currently exploring with muck boots and mosquito veils.

I understand the timestamp of a Memento to be entirely determined by the server, not by client behavior in any way, so I encourage you to bring this up at the Memento Google group and let us know what results. @hvdsomp is usually very willing to answer questions and @martinklein0815 has been so kind as to check in on and speak to issues here at Trellis.

acoburn commented 5 years ago

The application that is built for the trellis-ext-db project uses the trellis-file component for both Binaries (LDP-NR) and Mementos: these resources are stored as flat files. This is a good persistence layer for single-node implementations, but for multi-node deployments, you would need a clustered filesystem to handle this properly. (NFS would also work, though it would be a performance bottleneck.)

Trellis is structured so that its core just presents a Java API, and there can be different implementations of those APIs (here, the file-based Memento storage is just a simple implementation). The interfaces are, generally, pretty simple (the resource layer is the most complex), and the memento interface is one of the simplest to implement.
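To give a sense of how small a Memento storage layer can be, here is a minimal in-memory sketch in Java. The interface, its method names, and the use of plain strings for identifiers and bodies are all hypothetical and are not the Trellis `MementoService` signatures. Returning the latest version at or before the requested datetime is one common choice and is likewise an assumption of this sketch.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical interface for illustration only; not the Trellis API.
interface SimpleMementoStore {
    void put(String identifier, Instant time, String body);
    Optional<String> get(String identifier, Instant time);
    SortedSet<Instant> mementos(String identifier);
}

class InMemoryMementoStore implements SimpleMementoStore {
    // One time-ordered history per resource identifier.
    private final Map<String, ConcurrentSkipListMap<Instant, String>> versions =
            new ConcurrentHashMap<>();

    @Override
    public void put(String identifier, Instant time, String body) {
        versions.computeIfAbsent(identifier, k -> new ConcurrentSkipListMap<>())
                .put(time, body);
    }

    @Override
    public Optional<String> get(String identifier, Instant time) {
        ConcurrentSkipListMap<Instant, String> history = versions.get(identifier);
        if (history == null) {
            return Optional.empty();
        }
        // Latest version at or before the requested datetime, if any.
        Map.Entry<Instant, String> entry = history.floorEntry(time);
        return Optional.ofNullable(entry).map(Map.Entry::getValue);
    }

    @Override
    public SortedSet<Instant> mementos(String identifier) {
        SortedSet<Instant> times = new TreeSet<>();
        ConcurrentSkipListMap<Instant, String> history = versions.get(identifier);
        if (history != null) {
            times.addAll(history.keySet());
        }
        return times;
    }
}

class MementoStoreDemo {
    public static void main(String[] args) {
        SimpleMementoStore store = new InMemoryMementoStore();
        store.put("res", Instant.parse("2019-01-01T00:00:00Z"), "v1");
        store.put("res", Instant.parse("2019-02-01T00:00:00Z"), "v2");
        // A request for mid-January falls back to the January 1 version.
        System.out.println(store.get("res", Instant.parse("2019-01-15T00:00:00Z")));
        System.out.println(store.mementos("res"));
    }
}
```

An in-memory map is of course no more suitable for multi-node use than flat files; the point is only that the same three operations could be backed by any shared store.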

While file-based persistence is useful in certain contexts, it is definitely problematic in others. One thing you might want to take a look at is the trellis-ext-aws project, which implements an S3-based MementoService, along with an S3-based BinaryService and an SNS-based NotificationService. That project is still maturing, but for multi-node (especially cloud-based) deployments, that kind of structure (avoiding files) will be much more appropriate.

acoburn commented 5 years ago

And to the point about parallelization, the Trellis HTTP layer is built in such a way that it is entirely stateless, which means that -- so long as the persistence layer(s) are external to a given web node -- there can be an arbitrary number of web nodes running. For that sort of deployment strategy, a distributed (persistence) backend clearly makes the most sense, and that's where things are heading with the Cassandra-based backend.

atz commented 5 years ago

I definitely looked at trellis-ext-aws, which would be closest to our use case, but my interpretation was that the implementation was not yet complete for production use.

@ajs6f:

I understand the timestamp of a Memento to be entirely determined by the server

Right, but with parallelization, what the RFC defines as one (logical) server is actually several. It is impossible to presume that they all always have exactly the same system time. Maybe Trellis' Memento writes happening only downstream from a queue or DB means it isn't as much of a problem as I perceived? That is a level of detailed implementation knowledge that I can't speak to.

ajs6f commented 5 years ago

@atz I was thinking of the client-server relationship between the frontend(s) (running Trellis) and the backend(s) (running some kind of persistence service, e.g. a database like Cassandra). If, as you quite rightly write, we cannot speak of a single timestamp there, there is no way to specify the behavior to the original (HTTP) client. If you look at ResourceService you will see that Trellis (very intentionally) does not assume synchronicity, and in large part that is to avoid putting constraints on the implementation.

ajs6f commented 5 years ago

@atz I'm not sure where we are at with this question, and I don't want to leave you hanging.

As we've described above, Trellis very intentionally hands questions of time directly to persistence, and webapp instances are always share-nothing, so there is no synchronization at the HTTP layer (and never will be). In that sense:

delegating the perception of NOW to the DB

(as you wrote) is done. That's how we roll now.

If there are ambiguities or gaps you see here, can you write a little more about them? These are difficult questions and I'm sure we haven't got all the corners and edges covered.
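The idea of handing "now" to persistence can be sketched in Java as follows. This is a simplified illustration, not Trellis code: `SharedPersistence` and `WebNode` are hypothetical names, the store here is an in-process object rather than a real database, and payload storage is omitted. With a relational database, the analogous step would be letting the database assign the timestamp (for example via SQL `CURRENT_TIMESTAMP`) instead of passing one in from the web node.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a shared persistence layer that owns "now".
class SharedPersistence {
    private final Clock storeClock;
    private final List<Instant> stamps = new ArrayList<>();

    SharedPersistence(Clock storeClock) {
        this.storeClock = storeClock;
    }

    // The store assigns the timestamp; callers never supply one.
    // (Storing the identifier and body is omitted to keep the sketch short.)
    synchronized Instant write(String identifier, String body) {
        Instant now = storeClock.instant();
        stamps.add(now);
        return now;
    }

    synchronized List<Instant> stamps() {
        return new ArrayList<>(stamps);
    }
}

// Hypothetical stateless web node: its local clock is never used to stamp writes.
class WebNode {
    private final Clock localClock; // possibly skewed
    private final SharedPersistence store;

    WebNode(Clock localClock, SharedPersistence store) {
        this.localClock = localClock;
        this.store = store;
    }

    Instant localTime() {
        return localClock.instant();
    }

    Instant handlePut(String identifier, String body) {
        return store.write(identifier, body);
    }
}

class SharedNowDemo {
    public static void main(String[] args) {
        Clock storeClock = Clock.systemUTC();
        SharedPersistence store = new SharedPersistence(storeClock);

        WebNode nodeA = new WebNode(storeClock, store);
        // Node B's local clock is assumed to run five seconds slow.
        WebNode nodeB = new WebNode(Clock.offset(storeClock, Duration.ofSeconds(-5)), store);

        Instant first = nodeA.handlePut("res", "v1");
        Instant second = nodeB.handlePut("res", "v2");

        System.out.println("node B local time lags first stamp: "
                + nodeB.localTime().isBefore(first));
        System.out.println("second stamp before first: " + second.isBefore(first));
    }
}
```

Because both nodes go through the same store, the skew on node B does not affect the stamps; their ordering depends only on the store's own clock.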

martinklein0815 commented 5 years ago

My apologies for the late response! @ajs6f is correct in that the client does not determine the datetime of an archived resource. Memento does not address the notion of parallelization of write processes on the server's end but expects a resource to have one datetime. How multiple writes to the same resource are handled is beyond the scope of Memento. I may be missing something in this discussion (and I am definitely not a Trellis expert) so I'd echo @ajs6f's suggestion to post your use case to the Memento Google group for further discussion.

atz commented 5 years ago

delegating the perception of NOW to the DB

(as you wrote) is done. That's how we roll now.

That's enough to answer my question, thanks!