vmware-archive / xenon

Xenon - Decentralized Control Plane Framework

optimizations, bug fixes from private fork #26

Open georgechrysanthakopoulos opened 7 years ago

georgechrysanthakopoulos commented 7 years ago

i am no longer tied to xenon main line but i will occasionally open Issues for things i have fixed/improved on my xenon variant:

1) NodeGroupService merge logic has a flaw: it ignores an update from a remote node if it has marked that node UNAVAILABLE more recently than the node itself reported AVAILABLE. This can happen due to clock drift and has a trivial fix: do this in the merge function:

the key is the new remoteEntry.id.equals check, which accepts the change if the reporting node is the owner of that NodeStatus entry

            if (remoteEntry.documentVersion == currentEntry.documentVersion && needsUpdate) {
                // pick update with most recent time, even if that is prone to drift and jitter
                // between nodes, except, if the remote entry is the owner for this node status
                if (!remoteEntry.id.equals(remotePeerState.documentOwner)
                        && remoteEntry.documentUpdateTimeMicros < currentEntry.documentUpdateTimeMicros) {
                    logWarning(
                            "Ignoring update for %s from peer %s. Local status: %s, remote status: %s",
                            remoteEntry.id, remotePeerState.documentOwner, currentEntry.status,
                            remoteEntry.status);
                    continue;
                }
            }
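
For readers without the surrounding merge loop at hand, the check above can be isolated as a single predicate. This is only a minimal sketch, not the actual NodeGroupService code: `NodeEntry`, `MergeSketch`, `shouldIgnoreRemote` and the sample node ids are hypothetical stand-ins, and only the field names mirror the snippet.

```java
// Hypothetical stand-in for a node status entry; only the field names mirror the snippet above.
final class NodeEntry {
    final String id;
    final long documentVersion;
    final long documentUpdateTimeMicros;

    NodeEntry(String id, long documentVersion, long documentUpdateTimeMicros) {
        this.id = id;
        this.documentVersion = documentVersion;
        this.documentUpdateTimeMicros = documentUpdateTimeMicros;
    }
}

final class MergeSketch {
    /**
     * Returns true when the remote entry should be skipped in favor of the local one.
     * reportingPeerId plays the role of remotePeerState.documentOwner in the snippet.
     */
    static boolean shouldIgnoreRemote(NodeEntry remote, NodeEntry current,
            String reportingPeerId, boolean needsUpdate) {
        if (remote.documentVersion != current.documentVersion || !needsUpdate) {
            return false; // the guard from the snippet does not apply to this entry
        }
        // A node reporting its own status is trusted even if its clock lags ours.
        boolean reportedByOwner = remote.id.equals(reportingPeerId);
        return !reportedByOwner
                && remote.documentUpdateTimeMicros < current.documentUpdateTimeMicros;
    }

    public static void main(String[] args) {
        NodeEntry local = new NodeEntry("node-a", 3, 2_000);
        NodeEntry remote = new NodeEntry("node-a", 3, 1_000); // older timestamp, e.g. clock drift
        // Reported by node-a itself: accepted despite the older timestamp.
        System.out.println(shouldIgnoreRemote(remote, local, "node-a", true)); // false
        // Relayed by a third node: the older timestamp loses.
        System.out.println(shouldIgnoreRemote(remote, local, "node-b", true)); // true
    }
}
```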

2) Changed the ServiceHost executors to use async mode for the fork join pool. Gives a small perf boost (5 to 10%)
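
As a rough illustration of what "async mode" means here: the JDK's `ForkJoinPool` takes an `asyncMode` flag in its four-argument constructor, which switches worker queues to FIFO order for forked tasks that are never joined (the event-style case). The sketch below only shows how such a pool is built; `AsyncPoolSketch` and `newAsyncPool` are hypothetical names, and where exactly ServiceHost creates its executors is not shown in this thread.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

final class AsyncPoolSketch {
    /** Builds a fork join pool in async (FIFO) mode with the given parallelism. */
    static ForkJoinPool newAsyncPool(int parallelism) {
        return new ForkJoinPool(
                parallelism,
                ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                null,   // no custom uncaught exception handler
                true);  // asyncMode: FIFO scheduling for tasks that are never joined
    }

    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = newAsyncPool(4);
        System.out.println("async mode: " + pool.getAsyncMode());
        // Submit a fire-and-forget style task and wait for it only so the demo exits cleanly.
        pool.submit(() -> System.out.println("ran on " + Thread.currentThread().getName())).get();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Whether this helps depends on the workload; the 5 to 10% figure is the one reported above, not something this sketch measures.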

3) Instead of using pragmas to express NO_INDEX_UPDATE or FORWARDING_DISABLED, i expanded OperationOption and added INDEXING_DISABLED, FORWARDING_DISABLED, and then modified all places that check for the pragmas to also check for the options. This removes lots of allocations for operations that stay on the same host anyway (during service stop, for example)
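
To make the allocation argument concrete, here is a minimal sketch of the idea, assuming a simplified operation type. Everything in it is hypothetical (`OptionSketch`, `Op`, the header name and the directive string); only the two option names come from the item above. The point is that an in-process flag in an enum set avoids creating a header map and a header string, while the check still honors the pragma for requests that arrive over the wire.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

final class OptionSketch {
    // Option names taken from the item above; the enum itself is a stand-in.
    enum OperationOption { INDEXING_DISABLED, FORWARDING_DISABLED }

    static final String PRAGMA_HEADER = "pragma";               // hypothetical header name
    static final String PRAGMA_NO_FORWARDING = "no-forwarding"; // hypothetical directive value

    /** Simplified operation: options are a cheap bit set, headers are allocated lazily. */
    static final class Op {
        final EnumSet<OperationOption> options = EnumSet.noneOf(OperationOption.class);
        Map<String, String> headers; // stays null for purely in-process operations

        Op addPragma(String directive) {
            if (headers == null) {
                headers = new HashMap<>();
            }
            headers.merge(PRAGMA_HEADER, directive, (a, b) -> a + ";" + b);
            return this;
        }

        boolean hasPragma(String directive) {
            if (headers == null) {
                return false;
            }
            String value = headers.get(PRAGMA_HEADER);
            return value != null && value.contains(directive);
        }
    }

    /** Checks the option first, then falls back to the pragma for remote requests. */
    static boolean isForwardingDisabled(Op op) {
        return op.options.contains(OperationOption.FORWARDING_DISABLED)
                || op.hasPragma(PRAGMA_NO_FORWARDING);
    }

    public static void main(String[] args) {
        Op local = new Op();
        local.options.add(OperationOption.FORWARDING_DISABLED); // no header map allocated
        Op remote = new Op().addPragma(PRAGMA_NO_FORWARDING);   // wire-style pragma
        System.out.println(isForwardingDisabled(local));
        System.out.println(isForwardingDisabled(remote));
        System.out.println(isForwardingDisabled(new Op()));
    }
}
```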

4) Minor factory service change: do NOT forward, ever, a direct client POST if the child service does NOT have owner selection and is just replicated instead. This works well for pure replication services behind an external load balancer that has already load balanced across different nodes

5) new ServiceOption.CUSTOM_INSTRUMENTATION. 10% boost in throughput in stateful services that do not need the core operation tracking stats, but DO need custom stats. It requires changes all over, but under the covers, and it's backwards compatible (since it is a new option). AVAILABLE, CREATE_COUNT, etc., all core stats, use the CUSTOM_INSTRUMENTATION option so they don't force stats on all services by accident, which is what happens today

6) removed web socket support. Not used now that we have SSE

7) reduction in allocations in FactoryService, by removing the PARENT replication header and just using a parentUri field in Operation.remoteContext

8) moved SSE handler fields to Operation.remoteContext so we don't bloat the size of ALL operations, even just internal ones!!

this is all just fyi. i can't create pull requests anymore since my version of xenon has diverged (faster, smaller, not backwards compatible due to method removals and renames).

I still manually pull changes from main xenon, but i can't push back my changes.

fyi @sufiand @asafka @gbelur @toliaqat @ttddyy

you can leave this Issue open, and i can post updates on occasion. Feel free to create pivotal tracker items for individual items I report

georgechrysanthakopoulos commented 7 years ago

another one: LuceneDocumentIndexService, on bulk document expiration, sends DELETE requests without the NO_FORWARDING pragma. This means that when the same document expires across N nodes in a node group, we issue DELETEs from all nodes to all nodes, causing a lot of wasted churn

georgechrysanthakopoulos commented 7 years ago

@jvassev fyi on the above

asafka commented 7 years ago

Thanks Geoge, we will take a look.

asafka commented 7 years ago

Sorry for the typo George.