Closed oriubimo closed 4 years ago
1) Because I think it's easier to have everything per fiber than some per fiber and some per thread.
2) I think it is better to avoid global variables. global variables make scope more difficult to understand, because of them we made this mistake in the beginning. My solution is also faster if you read metadata. Finally, my solution doesn't encourage you to change the current pipeline.
1) Weil ich glaube, dass es einfacher ist, alles pro Fiber zu haben, als Manche pro Fiber und Manche pro Thread.
2) Ich glaube, dass es besser ist, globale Variablen zu vermeiden. Globale Variablen machen Scope komplizierter zu verstehen, ihretwegen haben wir diesen Fehler am Anfang gehabt. Auch ist meine Lösung schneller, wenn man Metadata liest. Endlich fördert meine Lösung nicht, dass man die aktuellen Pipelinien ändern.
Where do we have global variables that you avoid here?
raw context has other goal besides providing metadata service for operators: it also supplies output interface for writing the objects. The way it was designed is that each raw context objects has its own buffers per output shard. i.e. we have N*(num-threads) output buffers. Now you break this assumption - you create N buffers per fiber per thread. This is not needed, per thread/shard buffering is sufficient.
To summarize, I prefer not to pursue this direction and in general I prefer avoiding the pattern where something uses an object but for some reason it's owned by some other entity, thus creating one-to- many relationship like with raw_contexts
. Sometimes it's necessary but not in this case.
When I say globals I refer to any variable that's visible from everywhere, I'm calling non-local thread-locals globals too. If you want me to use fiber-local, how do you think we can solve the efficency problem? Or do you think it's ok if I just ignore it?
(You said fiber-local uses a slow map object)
Ok, ignore my comment, I read your comment on JIRA just now, I understand how to solve this.
struct PerConnection { string file_name; size_t file_pos; ....} // FiberLocal struct
PerConnection* OperatorExecutor::GetPerConnectionData() { return per_io_->raw_context->per_fiber_connection.get();}
in MapperExecutor::MapFiber:
PerConnection* per_conn_obj = GetPerConnectionData(); // cache the pointer to metadata for this fiber.
....
instead of SetFileName(...): per_conn_obj->file_name = rec->second;
....
in do_context.h:
const std::string& input_file_name() const { return per_fiber_connection->file_name_;}
....
accessors of raw_context will become slower but the operator that updates metadata will cache the pointer to the fiber-local object so its updates will be fast. I think it's good enough and if it won't be good enough we could create fiber-local RawContextExt that wraps thread-local rawcontext, i.e. to move fiber-local metadata from rawcontext to a fiber-local object that wraps it. I think it will require more work but those 2 directions that I would focus on and choose one of them to pursue.