raduprv / Eternal-Lands

http://www.eternal-lands.com
Other
155 stars 56 forks source link

Actors list lock #164

Closed gvissers closed 2 years ago

gvissers commented 2 years ago

This pull request adds mandatory access locking for the actors list, i.e. it is no longer possible to modify the actors list or an individual actor without locking the actors list first. Well, technically, since the client is not written in Rust, one can hold on to an actor pointer after unlocking the list, and modify it then. So it has only gotten harder, not entirely impossible. This improves upon the current situation, where the actors list is locked voluntarily, with the result that in many places it is not locked at all before an actor is modified.

Unfortunately, at least for the time being, this code defaults to using a recursive lock. The single reason for that is add_sound_object_gain(), which is deep down in the call chain, and is called both with the actors held and without. It was not feasible to pass down the required information through all call paths, so for now this function locks the actors list again if it is already locked.

If there are no objections, I would like to merge this branch soon-ish, before we merge a lot of other changes, as it is rather large and touches a lot of code.I have been using these patches for a while now, and believe I have shaken out the bugs. However, if someone could eyeball the changes, that would be great. An in-depth review may be a bit much to ask for, due to the size of the patch :(

pjbroad commented 2 years ago

We ought to discuss whether we want to do a bug fix release for 1.9.6.0 before making major changes. Also we have the Android branch to merge which will be equally broad. Not sure which we should do first though I'd hope they do not overlap too much. I'll do a look through this request though.

gvissers commented 2 years ago

We ought to discuss whether we want to do a bug fix release for 1.9.6.0 before making major changes.

Sure, we can do a bugfix release first if that's necessary. This PR is more of a heads up that I would like to merge this before we start doing a lot of other work.

Also we have the Android branch to merge which will be equally broad. Not sure which we should do first though I'd hope they do not overlap too much.

I merged this branch into the Android branch (locally) at some point while trying to find the cause of the crashes on Adreno systems. IIRC the merge was not too bad. I don't think it matters much which we do first, the conflicts will probably be roughly the same. Let's do Android first, I expect the merge with master to be more painful.

I'll do a look through this request though.

Great, thank a lot!

gvissers commented 2 years ago

Unless someone has a reason not to, I would like to merge this branch at some point. Preferably earlier than later, as it digs deep into the code and affects a lot of files, so I would like to give it as much testing as possible. Also merging this early will hopefully be easier than trying to keep this branch in sync with master.

I have been using this branch for a couple of months now, without issues. Of course I don't get to test everything, so there may be skeletons in the closet that I'm not aware of. But they're buried pretty deep down, I think.

@pjbroad , I believe you said at some point you wanted to review this, have you ever had the chance? If you still want to give it a lookover, I'm happy to wait for it, but if you're comfortable merging it, I'm going to go ahead and squash merge it.

pjbroad commented 2 years ago

I will be able to take a look over the weekend. Is that OK?

gvissers commented 2 years ago

No problem at all, take your time. I just wanted to make sure it's moving forward.

pjbroad commented 2 years ago

I've done my best to check this through, sorry its taken so long. The level of churn is rather large.

pjbroad commented 2 years ago

I've done my best to check this through, sorry its taken so long. The level of churn is rather large.

gvissers commented 2 years ago

The level of churn is rather large.

Yes I realise that, so thank you very much for ploughing through it all. :bow:

You raise some very good points, I will try to fix them all and will update this thread then.

gvissers commented 2 years ago

Hey Paul,

Thanks again for the review. I have tried to address your comments:

  • The code does not compile with ECDEBUGWIN debug enabled.
  • In new_character.c keypress_namepass_handler() the return value from lock_and_get_self() is not checked. It seems you do that everywhere else so I presume it's an omission.

These should be fixed,

  • There are a few places you acquire the lock for an extended period. auto_save_local_and_server() is one which likely could be improved as I don't think its needed for the duration. It only checks for the player fighting to avoid lag. The counters code has the lock a lot.

I have fixed auto_save_local_and_server(), there is indeed no need to do a bunch of IO under lock. It's interesting you mention the counters though. AFAICT, only increment_death_counter() (and the functions it calls) is run under lock. But that needs to be under lock, since the counters rely heavily on the actor positions to determine who killed what. (which should really be done by the server, but that's a separate discussion). If you have other examples where you think the lock could possibly be released earlier, I'd be happy to look at them, though.

  • There are quite a few places with the same code that read an actors coordinates. I wonder of a common function could be provided for this and potentially hold the lock for less time as result.

As to the first, I have created helper functions to get that actor's own tile coordinates and world coordinates. More on the latter below. I don't think thy will help in reducing the time under lock though, since the places where they can safely be used alread release the lock as soon as possible. It does reduce a bit of churn though, and makes for a better read.

  • For the functions that get the lock and an actor pointer like lock_and_get_XXXX( address of actor pointer ) then free the lock using release_locked_actors_list(), I wonder if the actor pointer could be set to NULL when no longer locked. It might quickly highlight issue if the pointer is used later by mistake.

At first, I tried to have the lock_andget() functions return a wrapper structure, and so force* the invalidation of the actor pointers. The wrappers were very unwieldy though (you'd be taking member fields everywhere). Also, there are quite a few cases where another pointer is taken from the actors list afterwards, and that would need to be invalidated as well. So I have created two helper functions, release_actors_list_lock_and_invalidate() and release_actors_list_lock_and_invalidate2(), that release the lock and set one or two actor pointers to NULL. The burden is still on the programmer to call these functions, but it should be easier now.

As for the actor's world position: so far only the landmines debug code calls self_position(). The reason for that is that I think the other potential callers would be better off using the actor's tile position instead:

I did not touch these yet, as I was not 100% sure the world coordinates and tile position are always in sync, and there might be a good reason the world coordinates are used. If you think there is none, I will convert them to tile coordinates and have the code call self_tile_position() instead.

pjbroad commented 2 years ago

Thanks for doing the updates. The various helper functions clean things up nicely.

Anyway, good (and patient) work, I'd say merge!

gvissers commented 2 years ago

Thanks, I've removed the comment. Upon further thinking it really makes no sense to worry about the actor moving away while taking stuff from the bag. The server will allow it or not, either way, it's not really the client's problem.

I've merged, thanks for reviewing!