{error, tid_not_found} when encoding/decoding

jinganix / enif_protobuf

A Google Protobuf implementation with enif (Erlang nif)

38 stars 21 forks source link

{error, tid_not_found} when encoding/decoding #8

Closed bwalkowi closed 5 years ago

bwalkowi commented 5 years ago

Hi,

sometimes I get tid_not_found during either data encoding or decoding. After eliminating the option that messages were invalid I take a look at enif_protobuf code. If I understand it correctly some space is allocated in state (exactly erlang:system_info(logical_processors) slots) so that each thread can keep it's data there. When new thread comes it simply takes one of the empty slots. If no slot is available then {error, tid_not_found} is returned. Is it possible that there can be more threads than logical_processors? or that threads die and are respawned with dirrefent tid? Either way, would it be possible to clear state->tdata so it can be populated again instead of returning error? or add function to do so (e.g. purge_tdata) from erlang?

jinganix commented 5 years ago

Yes, you understand correctly. We can clear state->tdata and populate it again.

Attaching data to every thread as the thread-specific data, maybe it's a better way to solve the problem. However, I failed to find a proper way to get and set thread-specific data.

There is another way. The ep_tdata_t is not thread specific. We can pop or create tdata from a synchronized pool. Push it back after encoding or decoding for reuse.

bwalkowi commented 5 years ago

Synchronized pool sounds good. I thought ep_tdata_t was thread specific since ep_lock_t that points to it also keeps tid and one thread always uses ep_lock_t it previously marked. But if it is not then synchronized pool should fix the problem. It will create new tdata only if there are more than logical_processors threads trying to encode/decode msg at the same time.

On the other hand, if it is not thread specific and can be reused then cleaning it should be as simple as setting state->lock_used to 0 (or 1 and immediately taking first slot for current thread)? With this no new memory would need to be allocated. Also I assume it should be rather rare so populating it again shouldn't be expensive.

What do you think?

jinganix commented 5 years ago

Have you changed +S Schedulers:SchedulerOnline options? My cpu has 8 logic cores, when I use +S16:16 options the error will occur. Changing erlang:system_info(logical_processors) to erlang:system_info(schedulers) will fix this.

We could not simply clear state->lock_used and take the first ep_tdata_t for reuse, because it might be used by other threads. We should lock it at least.

bwalkowi commented 5 years ago

erlang:system_info(schedulers) is better limit than erlang:system_info(logical_processors). Then again, I wonder if situations where threads dies without taking down whole system and are respawned with new tid are possible. Even if number of threads is constant tid can change. Maybe beside changing logical_processors to schedulers we should also use synchronized pool/stack as state->locks (e.g. linked list). Then each thread can pop one slot from it and push it back ones it finishes encoding/decoding. Since there shouldn't be more threads than schedulers we can allocate all slots at load time.

How about this?

jinganix commented 5 years ago

Yes, it's a robust method. But if we use a pool, we should lock it twice when every encoding or decoding. Your previous method that clearing state->lock_used is better, by carefully comparison.

Because when state->lock_used == state->lock_n, we don't need any lock. If any tid is changed, then no tdata will be resolved, we just wait for the state->cache_lock then clear state->lock_used.

bwalkowi commented 5 years ago

Sure, if you're fine with this I will try to implement it and create a pr after testing.