pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
210 stars 23 forks source link

Take loader_factory instead of loader in `JsonldOptions` #143

Closed damooo closed 6 months ago

damooo commented 8 months ago

Current JsonLdOptions<L> takes a loader of type L, and wraps it in a Mutex. It effectively makes jsonld parser non concurrent, as every concurrent op have to wait anyway till others complete.

Instead, it will be great to take a loader-factory fn() => L, that can return fresh loader on every call. As most loaders can be cheap to clone or construct, it will enable concurrency without much cost.

pchampin commented 7 months ago

You are absolutely right, wrapping loaders (which have an async API) into a (sync) Mutex is an aberration. My bad!

However, I'm not entirely convinced by your assertion that loaders are "usually cheap to clone or construct". Some loaders may maintain an in-memory copy of frequently used (and possibly big) contexts. Similarly, HTTP loaders should rely on some form of caching (which could be on disk or in memory). Copying those in-memory caches would be inefficient. The alternative would be to resort to some kind of mutex inside the loaders...

So I would keep the idea of wrapping the loader in a mutex, but use futures-util's async Mutex instead. Would that make sense?

Could you maybe make a PR with these changes?

damooo commented 7 months ago

However, I'm not entirely convinced by your assertion that loaders are "usually cheap to clone or construct". Some loaders may maintain an in-memory copy of frequently used (and possibly big) contexts. Similarly, HTTP loaders should rely on some form of caching (which could be on disk or in memory). Copying those in-memory caches would be inefficient.

The rust pattern for this is to use interior mutability, wrapped in an Arc which is cheap to clone. For example, The moka's in-memory cache, is cheap to clone, and all clones are backed by same effective cache. From their docs:

Cloning is a cheap operation for Cache as it only creates thread-safe reference-counted pointers to the internal data structures. Don’t wrap a Cache by a lock such as Mutex or RwLock. All methods provided by the Cache are considered thread-safe, and can be safely called by multiple async tasks at the same time. No lock is needed.

And for on disk cache, like CaCache too need just path to cache.

Manas has an implementation of http loader for the exact purpose. Here it was instantiated with moka inmemory caching middleware factory:

https://github.com/manomayam/manas/blob/7a5748925ba76ed5546cf5647bf10d203e98a58d/crates/manas_server/src/recipe/impl_/single_pod/mod.rs#L83-L99

damooo commented 7 months ago

So I would keep the idea of wrapping the loader in a mutex, but use futures-util's async Mutex instead. Would that make sense?

That still makes the parser "effective" non concurrent, as each parser usage effectively have to wait for others to complete.

It would be really practical to use a loader factory. If it is considered okay, i should make a pr with that.

damooo commented 7 months ago

The sync mutex is perfectly fine, as lock is acquired in non-async context. It will not block runtime.

Issue was not about parsing task blocking the runtime, but about inability to run multiple parsing tasks concurrently using the same parser, as all tasks have to wait for the lock on config mutex.

May be, that's fine enough, as we can instantiate new parser for each task. Mentioning in docs will be greatly helpful though.

pchampin commented 7 months ago

The sync mutex is perfectly fine, as lock is acquired in non-async context. It will not block runtime.

Indeed. I stand corrected. Same for caches being cheap to clone.

Issue was not about parsing task blocking the runtime, but about inability to run multiple parsing tasks concurrently using the same parser, as all tasks have to wait for the lock on config mutex.

I see that now.

May be, that's fine enough, as we can instantiate new parser for each task. Mentioning in docs will be greatly helpful though.

Since this would be specific to the JSON-LD parser, while all other parsers are "concurrency friendly", this is probably not the best option.

I'm reopening this issue, and I'm welcoming a PR replacing "loader" with a "loader_factory" :)

pchampin commented 6 months ago

This was actually included in v0.8.0