shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Vendor or scrap lazy mode? #79

Open DrChainsaw opened 1 month ago

DrChainsaw commented 1 month ago

The writing has been on the wall for a long time for Dagger's lazy API (which is what powers FileTrees' lazy mode), and in the latest release it emits deprecation warnings when used.

I personally use the lazy mode quite a bit in my workflows, so I wouldn't mind adding a simple lazy computation framework to FileTrees. I'm not sure, however, whether I only want this because 1) I'm used to it and 2) I underestimate how much effort it would be to maintain.

I haven't given it much thought, but it seems that a lightweight lazy computation framework which just recursively executes the thunks would not be much extra work. Getting parallelism out of this could then be as simple as putting Dagger.@spawn in front of every call when executing, or something along those lines.
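For concreteness, a minimal sketch of what I mean might look something like this (the Thunk/lazy/exec names here are purely illustrative, not FileTrees' actual internals):

```julia
# Purely illustrative sketch of a vendored lazy layer: a Thunk records a function
# and its (possibly lazy) arguments, and exec recursively forces them.
struct Thunk
    f
    args::Tuple
end

lazy(f) = (args...) -> Thunk(f, args)

exec(x) = x                                  # plain values pass through unchanged
exec(t::Thunk) = t.f(map(exec, t.args)...)   # force arguments recursively, then call

a = lazy(+)(1, 2)    # nothing is computed yet
b = lazy(*)(a, 10)
exec(b)              # == 30; both calls run here

# Parallelism could then, hypothetically, be added by forcing each call through Dagger,
# e.g. exec(t::Thunk) = fetch(Dagger.@spawn t.f(map(exec, t.args)...))
```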

I guess this also gives us the opportunity to have Dagger as a weak dependency. In about 90% of my use cases I don't have any use for parallelism, and if this is similar for other users it could reduce the weight of the package by quite a bit.

Maybe one could also have a weak dependency on Distributed, although I guess that would be a slippery slope towards reimplementing Dagger's entire lazy machinery, which should probably be avoided.

@shashi and @jpsamaroo: I guess this will be a larger change to the package in either case, so I'd like to get your opinions if you have the time.

jpsamaroo commented 1 month ago

Is there a particular reason why you need to use the lazy API? And have you tried replacing usage of delayed and compute with @spawn/spawn and fetch? It should be a very straightforward transition.

The main difference that delayed provides (starting all tasks at once) can be achieved by wrapping your code in Dagger.spawn_bulk() do ... end, which batches up tasks until it hits the end of the block.
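As a rough sketch of the transition (the spawn_bulk usage here just follows the description above):

```julia
using Dagger

# Deprecated lazy API: build a graph of delayed calls, then compute/collect it.
lazy_sum = Dagger.delayed(sum)(Dagger.delayed(map)(x -> x^2, 1:10))
collect(Dagger.compute(lazy_sum))

# Eager API: @spawn launches tasks immediately; fetch waits for a result.
squares = Dagger.@spawn map(x -> x^2, 1:10)
total   = Dagger.@spawn sum(squares)    # dependency on `squares` is tracked automatically
fetch(total)

# Batching submission until the end of the block, as described above:
tasks = Dagger.spawn_bulk() do
    [Dagger.@spawn sum(rand(1000)) for _ in 1:4]
end
fetch.(tasks)
```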

Of course, if you don't want the weight of Dagger, it would be a reasonable time to remove it as a dependency. But I do think there could also be an opportunity to have a "generic" @spawn/spawn API that can fall back to non-Dagger execution if Dagger isn't desired. If you're interested in collaborating on this, please let me know!

DrChainsaw commented 1 month ago

I'm all for making use of Dagger for the eager loading as well; I just haven't gotten around to it. To me, that is a bit of a separate issue.

I use the lazy mode of FileTrees quite a lot in my workflow when doing exploratory data analysis. This often involves lazy-loading data that does not fit in memory, so I absolutely do not want to have to load the entire data set before slicing and combining things in the tree interactively.

One simple-to-explain case where lazy loading is very useful is using mv to combine data from multiple files, such as in this example from the docs. It is often difficult to understand up front what the output will be, especially when working with a disorganized directory structure, so seeing the result without having to load the data is pretty much a necessity, as is the ability to load a tiny portion of the data after each added processing step just to make sure things are combined in the right way.
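To make that concrete, a stripped-down version of that kind of workflow could look roughly like this (the directory layout, file names and combine keyword usage here are illustrative; see the docs example for the real thing):

```julia
using FileTrees, CSV, DataFrames

# Hypothetical layout: data/<date>/<city>/measurements.csv
t = FileTree("data")

# Lazy load: records *how* to read each file but reads nothing yet,
# so this stays cheap even when the data set does not fit in memory.
lt = FileTrees.load(t; lazy=true) do file
    DataFrame(CSV.File(string(path(file))))
end

# Regroup by city, concatenating all dates into one node per city.
# The resulting tree structure can be inspected immediately, without loading any data.
combined = mv(lt, r"(.*)/(.*)/measurements.csv", s"\2.csv"; combine=vcat)

# Only when exec is called do the loads and the vcat's actually run; before that,
# one can exec just a small subtree to check that things are combined correctly.
exec(combined)
```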

This could clearly be done without the lazy mode, but I fear that it would not feel as good to use. Also note that this does not really need Dagger's lazy-mode scheduling (I would never ask you to keep it alive just for this), but I think it is worth having a lightweight lazy mode in FileTrees. To me, this is independent of whether Dagger is used to enable parallelism. The usefulness of lazy mode has more to do with interactivity than with parallelism for me, which is why I also think it makes some sense to vendor it.

Another common use case is to supply a FileTree to a long function which plots various aspects of the data and generates a report. The fact that it does not matter whether the tree is lazy or not is extremely convenient, as one can develop the function with a small dataset loaded into memory (e.g. trying out the preprocessing for each plot in the REPL) and then just supply a lazy FileTree with the larger data that does not fit, and it just works. If one were using eager mode it would try to load all the data up front (which would fail), whereas with lazy mode only the data needed for each plot is loaded (and is then GC-able after the plot has been produced and put in the report).
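Schematically (the subtree names, value types and plotting calls are all made up), such a report function looks something like this:

```julia
using FileTrees, Plots

# Works the same whether `tree` was loaded eagerly (small data in the REPL)
# or lazily (the full data set that does not fit in memory).
function make_report(tree)
    for name in ("sensor_a", "sensor_b")
        sub = exec(tree[name])            # forces only the values in this subtree
        df  = reducevalues(vcat, sub)     # combine all loaded values under it
        savefig(plot(df.time, df.value), "$name.png")
        # after this iteration, `sub` and `df` are no longer referenced and can be GC'd
    end
end
```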

jpsamaroo commented 1 month ago

It sounds like I might have missed the features that you rely on when writing my response - if you wouldn't mind, could you provide an example or two (with code) of things that the lazy API lets you do that you're concerned the eager API would fail on? That would make it more concrete for me and might help me see where the eager API is lacking.

Also, the eager API fully supports lazy-loading of data, but of course the first task that uses a piece of lazy-loaded data as an argument will force that data to be materialized, at least for some amount of time (the data can be swapped back to disk automatically). It might be that we need more tools for defining lazy/batched operations over lazy-loaded data, which might look very close to the lazy task API, so we could consider what that would look like.

In the end, it is also possible to build a full lazy API on top of the eager API - this probably seems redundant, but the lazy API currently exists in a form that makes the eager API harder to implement, so removing it and re-adding it could lead to an overall better design that still benefits from everything the eager API offers.

jpsamaroo commented 1 month ago

Also, I missed the example you posted - I'll read through it and let you know if I can come up with any solutions! Please point me to any others that you think could help.

DrChainsaw commented 1 month ago

Thanks a lot for your responses and for investing time in this issue. I really appreciate it!

Let me know if you can't make sense of the example I provided and I'll try to construct another one. Just imagine that the file structure in the example is way larger, deeper and more irregular, that the data does not fit into memory, and that the main thing you want to do is combine data based on patterns in the paths. Also imagine that you do make use of FileTrees to find said patterns (e.g. by just looking at its output). :)

Just to be perfectly clear: I don't think anything is needed from Dagger, neither in terms of maintaining the lazy API nor in terms of adding capabilities to the eager one.

My point of view is that FileTrees provides a lazy mode which happens to be quite convenient when making everyday use of the stuff that FileTrees does.

Up to this point, this functionality was just half-accidentally free-riding on the fact that Dagger happened to have it (this is probably not how FileTrees was conceived, but this is how it looks to me as a user of FileTrees). Now that era has ended, and I/we need to do something about it.

I saw Dagger.File in the docs, and if Dagger could be given the lazy-ish capabilities you speak of, that sounds like it could be quite useful, though.

jpsamaroo commented 1 month ago

Everything is really clear from the example you posted and from the documentation as a whole - it's all a great introduction to FileTrees!

I definitely see a clear benefit from using the lazy API for FileTrees - primarily, its chainable nature is quite powerful for building up a transformation. To be clear, it seems like there are two modes - the default simulates the movement/transformation of files within the tree (allowing users to preview the result of a transformation), while adding in load/save actually performs the operation and simultaneously allows a transformation in the "value domain" to occur. Let me know if this sounds incorrect - this is currently my mental foundation for how FileTrees operates and thus what the required features would be.

I do agree that Dagger may not be necessary for the core of FileTrees for many users. Implementing a more basic chainable, function-call-based DAG which can be built up and then evaluated all at once would probably suffice for 95% of needs, it seems. Still, if it's not too much of a burden, and some of Dagger's other features can prove useful for users, I think it could make sense to keep the maintenance burden of supporting such a runtime system on Dagger, since that's what Dagger tries to do best :smile:

If we do agree that keeping Dagger as the core of FileTrees is a reasonable idea, then let me outline some of the features that I personally see as key to FileTrees' utility to users (whether they're currently used or might be used in the future):

DrChainsaw commented 1 month ago

Awesome! I'm happy that you think the use case is worth supporting in Dagger. The main reason I started this issue was that I'm also aware of how much of a maintenance burden the current lazy API is. I can see how adding some laziness on top of the eager API could be much easier.

> Let me know if this sounds incorrect - this is currently my mental foundation for how FileTrees operates and thus what the required features would be.

I think that is pretty much correct, except that load does not work on a lazy "transformed" tree, since the tree does not remember the transformations done to it. Instead, it is exec that kicks off whatever happens in the user-provided load function along with all applied transformations (save works too, of course, as it basically does exec and then saves the result).
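In other words, the ordering is roughly this (paths and functions made up):

```julia
using FileTrees

t  = FileTree("data")                                                # hypothetical directory
lt = FileTrees.load(f -> readlines(string(path(f))), t; lazy=true)   # records how to load; loads nothing
mt = mapvalues(lines -> count(!isempty, lines), lt)                  # records a transformation; still lazy
exec(mt)   # only here do the load function and the transformation actually run
# save likewise forces the computation first and then writes the results out
```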

I agree with the entire list above, and I do make use of the other stuff from time to time. The only remaining question is the lazy API.

Minor question: what is the reason that Dagger.File has both a deserialize and a serialize function? If the idea is that Dagger will use them when temporarily storing data to disk, this will break with a lot of my data, as the original source is a read-only format (i.e. no serialization method exists).

Another minor comment: I often find that threads slow things down since the tasks are memory-bound, so I often run with --threads=1, or just use the non-lazy API of FileTrees if I need threads for other things. I guess future versions of Dagger will try to figure this out? I suppose one can also provide a context with a single thread in the meantime?

shashi commented 3 weeks ago

I would not want to use a Dagger.File here. The philosophy of FileTrees.load is that it just takes a path and returns anything it likes -- the most parameterized thing imaginable.

I am a little bit saddened that Dagger's original goal of being out-of-core, and necessarily lazy for that reason, now needs to be rediscovered and added back as a nice-to-have. It might just make sense to have a smaller package that does this well. (I would look into @tanmaykm's scheduler from 2018, since that was written with the lazy graph in mind and based on the best research available at the time. See more here: https://www.youtube.com/watch?v=2G4ptA5J1bk)

But to begin with, a simple work-stealing scheduler would be good enough for most of the workloads FileTrees is used for.

jpsamaroo commented 3 weeks ago

> I would not want to use a Dagger.File here. The philosophy of FileTrees.load is that it just takes a path and returns anything it likes -- the most parameterized thing imaginable.

Can you clarify what you mean by this, maybe with an example? If you mean that you can just pass a path and Dagger would return an object of the appropriate data type, you could always use FileIO.load as the deserialize callback to accomplish this.
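Something along these lines (a sketch; the exact keyword names may differ from the released Dagger.File API):

```julia
using Dagger, FileIO

# Sketch: wrap an existing file so tasks receive the loaded object rather than a path.
# The serialize/deserialize keyword names follow the callbacks discussed above and
# may not match the released API exactly.
img = Dagger.File("photo.png"; deserialize=FileIO.load, serialize=FileIO.save)

# The task sees the deserialized value (here an image array), not the path.
dims = fetch(Dagger.@spawn size(img))
```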

> I am a little bit saddened that Dagger's original goal of being out-of-core...

Have you tried the new out-of-core support in Dagger? It's provided via MemPool, implements a tunable LRU/MRU strategy, and allows working with data of any type. Any Dagger Chunk can use this automatically for seamless out-of-core support, just by calling Dagger.enable_disk_caching! once at the top of an application. Maybe this doesn't exactly mesh with FileTrees' idea of out-of-core, but I'm sure with some work this can be reconciled nicely.
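For example (a sketch; the no-argument call is assumed here, and the real function may take limits on memory and disk usage):

```julia
using Dagger

# Enable disk caching once at application start (assumed no-argument form).
Dagger.enable_disk_caching!()

# Tasks can now produce more chunk data than fits in RAM; MemPool spills
# least-recently-used chunks to disk and reloads them on demand.
chunks   = [Dagger.@spawn rand(10_000, 1_000) for _ in 1:20]   # ~1.6 GB of results in total
partials = [Dagger.@spawn sum(c) for c in chunks]              # each task touches one chunk at a time
total    = sum(fetch.(partials))
```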

> and necessarily lazy for that reason now needs to be rediscovered and added back as a nice-to-have.

It's not ideal, that's true, but Dagger's core APIs are basically maintained by only me, and my focus has been on supporting APIs that compose with Julia's inherently sequential, eager interfaces. If you wanted to help me figure out how to rebuild delayed (or an equivalent API) on top of the eager API, while not breaking existing code, that would be great! It's also on my long-term TODO list to re-introduce delayed, but with a slightly more uniform and intuitive API that matches spawn/@spawn in functionality.

> I would look into @tanmaykm's scheduler from 2018...

I also wish that this scheduler had made its way into Dagger's core codebase. At the time, it seemed like that was the goal, but it didn't appear that anyone was actively working on either integrating it into Dagger's core or getting the ecosystem interested in using it (to help spur development). Since I wasn't able to fully understand it myself, I basically had to implement things from scratch within Dagger's core. In the end, we now have a quite capable scheduler built into Dagger, and I have a large overhaul planned that will make it even more programmable. I also want to improve its reliability and robustness in the face of various failures and stalls, but I will need a lot of help from the ecosystem to find use cases (with locally reproducible code) to help drive that development.