ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Visualise groups of targets (e.g. target factories?) #282

Closed · liutiming closed this issue 3 years ago

liutiming commented 3 years ago

Currently I have a list of targets that are distinct from one another, and I want to understand their relationships better. I understand the list info will not be retained, but it would be nice to visualise groups directly in targets on a more macro scale. Currently I have 105 targets in my project, with more on the way, so it can be a bit hard to make sense of the visNetwork result.

Is it a design issue? Currently I am using one function for each target, but even if I combine multiple functions into one target, the targets themselves will still appear on the visNetwork graph, so I am not sure that helps...

wlandau commented 3 years ago

This is hard to do in the general case. Currently, if you double-click a node in the graph, all the downstream nodes collapse into a cluster. That may or may not be useful. We might be able to customize the behavior with tar_visnetwork() %>% visNetwork::visOptions(...) %>% visNetwork::visInteraction(...) or something similar. That probably deserves a look.
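As an untested sketch of what that layering might look like: tar_visnetwork() returns a visNetwork widget, so further visNetwork calls can be piped onto it. The specific options below (collapse on double-click, navigation buttons) are assumptions about what would help here, not a confirmed recipe.

```r
# Sketch only: pipe the widget from tar_visnetwork() into
# visNetwork customization functions.
library(magrittr)
targets::tar_visnetwork() %>%
  visNetwork::visOptions(collapse = TRUE) %>%          # enable cluster-on-double-click
  visNetwork::visInteraction(navigationButtons = TRUE) # zoom/pan buttons
```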

Unfortunately, I do not see a good general way to automatically choose clusters beforehand, at least not one that would make sense to build into targets directly. Automatic clusters are another feature I tried in drake but found to be a lot of work with little payoff. Some user-side workarounds are possible, though.
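One workaround along these lines, as a sketch: restrict the graph to a subset of targets with the tidyselect arguments of tar_visnetwork(). The "sim_" and "draft" name patterns below are hypothetical; substitute whatever naming convention the pipeline actually uses.

```r
# Sketch: show only part of the pipeline to keep the graph readable.
targets::tar_visnetwork(
  targets_only = TRUE,                       # hide functions and other globals
  allow = tidyselect::starts_with("sim_"),   # keep targets matching a prefix
  exclude = tidyselect::contains("draft")    # drop others by pattern
)
```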

liutiming commented 3 years ago


It is good to know the clicking trick! I think it will be useful for small projects, but with the current visNetwork behavior a click can be quite confusing: the nodes rearrange themselves and there is no indication of which node has been folded, so I had a fun time looking for it 😛

The user-side solutions can be useful too, but they seem aimed at specific targets rather than groups of targets?

liutiming commented 3 years ago

On a related note, as I refactor code into target factories, I realise that there are many branches in and out of each target, and this makes abstraction much more difficult (even though this is why targets is so helpful in the first place!). I feel the ideal situation is for each target factory to return one data frame instead of a list of unrelated data frames, so I end up breaking things down a lot, and each target factory ends up not being very "deep" in the sense of John Ousterhout's A Philosophy of Software Design. At this point I realise that perhaps targets themselves should be deep, using more functions in each target, so that less dynamic branching is required.

Building deep functions/targets can be quite challenging in data science, though, because the heterogeneity of the input can cause unexpected bugs. I feel it is necessary to test the input before applying the analytical pipelines (i.e. target factories) to it.
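One way to sketch this input-testing idea is a validation function that runs as its own target, so downstream targets only build if the data passes. The column names and helper functions below are hypothetical.

```r
# Sketch: a lightweight input check as its own target.
check_input <- function(data) {
  stopifnot(
    is.data.frame(data),
    all(c("id", "value") %in% names(data)),  # hypothetical required columns
    !anyNA(data$value)
  )
  data  # pass the validated data through unchanged
}

# In _targets.R (hypothetical reader and pipeline functions):
# list(
#   targets::tar_target(raw, read_raw_data()),
#   targets::tar_target(checked, check_input(raw)),
#   targets::tar_target(analysis, run_pipeline(checked))
# )
```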

This is getting a bit philosophical, and it can be quite challenging for many data scientists, since their performance is measured on the data they analyze, not on how extensible their programs are. No wonder many people just choose to copy-paste and pay the technical debt later.

liutiming commented 3 years ago

Anyway, as a rule of thumb, I feel people should try to "deepen" their targets first and test them out before they write target factories.

liutiming commented 3 years ago

I feel the development and design of targets is as much an art as a technical challenge: a good tool puts users in the right mindset and encourages good behavior. Perhaps it would be good to use a community platform (would GitHub Discussions work?) where users can discuss best practices for using targets. Some topics we have discussed include (1) static naming, (2) git, and (3) the design of individual targets.

I was hoping this would include more people in the discussion, and you would not have to monitor the conversation as frequently as you do with issues! (I have really enjoyed the discussion so far, though.)

liutiming commented 3 years ago

https://yihui.org/en/2018/09/notebook-war/#the-two-cultures-the-r-vs-python-culture-or-data-analysis-vs-software-engineering-culture

Yihui's blog post is somewhat related to the data science vs. software engineering discussion, and it may be a bit contradictory to the philosophy of the R Targetopia?

What is so usable about targets is that users can apply it directly to their existing pipeline without generalizing their functions (or even writing any functions at all). But as the data becomes more complex, the need for abstraction starts to rise again...

wlandau commented 3 years ago

> It is good to know the clicking trick! I think it will be useful for small projects, but with the current visNetwork behavior a click can be quite confusing: the nodes rearrange themselves and there is no indication of which node has been folded, so I had a fun time looking for it

Fair point; I don't really find myself clicking to collapse all that much either. I now see that visOptions() does allow some customization here, but it is apparently still in development, and the docs do not explain how to select which nodes get collapsed. I would like it to be a neighborhood of order 1, but the package defaults to collapsing all downstream nodes.
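For the "no indication of which node has been folded" problem specifically, one possible sketch is to pass a list to the collapse argument of visOptions() and style the cluster node so it stands out. Whether these exact sub-options behave as hoped would need testing against the visNetwork version in use.

```r
# Sketch: style collapsed cluster nodes so they are easy to spot.
library(magrittr)
targets::tar_visnetwork() %>%
  visNetwork::visOptions(
    collapse = list(
      enabled = TRUE,
      fit = TRUE,                                   # refit the view after collapsing
      clusterOptions = list(shape = "square")       # make the cluster node visually distinct
    )
  )
```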

wlandau commented 3 years ago

Graph manipulation in base targets is going to require user-side workarounds. If there are enough ideas, technologies, and enthusiasm, I or someone else may develop a package on top of tar_network() with more advanced graphs. But that gets a bit involved for targets itself.
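As a sketch of what such a package (or one-off script) could start from: tar_network() exposes the pipeline graph as plain vertex and edge data frames, so a custom clustered view can be assembled with visNetwork directly. Grouping by a name prefix before the first underscore is a hypothetical convention, not a targets feature.

```r
# Sketch: build a custom grouped graph on top of tar_network().
net <- targets::tar_network(targets_only = TRUE)
vertices <- net$vertices
vertices$group <- sub("_.*$", "", vertices$name)  # hypothetical: cluster by name prefix

visNetwork::visNetwork(
  nodes = data.frame(id = vertices$name, label = vertices$name, group = vertices$group),
  edges = data.frame(from = net$edges$from, to = net$edges$to)
)
```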

wlandau commented 3 years ago

> On a related note, as I refactor code into target factories, I realise that there are many branches in and out of each target, and this makes abstraction much more difficult (even though this is why targets is so helpful in the first place!). I feel the ideal situation is for each target factory to return one data frame instead of a list of unrelated data frames, so I end up breaking things down a lot, and each target factory ends up not being very "deep" in the sense of John Ousterhout's A Philosophy of Software Design. At this point I realise that perhaps targets themselves should be deep, using more functions in each target, so that less dynamic branching is required.

It's an art. Sometimes, a target factory abstracts away an entire simulation study, as with stantargets::tar_stan_mcmc_rep_summary(): https://wlandau.github.io/stantargets/articles/mcmc_rep.html. Other times, there are few targets to create, but the proper command would otherwise be difficult. For example, tarchetypes::tar_render() constructs an elaborate expression object to declare dependencies, but users only need to supply the report itself. If no hard patterns emerge in your case, then my advice might not apply to you.

> Building deep functions/targets can be quite challenging in data science, though, because the heterogeneity of the input can cause unexpected bugs. I feel it is necessary to test the input before applying the analytical pipelines (i.e. target factories) to it.

Yeah, it takes both careful engineering and domain knowledge to construct factories. Here are the minimum tests I think are necessary: command/pattern construction (tar_manifest()), dependency relationships (edges from tar_network()), return values, and invalidation behavior. Target factories are ideally implemented in packages for this and other reasons.
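A sketch of the first two kinds of checks, using targets::tar_test() (the package's testthat helper, which runs code in a temporary directory). tar_fit_model() and the target name "fit_summary" are hypothetical; substitute the factory and names under test.

```r
# Sketch: test command construction and graph edges of a factory.
targets::tar_test("factory builds the expected targets", {
  targets::tar_script({
    library(targets)
    list(tar_fit_model(data = "fit"))  # hypothetical factory call
  })
  manifest <- targets::tar_manifest(callr_function = NULL)
  testthat::expect_true("fit_summary" %in% manifest$name)  # hypothetical target name
  edges <- targets::tar_network(callr_function = NULL)$edges
  testthat::expect_true(all(c("from", "to") %in% colnames(edges)))
})
```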

wlandau commented 3 years ago

> This is getting a bit philosophical, and it can be quite challenging for many data scientists, since their performance is measured on the data they analyze, not on how extensible their programs are. No wonder many people just choose to copy-paste and pay the technical debt later.

I am trying to tackle the most common patterns in data science first, particularly Bayesian statistics because of my background. Hopefully a small handful of R Targetopia packages will cover a large number of use cases. In most other cases, it will probably take someone who is equal parts developer and data scientist, and the end product will hopefully benefit a whole team of non-developer data scientists. But still, I think we are getting closer to democratizing pipelines.

wlandau commented 3 years ago

> Anyway, as a rule of thumb, I feel people should try to "deepen" their targets first and test them out before they write target factories.

I agree. For direct users of targets, Ousterhout's "deep modules" are custom functions. I think we can expect more users to write functions than to write factories.
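To sketch the idea with hypothetical function names: a "deep" target exposes one simple call while composing several internal steps, instead of one target per small step. The trade-off is coarser caching, since a change to any internal step reruns the whole target.

```r
# Sketch: one deep function behind one target. All names hypothetical.
analyze <- function(raw) {
  cleaned <- clean_data(raw)   # internal step 1
  fit <- fit_model(cleaned)    # internal step 2
  summarize_fit(fit)           # internal step 3
}

# In _targets.R, one deep target instead of three shallow ones:
# targets::tar_target(result, analyze(raw_data))
```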

wlandau commented 3 years ago

> I feel the development and design of targets is as much an art as a technical challenge: a good tool puts users in the right mindset and encourages good behavior.

You are so right! Coming from drake, this is exactly what I am trying to do with targets.

> Perhaps it would be good to use a community platform (would GitHub Discussions work?) where users can discuss best practices for using targets. Some topics we have discussed include (1) static naming, (2) git, and (3) the design of individual targets. I was hoping this would include more people in the discussion, and you would not have to monitor the conversation as frequently as you do with issues! (I have really enjoyed the discussion so far, though.)

I did not know about GitHub Discussions, and they sound like a great fit. I will explore this.

wlandau commented 3 years ago

> Yihui's blog post is somewhat related to the data science vs. software engineering discussion, and it may be a bit contradictory to the philosophy of the R Targetopia?

I think literate programming and targets are designed for different situations. Literate programming works best when you don't actually need to write much code and the maintenance burden is low. Computationally intense fields like Bayesian statistics and machine learning are different because they force us to confront software engineering problems, whether we like it or not. I have always recommended functions as a bare minimum, but that's hard to democratize. With the R Targetopia, I am trying to reverse that: still create sophisticated pipelines, but reduce the amount of code and software engineering required.

It is possible to misapply a technique in either direction: overengineering a simple analysis, or using literate programming for a massive simulation study. I find that people make the latter mistake far more often than the former.

wlandau commented 3 years ago

I had a look at GitHub Discussions, and I really think it is going to separate user issues from development issues, which will be huge. Thanks so much for suggesting it.

liutiming commented 3 years ago

My pleasure! Thanks so much for the insightful replies, too! I will respond to one in our newly created Discussion section and ruminate over the rest!