Closed: liutiming closed this issue 3 years ago
This is hard to do in the general case. Currently, if you double-click a node in the graph, all the downstream nodes collapse into a cluster. That may or may not be useful. We might be able to customize the behavior with tar_visnetwork() %>% visNetwork::visOptions(...) %>% visNetwork::visInteraction(...) or something similar. That probably deserves a look.
Unfortunately, I do not see a good general way to automatically choose clusters beforehand, at least not one that would make sense to build into targets directly. Automatic clustering is another feature I tried in drake but found to be a lot of work with little payoff. Some user-side workarounds:

- the allow and exclude arguments of tar_visnetwork()
- the values arguments of tar_map() and tar_eval(), just for graphing purposes

It is good to know the clicking trick! I think it will be useful for small projects, but given the current visNetwork behavior, a click can be quite confusing: the nodes rearrange themselves, and there is no indication of which node has been folded, so I had a fun time looking for it 😛
The user-end solutions can be useful, too, but they seem aimed at specific targets rather than groups of targets?
On a related note, as I am refactoring code for target_factory, I realise that there are many branches in and out of each target, and this makes abstraction much more difficult (this is why targets is so helpful in the first place!). I feel the ideal situation is that each target factory returns only one data frame instead of a list of unrelated data frames, so I end up breaking things down a lot, and each target factory ends up being not very "deep" (in the sense of A Philosophy of Software Design by John Ousterhout). I realise at this point that perhaps the targets themselves should be deep and use more functions in each target, so that less dynamic branching is required.
Building deep functions/targets, though, can be quite challenging in data science because the heterogeneity of input can result in some unexpected bugs. I feel it is necessary to include tests for the input before applying the analytical pipelines (i.e. target factories) to the input.
This is getting a bit philosophical and can be quite challenging for many data scientists, as their performance is measured on the data analyzed and not on how extensible their programs are. It is no wonder that many people just choose to copy-paste and pay the technical debt later on.
Anyway, as a rule of thumb, I feel people should try to "deepen" their targets first and test them out before they write target factories.
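The rule of thumb above might look like the following sketch. The data file, column names, and target names are all hypothetical; the point is only the contrast between many shallow targets and one deep target backed by a function.

```r
library(targets)

# Shallow version: every intermediate step is its own target,
# so every step becomes a node in the graph.
list(
  tar_target(raw,   read.csv("data.csv")),
  tar_target(clean, raw[!is.na(raw$x), ]),
  tar_target(model, lm(y ~ x, data = clean))
)

# Deep version: one target calls one function that hides the
# intermediate steps, so the graph shows a single node.
fit_model <- function(file) {
  raw   <- read.csv(file)
  clean <- raw[!is.na(raw$x), ]
  lm(y ~ x, data = clean)
}
list(tar_target(model, fit_model("data.csv")))
```

The trade-off is granularity: the deep version reruns all three steps whenever the file changes, while the shallow version can skip steps that are still up to date.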
I feel the development and design of targets is as much a technical challenge as an art - a good tool puts users in the right mindset and encourages good behavior. Perhaps it would be good to use a community platform (would GitHub Discussions work?) where users can discuss best practices for using targets. Some topics that we have discussed include 1. static naming, 2. git, 3. design of individual targets. I was hoping this would include more people in the discussion, and you would not have to monitor the conversation as frequently as you do with issues! (I have really enjoyed the discussion so far, though.)
Yihui's blog post, which is somewhat related to the data science vs software engineering discussion, and may be a bit contradictory to the philosophy of the R Targetopia?
What is so usable about targets is that the user can apply it directly to their existing pipeline without generalizing the functions (or even writing any functions at all). But as data becomes more complex, the need for abstraction starts to rise again...
It is good to know the clicking trick! I think it will be useful for small projects, but given the current visNetwork behavior, a click can be quite confusing: the nodes rearrange themselves, and there is no indication of which node has been folded, so I had a fun time looking for it
Fair point, I don't really find myself clicking to collapse all that much either. I now see visOptions() does allow some customization here, but it is apparently still in development, and the docs do not explain how to select which nodes get collapsed. I would like it to be a neighborhood of order 1, but the package defaults to collapsing all downstream nodes.
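A minimal sketch of that kind of customization, assuming the collapse argument of visNetwork::visOptions() (which, as noted above, is still in development, so the interface may change):

```r
library(targets)
library(visNetwork)
library(magrittr)

# Pipe the rendered graph through visNetwork functions to tweak
# interactive collapsing. The collapse settings shown here are
# an assumption based on the current visNetwork docs.
tar_visnetwork() %>%
  visOptions(collapse = list(enabled = TRUE, fit = TRUE)) %>%
  visInteraction(navigationButtons = TRUE)
```

Note that this only restyles the widget after the fact; it does not change which nodes targets itself draws.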
Graph manipulation in base targets is going to require user-side workarounds. If there are enough ideas, technologies, and enthusiasm, I or someone else may develop a package on top of tar_network() with more advanced graphs. But that gets a bit involved for targets itself.
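As a concrete example of the user-side workaround mentioned earlier in the thread, the allow and exclude arguments of tar_visnetwork() accept tidyselect helpers to prune the graph. The target name patterns below are hypothetical:

```r
library(targets)

# Show only the modeling targets and hide file-tracking targets
# to get a more macro-scale view of a large pipeline.
tar_visnetwork(
  allow = tidyselect::starts_with("model"),
  exclude = tidyselect::contains("file")
)
```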
On a related note, as I am refactoring code for target_factory, I realise that there are many branches in and out of each target, and this makes abstraction much more difficult (this is why targets is so helpful in the first place!). I feel the ideal situation is that each target factory returns only one data frame instead of a list of unrelated data frames, so I end up breaking things down a lot, and each target factory ends up being not very "deep" (in the sense of A Philosophy of Software Design by John Ousterhout). I realise at this point that perhaps the targets themselves should be deep and use more functions in each target, so that less dynamic branching is required.
It's an art. Sometimes, a target factory abstracts away an entire simulation study, as with stantargets::tar_stan_mcmc_rep_summary(): https://wlandau.github.io/stantargets/articles/mcmc_rep.html. Other times, there are few targets to create, but the proper command would otherwise be difficult to write. For example, tarchetypes::tar_render() constructs an elaborate expression object to declare dependencies, but users only need to supply the report itself. If no hard patterns emerge in your case, then my advice might not apply to you.
Building deep functions/targets, though, can be quite challenging in data science because the heterogeneity of input can result in some unexpected bugs. I feel it is necessary to include tests for the input before applying the analytical pipelines (i.e. target factories) to the input.
Yeah, it takes both careful engineering and domain knowledge to construct factories. Here are the minimum tests I think are necessary: command/pattern construction (tar_manifest()), dependency relationships (edges from tar_network()), return values, and invalidation behavior. Target factories are ideally implemented in packages for this and other reasons.
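A sketch of the first two of those minimum tests, assuming a hypothetical factory my_factory() that creates targets named analysis_data and analysis_summary (the factory, its arguments, and the target names are all made up for illustration):

```r
library(targets)
library(testthat)

# tar_test() runs the test body inside a fresh temporary directory.
tar_test("my_factory() builds the expected pipeline", {
  tar_script(
    list(my_factory(name = "analysis", file = "data.csv"))  # hypothetical
  )
  # 1. Command/pattern construction.
  manifest <- tar_manifest(callr_function = NULL)
  expect_true(all(c("analysis_data", "analysis_summary") %in% manifest$name))
  # 2. Dependency relationships.
  edges <- tar_network(callr_function = NULL)$edges
  expect_true(any(edges$from == "analysis_data" & edges$to == "analysis_summary"))
  # 3. Return values and 4. invalidation behavior could follow with
  # tar_make(), tar_read(), and tar_outdated().
})
```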
This is getting a bit philosophical and can be quite challenging for many data scientists, as their performance is measured on the data analyzed and not on how extensible their programs are. It is no wonder that many people just choose to copy-paste and pay the technical debt later on.
I am trying to tackle the most common patterns in data science first, particularly Bayesian statistics because of my background. Hopefully a small handful of R Targetopia packages will cover a large number of use cases. In most other cases, it will probably take someone who is equal parts developer and data scientist, and the end product will hopefully benefit a whole team of non-developer data scientists. But still, I think we are getting closer to democratizing pipelines.
Anyway, as a rule of thumb, I feel people should try to "deepen" their targets first and test them out before they write target factories.
I agree. For direct users of targets, Ousterhout's "deep modules" are custom functions. I think we can expect more users to write functions than to write factories.
I feel the development and design of targets is as much a technical challenge as an art - a good tool puts users in the right mindset and encourages good behavior.
You are so right! Coming from drake, this is exactly what I am trying to do with targets.
Perhaps it would be good to use a community platform (would GitHub Discussions work?) where users can discuss best practices for using targets. Some topics that we have discussed include 1. static naming, 2. git, 3. design of individual targets. I was hoping this would include more people in the discussion, and you would not have to monitor the conversation as frequently as you do with issues! (I have really enjoyed the discussion so far, though.)
I did not know about GitHub Discussions, and they sound like a great fit. I will explore this.
Yihui's blog post, which is somewhat related to the data science vs software engineering discussion, and may be a bit contradictory to the philosophy of the R Targetopia?
I think literate programming and targets are designed for different situations. Literate programming works best when you don't actually need to write much code and the maintenance burden is low. Computationally intense fields like Bayesian statistics and machine learning are different because they force us to confront software engineering problems, whether we like it or not. I have always recommended functions as a bare minimum, but that's hard to democratize. With the R Targetopia, I am trying to reverse that: still create sophisticated pipelines, but reduce the amount of code and software engineering required.
It is possible to misappropriate a technique in either direction: either overengineer a simple analysis or use literate programming for a massive simulation study. I find that people make the latter mistake far more than the former.
I had a look at GitHub Discussions, and I really think it is going to separate user issues from development issues, which will be huge. Thanks so much for suggesting it.
My pleasure! Thanks so much for the insightful replies, too! I will respond to one in our newly created Discussion section and ruminate over the rest!
Currently I have a list of targets that are distinct from one another, and I want to understand their relationships better. I understand the list info will not be retained, but it may be nice to visualise directly in targets on a more macro scale. Currently I have 105 targets in my project, and there are more to come, so it can be a bit hard to make sense of the visNetwork result. Is it a design issue? Currently I am using one function for each target, but even if I include multiple functions in one target, the targets will still stay on the visNetwork graph, so I am not sure if that helps...