tidyverse / magrittr

Improve the readability of R code with the pipe
https://magrittr.tidyverse.org

Hartmann pipelines #258

Closed IanWorthington closed 2 years ago

IanWorthington commented 2 years ago

Some pipe implementations (such as the one developed by John Hartmann) allow multiple input and output streams for each pipeline stage (see "Hartmann Pipelines", https://en-academic.com/dic.nsf/enwiki/364520). This lets us write filters that can, for example, process both the found and not-found records within a single pipe. In Hartmann's syntax this could look like:

read input.txt | A: locate /Hello/ | write found.txt ; A: | write notfound.txt

Here the records that pass the locate filter stage are written to the main output stream, but those rejected are written to a secondary stream (called A here) and passed to a separate write stage.

I have found this very useful when working with large datasets, and it is something I miss in R, where I constantly have to reprocess data to filter out different elements.

Is there something similar available in magrittr that I've missed that implements the same ideas?

lionel- commented 2 years ago

I'm not sure how this idea would be applicable here. The pipe passes data structures, not streams. Maybe look into destructuring assignments e.g. https://github.com/r-lib/zeallot.

IanWorthington commented 2 years ago

> I'm not sure how this idea would be applicable here. The pipe passes data structures, not streams. Maybe look into destructuring assignments e.g. https://github.com/r-lib/zeallot.

Hi @lionel-,

Thanks for your reply.

My use case here would be for stages such as dplyr::filter() so that, as in the example above, I could capture the rows rejected by a filter() and process them without having to refilter the original data frame. As my data frames tend to be quite large, this would be a considerable time saver.

lionel- commented 2 years ago

This would imply creating a variant of filter() that returns a list of two data frames. Note that this is fully independent of magrittr or any pipe operator; it's all about the data structure. You can use vctrs::vec_split() to achieve this with a non-dplyr interface.
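As a rough illustration of such a variant, here is a minimal base R sketch. The helper name partition_rows() and the sample data are hypothetical and are not part of magrittr, dplyr, or vctrs. It evaluates the condition once and returns both halves, so the rejected rows are available without a second pass over the data. It uses plain logical indexing rather than vctrs::vec_split() to stay dependency-free.

```r
# Hypothetical helper (not part of magrittr, dplyr, or vctrs):
# split a data frame into matching and non-matching rows using a
# logical vector that has already been computed once.
partition_rows <- function(data, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(data))
  # Treat NA as "not found", mirroring how dplyr::filter() drops NA rows.
  keep <- !is.na(keep) & keep
  list(
    found    = data[keep, , drop = FALSE],
    notfound = data[!keep, , drop = FALSE]
  )
}

# Made-up sample data for illustration only.
records <- data.frame(
  id   = 1:4,
  text = c("Hello world", "Goodbye", "Hello again", NA),
  stringsAsFactors = FALSE
)

# The condition is evaluated a single time; both outputs reuse it.
parts <- partition_rows(records, grepl("Hello", records$text))

parts$found$id     # 1 3
parts$notfound$id  # 2 4
```

Because the result is an ordinary list, each element can be passed on to its own pipeline, for example parts$found %>% ... and parts$notfound %>% ..., which is roughly the role the secondary stream plays in the Hartmann example.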

I'm closing this issue because this is out of scope for magrittr, but thanks for the suggestion.