moodymudskipper / inops

Infix Operators for Detection, Subsetting and Replacement
GNU General Public License v3.0

names #6

Closed moodymudskipper closed 4 years ago

moodymudskipper commented 4 years ago

all talk about naming conventions can happen here.


I chose %subset***% to extract matching subset, maybe there's better.


I like %!in% better than %out% because :


I picked rangeops as a package name because comparisons separate ranges of values as well so it seemed to make sense, open to any alternative. My previous package mmassign was more about having all kinds of assignment operators, but it's probably better to be more focused.

karoliskoncevicius commented 4 years ago

I really like your suggestions about %!in% instead of %out% - as you say it's more generalizable. At least we can probably consider that as decided :)

The package name - I am not too sure about. Personally, for me it doesn't make much difference. When I am using a package I just remember its name. But if we want to have more users then something "catchy" might be favourable.

A list of my suggested names so far:

%in{}%  %!in{}%
%in[]%  %!in[]%
%in()%  %!in()%
%in(]%  %!in(]%
%in[)%  %!in[)%

The names I am less sure about:

%#in{}% %#!in{}%
# and all the rest with %#in% for working on values that occur some number of times.

The names I am even less sure about:

%in{.}%
# for extracting the value itself, I think you used %vin{}% for that?

NOTE: I am not sure that overwriting the default %in% would be a good idea: users might find that their older code breaks after loading the package. So my proposal would be to use %in{}% for expanding on %in% ({} brackets denoting "set" as in math).

As always - just an opinion and comments welcome.

moodymudskipper commented 4 years ago

I'm all for a better package name, I just have no better idea.


%in{}% is a bit confusing as a name, {} usually describes sets as you say, and in that case it kind of means "apply".

Also iris %in{}% "setosa" is basically map_dfr(iris, `%in%`, "setosa") so it seems to me it really makes sense only if it's used a lot. Could you describe some use cases and the added value it has for you?

It also begs the question why we wouldn't have element wise operators for all our other operators, to be consistent, which is a genuine possibility but we need clear naming conventions, like maybe :

I don't think we should overload %in% either, did you think I imply we should ?

The package so far is only overloading <<- so foo < bar <- value works, but keeps its original binary use working as before.


re :

%in{.}% for extracting the value itself, I think you used %vin{}% for that?

Yes I did on reddit, but in this package I used a %subset***% form, so you can have %subset{}% but also %subset>=%.

In this case the {} could make sense as it's really about sets, but we have a lot of special characters already, the possibilities would be :

These may not be very readable nor easy to type, and there's the question of where the ! would fit that might lead to frustrations, though I think the last one is not that bad and is generalizable to %{}>=% etc.

We might need to keep ideas coming and sleep on them a few times.


I'm fine with a set of # operators, is the main use case you're thinking about to aggregate rare values before modelling ?

Some alternatives starting with your proposal :

karoliskoncevicius commented 4 years ago

About %in{}% I think I didn't convey the purpose of it...

In my view %in{}% would be the same as %in%, except would handle all the special cases we add. I am thinking about it this way (left side is notation in math):

x in {"a", "b"}    # %in{}%
x in [a:b]         # %in[]%
x in (a:b)         # %in()%

So x %in{}% A would be "is element x in set A" and would be the same as %in%.

In other words - {} is a variant of interval notation, not a new notation for additional operators. The main purpose is: 1. to consistently specify the type of interval after %in and 2. to not overload %in%.
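
A minimal sketch of how these three operators could behave, assuming a two-element vector gives the interval bounds. The function bodies are illustrative assumptions, not the package's implementation.

```r
# Hypothetical definitions; the operator names follow the proposal above.
`%in{}%` <- function(x, set) x %in% set                          # set membership, same as %in%
`%in[]%` <- function(x, bounds) x >= bounds[1] & x <= bounds[2]  # closed interval
`%in()%` <- function(x, bounds) x > bounds[1] & x < bounds[2]    # open interval

x <- c(1, 5, 10)
x %in{}% c(1, 10)   # membership in the set {1, 10}
x %in[]% c(1, 10)   # bounds included
x %in()% c(1, 10)   # bounds excluded
```

With these definitions only the brackets after "in" change between the set and interval readings, which is the consistency argument made above.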

karoliskoncevicius commented 4 years ago

Regarding the subset names, I think I agree with you about having a separate verb to separate those. But %subset% seems a bit long to type.

How do you feel about these:

letters %get[]% c("a", "c")
letters %sub[]% c("a", "c")

?

moodymudskipper commented 4 years ago

subset might be long, but base::sub() is used to replace and get doesn't convey the right meaning in my opinion.

about %in{}% I think I get it better now thanks.

In your current package infixer iris %in{}% "setosa" returns something different than iris %in% "setosa", and iris %in[]% "setosa".

If this is one of the additional cases that you mention, then it is so far inconsistent with the other functions, so if we keep this behavior I believe all in functions should return a list/data.frame when applied on a list/data.frame.

karoliskoncevicius commented 4 years ago

infixer will be deleted soon I think. I never got around to polishing it.

If some functions there do not work properly on matrices - then that's an oversight. The intention was for all those operators to return a matrix output, when the input is a matrix... Like %in{}% did.
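
As a rough illustration of the "matrix in, matrix out" intention, here is one way a shape-preserving %in{}% could be written. The body is an assumption for illustration only; it is not taken from infixer or inops.

```r
# Hypothetical shape-preserving membership test.
`%in{}%` <- function(x, table) {
  if (is.data.frame(x)) {
    # Test each column separately, then bind the results into a logical matrix.
    cols <- lapply(x, function(col) col %in% table)
    return(matrix(unlist(cols, use.names = FALSE), nrow = nrow(x),
                  dimnames = list(NULL, names(x))))
  }
  res <- x %in% table
  if (!is.null(dim(x))) {
    # base %in% drops dimensions, so restore them for matrices and arrays
    dim(res) <- dim(x)
    dimnames(res) <- dimnames(x)
  }
  res
}

m <- matrix(1:6, nrow = 2)
in_m <- m %in{}% c(1, 6)     # logical matrix with the same shape as m

df <- data.frame(a = c("x", "y"), b = c("y", "z"))
in_df <- df %in{}% "y"       # one logical column per data frame column
```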

moodymudskipper commented 4 years ago

got it!

karoliskoncevicius commented 4 years ago

Inquiry: how do you feel about removing the "in" part from the function names?

%{}%, %[]%, %()%

%!{}%, %![]%, %!()%

?

karoliskoncevicius commented 4 years ago

I am not sure about this myself. I think if we will later add some more functionality that still has the interval symbols, but replaces the "in" word - then in should be left in. But otherwise not sure...

karoliskoncevicius commented 4 years ago

Also, can we find alternative name for %like%?

%~% and %!~%? Or would be too cryptic?

moodymudskipper commented 4 years ago

I had missed this batch of messages!

I like these short names but I am afraid that removing the in part might make it confusing as it would be similar but different to what the functions do in the package you got inspiration from, was this package prominent ?

I would also have liked these names for the subsetting versions, except then you cannot generalize them for comparison operators.

about renaming %like%, it does need a new name as I mentioned in comments to your commit as to be consistent we need to make it different from data.table::`%like%`, I think %like{}% would be nice, and would be to data.table::`%like%` what %in{}% is to %in%.

%~% does look good and is clever but it's probably been already used, possibly by a prominent package, and I see it as more potentially confusing than something more explicit.

About the necessity of subsetting variants, we can indeed wait to see if we really miss them, I know that I would miss them for the like variant, but for intervals I'll leave it to you because I'm not likely to use those that much tbh :).

karoliskoncevicius commented 4 years ago

Hmm thinking thinking.

How about %~in{}% ? ~ would specify that we are doing same as %in{}% except with regex matching. The syntax of using ~ I think is quite universal. If I remember correctly - it is used in perl. This would also allow us to include similar notation for selecting elements that occur specified number of times: %#in{}% and %#in[]%.

What do you think?

moodymudskipper commented 4 years ago

I think I like it. I didn't know about ~ being used for regex, if it is then it makes a lot of sense.

karoliskoncevicius commented 4 years ago

Well at least I remember it from my perl days :)

https://perldoc.perl.org/perlretut.html#Part-1%3a-The-basics

We can use that syntax if you like it. I think it would make sense. But maybe a bit cumbersome to write down compared to %like%.

karoliskoncevicius commented 4 years ago

What do you think about %[...]% syntax for subsetting?

So for example: %[in{}]%. Seems similar to long form:

x[x %in{}% c("a", "b")]
x %[in{}]% c("a", "b")

moodymudskipper commented 4 years ago

It makes sense, if we consider that this is readable enough:

x %[in[]]% c("a", "b")
x %[in(]]% c("a", "b")
x %[in[)]% c("a", "b")
x %[==]% c("a", "b")
x %[>]% c("a", "b")

I like that it's straightforward to generalize and that it looks good with comparison ops.
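
A small sketch of the %[...]% idea, assuming each subsetting operator simply indexes x with the result of the matching logical operator. The definitions are hypothetical and only cover three cases.

```r
# Hypothetical logical operator and its subsetting counterparts.
`%in{}%`   <- function(x, set) x %in% set
`%[in{}]%` <- function(x, set) x[x %in{}% set]   # same as x[x %in{}% set]
`%[==]%`   <- function(x, y) x[x == y]
`%[>]%`    <- function(x, y) x[x > y]

x <- c("a", "b", "c", "a")
letters_kept <- x %[in{}]% c("a", "b")
numbers_kept <- c(3, 8, 12) %[>]% 5
```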

some alternative i have or had thought about :

x %subset[]% c("a", "b")
x %subset(]% c("a", "b")
x %subset==% c("a", "b")
x %subset>% c("a", "b")

or

x %vin[]% c("a", "b")
x %vin(]% c("a", "b")
x %vin[)% c("a", "b")
x %v==% c("a", "b")
x %v>% c("a", "b")

or

x %value[]% c("a", "b")
x %value(]% c("a", "b")
x %value[)% c("a", "b")
x %value==% c("a", "b")
x %value>% c("a", "b")

or

x %val[]% c("a", "b")
x %val(]% c("a", "b")
x %val[)% c("a", "b")
x %val==% c("a", "b")
x %val>% c("a", "b")

karoliskoncevicius commented 4 years ago

Suggestion: use %in~% instead of %~in{}% for %like%. I think it's more consistent. The argument after "in" specifies the type of operation, while the argument before "in" should specify the transformation (if any) before doing the operation (like %#in{}% - table before doing %in{}%).

A lot easier to write as well.

karoliskoncevicius commented 4 years ago

We also have %in''% and %in""% that could be used for something (maybe even regex?)

karoliskoncevicius commented 4 years ago

How about %in""% for gsub() and %in''% for gsub(fixed=TRUE) ?

karoliskoncevicius commented 4 years ago

Still thinking about subsetting... If we will use %#in{}% for tables then I would not use %vin% - as the symbol before the in part would have two distinct meanings.

EDIT: I take that back. %!in{}% is already a second meaning. So maybe we can find alternative for # instead.

moodymudskipper commented 4 years ago

How would we implement gsub ? I thought we were only wrapping grepl.

I also don't understand the part about second meanings. I think %#!in{}% is a good name.

I like %in~% , its friends would be %!in~%, %[in~]% and %[!in~]%, which all seem fairly readable to me.

karoliskoncevicius commented 4 years ago

Sorry, I meant grep(), not gsub().

%in~% seems nice to me, so agree. Though in some cases it might be convenient to have a case-insensitive variant, or a fixed=TRUE variant, don't you think?

Regarding the meanings rigmarole - ignore that for now, still thinking how it all adds up. But the main idea is to define "operators" on the right side of in and modifications of those operators on the left, and be consistent with it. Then if we use v for subset - we would use two modifications in the case of %#vin{}% which might be confusing.

karoliskoncevicius commented 4 years ago

To elaborate more on this: we have 3 placeholders:

It would be nice if everything we add here could fit in these 3, without using two rhs at the same time. Though we can probably consider !in a word and get away with it. Though if we later add something like %which{}% or %length{}%, would ! still be convenient: %!length{}%?

moodymudskipper commented 4 years ago

This will be a nice use case for %[<]% : detect potential categorical variables in your dataset :

map_dbl(data, n_distinct) %[<=]% 20

rather than :

counts <- map_dbl(data, n_distinct) 
counts[counts <= 20]
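
The same use case can be tried without purrr or dplyr. The sketch below swaps in base R (vapply() with a small helper instead of map_dbl() and n_distinct()) and assumes a hypothetical %[<=]% that returns the matching values; the example data frame is made up.

```r
# Hypothetical subsetting operator: keep the values that satisfy the comparison.
`%[<=]%` <- function(x, y) x[x <= y]

# Made-up data: one identifier column and one categorical column.
data <- data.frame(id = 1:30, group = rep(c("a", "b", "c"), times = 10))

# Base R stand-in for n_distinct()
n_unique <- function(col) length(unique(col))

candidates <- vapply(data, n_unique, numeric(1)) %[<=]% 20
candidates
#> group 
#>     3
```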

moodymudskipper commented 4 years ago

I've been also thinking about %startsWith% which is like startsWith() but consistent with our other functions when applied on data frames (i.e. returning a matrix of logical). Would come along with %startsWith%<-, %[startsWith]%, %#startsWith%. Same for %endsWith%.

moodymudskipper commented 4 years ago

About variants of %in~%, maybe %in~f% for fixed = TRUE, and %in~p% for perl=TRUE ?

data.table has (in dev version) %plike%, %flike%, and %ilike% to return numeric indices, but f and p have orthogonal uses while i makes sense with any. So data.table would need confusing %iplike% or %pilike% etc to be general, while our approach would be unambiguous and more readable because we'd have %#in~% (wrapped around default grepl), %#in~f% (fixed = TRUE), and %#in~p% (perl = TRUE).
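
To make the suffix idea concrete, here is a sketch of the three logical variants as thin wrappers around grepl(). The operator names come from the proposal above; the bodies are an assumption about how they would be wired up.

```r
# Hypothetical regex operators; the suffix selects the matching engine.
`%in~%`  <- function(x, pattern) grepl(pattern, x)                # default regex
`%in~f%` <- function(x, pattern) grepl(pattern, x, fixed = TRUE)  # literal match
`%in~p%` <- function(x, pattern) grepl(pattern, x, perl = TRUE)   # Perl-style regex

x <- c("a.b", "axb")
x %in~% "a.b"    # "." is a wildcard here, so both elements match
x %in~f% "a.b"   # "." is taken literally, so only the first matches

# Lookahead is only available with perl = TRUE
c("foobar", "foobaz") %in~p% "foo(?=bar)"
```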

moodymudskipper commented 4 years ago

Ah no I think you want # for counts, not for which().

How much do you need # for counts ? I feel that it's more useful to have a shortcut for which (to get numerical indices), than a shortcut for sum (to count), though maybe we can be creative enough to have both.

karoliskoncevicius commented 4 years ago

We can probably have both.

Regarding this:

This will be a nice use case for %[<]% : detect potential categorical variables in your dataset :

map_dbl(data, n_distinct) %[<=]% 20

This is actually a potential scenario for #:

data %#<=% 20

But probably will have to be done column-by-column in case of data.frames.

Regarding which - I almost never use it. But I think we can find a way to incorporate it if we want. Just have to think about the syntax. It feels to me like this package is quite simple in terms of functionality, so a clever and convenient naming scheme is paramount. Worth spending some time thinking about how to name stuff.

With regards to in~ being grep - if we agree with this I can send a pull req changing the %like% to %in~%. Maybe startsWith() and endsWith() can also have a more convenient form? %in^~%, %in$~% or something of that sort?

Also with regards to %[in]% - will we use this for subsetting? Or do you think we can find a nicer alternative.

One additional thought that I want to run by you is - if we use %[in]% to subset, we can probably also add %|in|% for number of elements satisfying the match. i.e.:

if((x %|<|% 0) == 0)
  print("all elements are non-negative")

Though maybe it's a bit excessive.

moodymudskipper commented 4 years ago

I will think more about startsWith but let's forget it for now, probably I spoke too fast and it's not needed as we could just use "^foo" on the rhs with %in~% (the nuance is that startsWith is fixed, but that might not add that much value).


I was still confused by what you meant by %#in%, now I think I get it, and I wonder if it shouldn't be just %#%, see end of post.

I also wonder if %in~% shouldn't be just %~%, actually Romain François has it implemented in his package operators (though it doesn't comply with our standards regarding data frames) :

https://cran.r-project.org/web/packages/operators/operators.pdf

Romain François has good ideas too, for instance he uses a * suffix to wrap in all, and a + option to wrap in any.

One additional thought that I want to run by you is - if we use %[in]% to subset, we can probably also add %|in|% for number of elements satisfying the match

I like it, and it would make sense if we keep the %[foo]% subsetting.

A spontaneous idea though: would it make sense to have only the left side bracket, %[in%, or does it look weird ? We'd be consistent with a [output_type][negation][operation_type][option] pattern, and it wouldn't clutter the right side as in %[in[]]%.

11:13 %[!=% 12
#> [1] 11 13

11:13 %@!=% 12
#> [1] 1 3

11:13 %+==% 12
#> [1] 1

c(11, 11, 12, 13, 13, 13) %#% 2
#> [1] TRUE TRUE FALSE TRUE TRUE TRUE

c(11, 11, 12, 13, 13, 13) %[#% 2
#> [1] 11 11 13 13 13

These would be paired consistently with [negation][operation_type][option]<- operators.


Additional possible output types:

11:13 %/>% 12
#> [1] 0.3333333

11:13 %?>% 12
#> [1] TRUE

11:13 %*>% 12
#> [1] FALSE

c(11, 11, 12, 13, 13, 13) %[[#% 2
#> [1] 11 13
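
A sketch of how a few of these output-type prefixes could map onto base R, assuming @ means which(), + means sum(), / means mean(), ? means any(), * means all(), and # compares the number of occurrences of each value. All definitions are hypothetical.

```r
# Hypothetical prefix operators built on base R reductions.
`%@!=%` <- function(x, y) which(x != y)   # numeric indices
`%+==%` <- function(x, y) sum(x == y)     # count of matches
`%/>%`  <- function(x, y) mean(x > y)     # proportion of matches
`%?>%`  <- function(x, y) any(x > y)      # at least one match
`%*>%`  <- function(x, y) all(x > y)      # every element matches

# "value occurs at least n times": ave() returns each element's group size
`%#%` <- function(x, n) ave(seq_along(x), x, FUN = length) >= n

idx   <- 11:13 %@!=% 12
count <- 11:13 %+==% 12
prop  <- 11:13 %/>% 12
some  <- 11:13 %?>% 12
every <- 11:13 %*>% 12
freq  <- c(11, 11, 12, 13, 13, 13) %#% 2
```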

karoliskoncevicius commented 4 years ago

I would def agree with %~%. I suggested the same 20 days ago, but at the time we agreed that it might overshadow some defined operator in a popular package and introduce conflicts.

I like the @ for indices. I think it's quite logical and makes sense.

But the problem with %#% is that in this case we wouldn't be able to do something like "select all factors that occur exactly 3 times". It would always be "less than"...

karoliskoncevicius commented 4 years ago

Overall, I like almost all of your suggestions actually. Maybe not [[ - because it's quite non-intuitive. However with the pattern you are proposing, I think it would make sense even to drop in. Because I can see little justification for only using it with ranges.

karoliskoncevicius commented 4 years ago

Overall I think we can go in several different directions. Two of which I see as the following:

Using the words to specify operations.

  1. %in()% = get logical indices
  2. %get()% for subsetting
  3. Something like %ind<% or %which<% to get numeric indices
  4. %any<% and %all<% for existence/every checks.
  5. %tab<% or %table<% - a possible replacement for #.

etc. Then we should always keep the word (even in %in~% - to be consistent).

Using only symbols to specify operations.

  1. %{}%, %[)%, %!<% etc. for logical indices.
  2. %[<% for subsetting. Though %[[]% would look a bit weird.
  3. %@{}% etc. for numeric indices.
  4. %?{}% and %*{}% for existence/every.
  5. %#<% for tables.

karoliskoncevicius commented 4 years ago

One correction to the above: just now remembered that %tab% is a different beast. We would need to provide all the others like %get%, %ind%, etc for %tab% as well.

But I only have %#in()% and %in#()% for now.

moodymudskipper commented 4 years ago

I would def agree with %~%. I suggested the same 20 days ago, but at the time we agreed that it might overshadow some defined operator in a popular package and introduce conflicts.

oops :). Yes I still think it's a bit annoying, but they do the same most of the time and our package can replace it completely so I think it might be not that bad. %{}% etc are a bit more confusing because they don't do the same, but if our package covers all related functionalities users shouldn't attach them at the same time anyway I guess...

[[ doesn't look good indeed, I was thinking that something for unique would be nice and just picked whatever in the moment.

I think %get()% is too close to get() which is very different, in that case %subset()% would be worth the two additional characters in my opinion, but it seems you don't like this one much :).

You're right about the 2 possible directions I think, I'll need to sleep on it a few times as I have no strong opinion for now, we could also have aliases while we test the package and see what feels right.

karoliskoncevicius commented 4 years ago

I'll need to sleep on it a few times as I have no strong opinion for now.

Yup please do! I am leaning a bit more towards the words now, for some reason. I played a bit with both variants and words seem easier to understand/remember and actually to write down (as I type letters faster than symbols). But I am not dead-set on it too much yet.

The %get% can be renamed of course. I would like %subset% but would prefer for it to be shorter if possible. Like %val% (as in "value") or %sub% (as in "subset" - however this would also coincide with sub())

moodymudskipper commented 4 years ago

| output type | words | symbols | description |
|---|---|---|---|
| logical | == / != | | equality / inequality |
| logical | > / >= / < / <= | | comparison |
| logical | %in()% / %!in()% | %()% / %!()% | open interval |
| logical | %in(]% / %!in(]% | %(]% / %!(]% | open left closed right |
| logical | %in[)% / %!in[)% | %[)% / %![)% | open right closed left |
| logical | %in[]% / %!in[]% | %[]% / %![]% | closed interval |
| logical | %in{}% / %!in{}% | %{}% / %!{}% | generalized %in% |
| logical | %in~% / %!in~% / %in~f% / %!in~f% / %in~p% / %!in~p% | %~% / %!~% / %~f% / %!~f% / %~p% / %!~p% | regex |
| subset | %subset==% / %subset!=% | %[==% / %[!=% | equality / inequality |
| subset | %subset>% / %subset>=% / %subset<% / %subset<=% | %[>% / %[>=% / %[<% / %[<=% | comparison |
| subset | %subset()% / %subset!()% | %[()% / %[!()% | open interval |
| subset | %subset(]% / %subset!(]% | %[(]% / %[!(]% | open left closed right |
| subset | %subset[)% / %subset![)% | %[[)% / %[![)% | open right closed left |
| subset | %subset[]% / %subset![]% | %[[]% / %[![]% | closed interval |
| subset | %subset{}% / %subset!{}% | %[{}% / %[!{}% | generalized %in% |
| subset | %subset~% / %!subset~% / %subset~f% / %!subset~f% / %subset~p% / %!subset~p% | %[~% / %[!~% / %[~f% / %[!~f% / %[~p% / %[!~p% | regex |
| numeric indices | %which==% / %which!=% | %@==% / %@!=% | equality / inequality |
| numeric indices | %which>% / %which>=% / %which<% / %which<=% | %@>% / %@>=% / %@<% / %@<=% | comparison |
| numeric indices | %which()% / %which!()% | %@()% / %@!()% | open interval |
| numeric indices | %which(]% / %which!(]% | %@(]% / %@!(]% | open left closed right |
| numeric indices | %which[)% / %which![)% | %@[)% / %@![)% | open right closed left |
| numeric indices | %which[]% / %which![]% | %@[]% / %@![]% | closed interval |
| numeric indices | %which{}% / %which!{}% | %@{}% / %@!{}% | generalized %in% |
| numeric indices | %which~% / %!which~% / %which~f% / %!which~f% / %which~p% / %!which~p% | %@~% / %@!~% / %@~f% / %@!~f% / %@~p% / %@!~p% | regex |
| every | %all==% / %all!=% | %*==% / %*!=% | equality / inequality |
| every | %all>% / %all>=% / %all<% / %all<=% | %*>% / %*>=% / %*<% / %*<=% | comparison |
| every | %all()% / %all!()% | %*()% / %*!()% | open interval |
| every | %all(]% / %all!(]% | %*(]% / %*!(]% | open left closed right |
| every | %all[)% / %all![)% | %*[)% / %*![)% | open right closed left |
| every | %all[]% / %all![]% | %*[]% / %*![]% | closed interval |
| every | %all{}% / %all!{}% | %*{}% / %*!{}% | generalized %in% |
| every | %all~% / %!all~% / %all~f% / %!all~f% / %all~p% / %!all~p% | %*~% / %*!~% / %*~f% / %*!~f% / %*~p% / %*!~p% | regex |
| any | %any==% / %any!=% | %?==% / %?!=% | equality / inequality |
| any | %any>% / %any>=% / %any<% / %any<=% | %?>% / %?>=% / %?<% / %?<=% | comparison |
| any | %any()% / %!any()% | %?()% / %?!()% | open interval |
| any | %any(]% / %!any(]% | %?(]% / %?!(]% | open left closed right |
| any | %any[)% / %!any[)% | %?[)% / %?![)% | open right closed left |
| any | %any[]% / %!any[]% | %?[]% / %?![]% | closed interval |
| any | %any{}% / %!any{}% | %?{}% / %?!{}% | generalized %in% |
| any | %any~% / %!any~% / %any~f% / %!any~f% / %any~p% / %!any~p% | %?~% / %?!~% / %?~f% / %?!~f% / %?~p% / %?!~p% | regex |

words :

symbols :

Ambiguity in word versions due to positioning of ! (before in, after subset), can be mitigated by :

We can define aliases so we could have both versions

Maybe all and any versions should be put in the fridge and we'll see if we need them.

My current preference is :

In any case that's a lot of functions and we might need functions factories to build them to avoid huge amount of copy / pastes (will need its own issue)
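
One possible shape for such a factory, sketched under the assumption that each subsetting operator only differs by the comparison it wraps. The helper name make_subset_op and the %subset...% names generated here are illustrative choices, not a settled design.

```r
# Hypothetical factory: turn a logical comparison into a subsetting operator.
make_subset_op <- function(compare) {
  force(compare)  # evaluate now so each closure keeps its own comparison
  function(x, y) x[compare(x, y)]
}

comparisons <- list("==" = `==`, "!=" = `!=`, ">" = `>`,
                    ">=" = `>=`, "<" = `<`, "<=" = `<=`)

# Create %subset==%, %subset!=%, %subset>%, ... in the current environment.
for (symbol in names(comparisons)) {
  assign(paste0("%subset", symbol, "%"), make_subset_op(comparisons[[symbol]]))
}

at_least_5 <- c(1, 5, 10) %subset>=% 5
not_5      <- c(1, 5, 10) %subset!=% 5
```

The same loop could be repeated with a different wrapper for the other output types, which keeps the copy / paste to one line per family.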

karoliskoncevicius commented 4 years ago

Agree that we might need a function factory. Though we can probably also "export" main functionality to functions non-accessible for the user.

As for the words - the long versions of words do not bother you? %subset% for example seems quite a long name. None of the shorter versions of this were appealing?

We can leave the "all" and "any" out for now. Though personally, for my case, I would use these more compared to "which". I almost never use indices for subsetting, I try to stay with logical.

Nice table by the way. The only thing I would change is - make the placement of ! consistent. I don't mind having it always at the front actually. Or always at the back (after the word).

moodymudskipper commented 4 years ago

I don't care that much for which either, it's just that I think they'll be straightforward, while all and any might entail more discussion because !all[] is not the same as all![]. So which do we need? We could even have !all![] etc. I also think they won't be used that much, so for lack of familiarity we'll end up wrapping in all() anyway (just as we might do with any and which).

We can skip which as well, to move forward.

I don't like having all ! in the front because negating a subset doesn't make sense to me. This issue is made clear in the case of all as in my 1st paragraph above, !all[] and all![] mean different things and !subset[] means nothing.
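
The difference between the two negation placements can be checked with plain base R. In this sketch in_closed() is a hypothetical stand-in for a closed interval test, and the two expressions correspond to the !all[] and all![] readings.

```r
# Hypothetical closed-interval test, standing in for %in[]%.
in_closed <- function(x, bounds) x >= bounds[1] & x <= bounds[2]

x <- c(1, 5, 20)

not_all_in <- !all(in_closed(x, c(0, 10)))   # "!all[]": not every element is inside
all_not_in <- all(!in_closed(x, c(0, 10)))   # "all![]": every element is outside

not_all_in   # TRUE, because 20 lies outside [0, 10]
all_not_in   # FALSE, because 1 and 5 lie inside [0, 10]
```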

I propose we go with words with ! at the back (incl %in![]% etc), and we go with %val**% to subset. Meanwhile we keep short aliases at least until the end of our test run, then we decide what we choose or if we keep aliases.

I propose we stick with the logical, subset, and replace variants, the latter being assignment versions of logical ones.

I believe the # / table versions were important to you, I didn't include them here because the expected outputs are still not clear to me, so I propose to let you experiment with them and code them as you see fit if you want, PR them and we can discuss them later.

karoliskoncevicius commented 4 years ago

I agree with everything you wrote here. Let's keep which and any and all out for now. I like val better than subset. And I agree on placing ! at the end after the word.

One thing I would like to hear your opinion on: would it not be more logical to have the replace operator work on ind (subset) instead of in? Somehow this would make more sense to me, as the operator is extracting elements, so it can also replace them. I am thinking names(x) and names(x)<- or x[2] and x[2] <- 0, i.e. replacement works on syntax that otherwise returns elements.

moodymudskipper commented 4 years ago

The thing is if we're trying to be consistent with ==, and we want to have x == 3 <- 4 , this needs to be done on logical functions.

I see it as simpler too, because x %in{}% foo <- value is a shortcut for x[x %in{}% foo] <- value
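
This works because R turns `f(x, y) <- value` into a call to a replacement function named `f<-`. A minimal sketch, assuming the replacement versions just index with the logical result; the definitions are illustrative, not the package's code.

```r
# Hypothetical logical operator and its replacement counterpart.
`%in{}%` <- function(x, table) x %in% table
`%in{}%<-` <- function(x, table, value) {
  x[x %in% table] <- value   # same as x[x %in{}% table] <- value
  x
}

# The same mechanism applied to ==, using e1/e2 as argument names.
`==<-` <- function(e1, e2, value) {
  e1[e1 == e2] <- value
  e1
}

x <- c("a", "b", "c", "a")
x %in{}% c("a", "b") <- "z"   # x is now "z" "z" "c" "z"

y <- c(1, 3, 5, 3)
y == 3 <- 4                   # y is now 1 4 5 4
```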

moodymudskipper commented 4 years ago

while we're on names I wanted to touch a point on argument names too, as we're moving to more consistency with ==, it might make sense to switch back to e1 and e2 (=='s arguments).

x and table are weird arguments, I know I was the one to push for them, but I wanted to be consistent with %in% and I don't think it's the right choice anymore :).

karoliskoncevicius commented 4 years ago

Here is one more syntax format we might consider (if bringing back out):

| output type | mixed | description |
|---|---|---|
| logical | | equality / inequality |
| logical | | comparison |
| logical | %in()% / %out()% | open interval |
| logical | %in(]% / %out(]% | open left closed right |
| logical | %in[)% / %out[)% | open right closed left |
| logical | %in[]% / %out[]% | closed interval |
| logical | %in{}% / %out{}% | generalized %in% |
| logical | %in~% / %out~% / %in~f% / %out~f% / %in~p% / %out~p% | regex |
| subset | %[==% / %[!=% | equality / inequality |
| subset | %[>% / %[>=% / %[<% / %[<=% | comparison |
| subset | %[in()% / %[out()% | open interval |
| subset | %[in(]% / %[out(]% | open left closed right |
| subset | %[in[)% / %[out[)% | open right closed left |
| subset | %[in[]% / %[out[]% | closed interval |
| subset | %[in{}% / %[out{}% | generalized %in% |
| subset | %[in~% / %[out~% / %[in~f% / %[out~f% / %[in~p% / %[out~p% | regex |
| numeric indices | %@in==% / @out=% | equality / inequality |
| numeric indices | %@in>% / %@in>=% / %@in<% / %@in<=% | comparison |
| numeric indices | %@in()% / %@out()% | open interval |
| numeric indices | %@in(]% / %@out(]% | open left closed right |
| numeric indices | %@in[)% / %@out[)% | open right closed left |
| numeric indices | %@in[]% / %@out[]% | closed interval |
| numeric indices | %@in{}% / %@out{}% | generalized %in% |
| numeric indices | %@in~% / %@out~% / %@in~f% / %@out~f% / %@in~p% / %@out~p% | regex |
| every | %*in==% / *out=% | equality / inequality |
| every | %*in>% / %*in>=% / %*in<% / %*in<=% | comparison |
| every | %*in()% / %*out | open interval |
| every | %*in(]% / %*out | open left closed right |
| every | %*in[)% / %*out | open right closed left |
| every | %*in[]% / %*out | closed interval |
| every | %*in{}% / %*out | generalized %in% |
| every | %*in~% / %*out~% / %*in~f% / %*out~f% / %*in~p% / %*out~p% | regex |
| any | %?in==% / ?out=% | equality / inequality |
| any | %?in>% / %?in>=% / %?in<% / %?in<=% | comparison |
| any | %?in()% / %?out()% | open interval |
| any | %?in(]% / %?out(]% | open left closed right |
| any | %?in[)% / %?out[)% | open right closed left |
| any | %?in[]% / %?out[]% | closed interval |
| any | %?in{}% / %?out{}% | generalized %in% |
| any | %?in~% / %?out~% / %?in~f% / %?out~f% / %?in~p% / %?out~p% | regex |

karoliskoncevicius commented 4 years ago

Also - do you think we will try to extend this to include things like %cut%? Knowing this before hand might help choosing the appropriate naming conventions.

moodymudskipper commented 4 years ago

what would %cut% do ? I think I'm comfortable using functions for cut as there are so many ways to do it, I even wrote a package just for that :). https://github.com/moodymudskipper/cutr

The syntax with %out% looks good here, its only issue is that it can't be easily extended as every functionality is linked to a symbol, but we're most probably fine with what we have here anyway, and I'm ok with moving on with it.

karoliskoncevicius commented 4 years ago

Yup I am aware of the package. Already complimented it when you showed it on reddit :) Why are you not putting your packages on CRAN? Do you think they are not ready?

As for the suggestion/question cut could basically apply the replace operators multiple times. Like:

x %in[)% c(1,10) <- "low"
x %in[)% c(10,20) <- "medium"
x %in[)% c(20,30) <- "high"

Would be something like:

x %cut[)% list(low=c(1,10), medium=c(10,20), high=c(20,30))

The nice thing is that it would play nicely with all the ranges that we have (even with in~, etc). And would provide a consistent way between subsetting, checking, and "cutting" into groups.

karoliskoncevicius commented 4 years ago

As mentioned in #4 - we can probably drop %@in%, %?in%, %*in%, etc. In the code they would be more confusing than simply using a proper function. i.e.:

if(any(x %in()% c(0, 10))) { ... }

vs

if(x %?in()% c(0, 10)) { ... }

I would probably choose the first one for clarity.

If that is the case - we are only left with two naming problems:

1) do we use ! or out

2) how to name the subset operator?

Some possibilities:

%[in()%
%[in()]%
%val()% # in this case - how to negate?
%IN()%
%subset()%

moodymudskipper commented 4 years ago

I have dotdot on CRAN (a simple 5 lines function to grow variables without repetition) and will put unglue (text extraction) there as soon as I correct a couple bugs as it had some unexpected twitter success.

My other CRAN candidates are :

The problem is that I always start new stuff and then get overwhelmed, take a break, and come back with a new idea. And I don't have sparring partners, it's the first time here :).

moodymudskipper commented 4 years ago

Isn't this redundant ?

x %cut[)% list(low=c(1,10), medium=c(10,20), high=c(20,30))

What would the following do, and what do we have below 1 and above 30, NAs ?

x %cut[)% list(low=c(1,5), medium=c(10,20), high=c(20,30))

karoliskoncevicius commented 4 years ago

Regarding cut:

I do not think this is redundant, because doing it line by line might not be possible. I.e. the first replacement, x %in[)% c(1, 10) <- "low", would transform x into character, so the later numeric comparisons would no longer work as intended. Unless we expand the ranges to work on multiple ranges at once, in which case cut would probably be redundant.

Regarding the left out ranges:

yup NA would be proper value for this case
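
A sketch of what %cut[)% could look like under those rules, assuming a named list of [lower, upper) bounds and NA for anything not covered. The implementation is hypothetical; it labels a fresh character vector so the numeric x is never coerced along the way.

```r
# Hypothetical cut operator over half-open intervals [lower, upper).
`%cut[)%` <- function(x, ranges) {
  out <- rep(NA_character_, length(x))      # uncovered values stay NA
  for (label in names(ranges)) {
    bounds <- ranges[[label]]
    hit <- !is.na(x) & x >= bounds[1] & x < bounds[2]
    out[hit] <- label                       # x itself stays numeric throughout
  }
  out
}

x <- c(0, 1, 9.9, 10, 25, 30)
groups <- x %cut[)% list(low = c(1, 10), medium = c(10, 20), high = c(20, 30))
groups
#> [1] NA       "low"    "low"    "medium" "high"   NA
```

With a gap in the ranges, such as low = c(1, 5) and medium = c(10, 20), values between 5 and 10 would also come out as NA under this sketch.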

karoliskoncevicius commented 4 years ago

Regarding packages:

Yup I see, you have quite a few. In your case I would probably zoom in on the ones I would like to work on and support, and drop the rest, or leave them on GitHub. I've seen tags but have to admit - I did not quite get the purpose of it :/ Maybe I spent too little time reading the docs. Out of all the listed ones - I like cutr most. Maybe because that's the one I see myself using.