sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/
Other
1.38k stars 132 forks source link

Feature Request: join_many_to_one() and join_one_to_one() #427

Open billdenney opened 3 years ago

billdenney commented 3 years ago

I often need to perform join operations which confirm that the join is either 1:1 or many:1. I have written functions for this in my personal package, but I think that they may be of general interest. That said, they are a bit of a stretch for normal janitor functionality, so if you don't think it's a good fit, no worries.

billdenney commented 2 years ago

@sfirke , If I make a PR for this, do you think it would be of interest? (And no worries if you think it's out of scope.)

sfirke commented 2 years ago

Sorry I missed this the first time. Can you explain a little more? I'm reading that as a wrapper for dplyr::left_join that fails if a record from the left table is matched to more than one in the right table?

billdenney commented 2 years ago

You've effectively got it. For join_one_to_one() it is a wrapper for full_join() that ensures that each row in both the left and right arguments is matched exactly once. For join_many_to_one() it is a wrapper for left_join() that ensures that each row in the left side matches exactly one row in the right table (but, it does not require that all rows in the right table are matched).

In my personal package, I have a more complex join_control() function that lets you specify several kinds of requirements on the join like no missing rows, no duplicated rows, all rows must be present, etc. and you can combine the requirements.

jzadra commented 2 years ago

If you don't put this in janitor, I'd love to see your personal code. I also run into these questions after a join. Looking at the row numbers before and after tells you something, but this sounds like it's much more specific.

billdenney commented 2 years ago

@jzadra , I just made a PR for them. Yes, the personal code is much more extensive in its testing. It creates a row number column for each of the input data.frames and then checks that the output row numbering is as-expected for missingness, uniqueness, etc.

jzadra commented 2 years ago

@billdenney Sounds great. It would be cool if it also provided a summary of which variables were used in the join match that ended up producing multiple rows (and showing the value of those variables), or that didn't have a match and resulted in NAs for the variables from the table being added. Happy to help with this, I think I'd use it a lot.

billdenney commented 2 years ago

@jzadra , It does that summary reporting, too. 😄

jzadra commented 2 years ago

@billdenney Sweet! Is your personal package public?

billdenney commented 2 years ago

It's kinda, sorta public. It is public in that it's on GitHub. You're welcome to check it out (it's my bsd.report repo). But, it should not be relied upon to be stable. I change things in it all the time and do not consider the API to be stable at all. And, I'm aware of the parts of it that are well-thought-through and tested vs the ones that are one-off hacks that got some job done for a given day.

billdenney commented 2 years ago

@jzadra , For this particular feature, I'd suggest going to PR #480.

sda030 commented 2 years ago

Any reason why no suggesting this for the dplyr package? I think many people have similar issues when joining, and I prefer the idea of having all join functions in the same package - mentally easier to remember that one package does everything related to a topic rather than splitting across several. I also think that you join_control is the better selling point than the join_one_to_one(). Perhaps the latter convenience wrapper is better suited for jantitor?

billdenney commented 2 years ago

@sda030 , That's a good question. I thought I had proposed something like this in dplyr in the past, but I had not. I'll put a feeler out over there to see if it would be of interest for dplyr.

billdenney commented 2 years ago

@sda030 , And I've now added an issue over there to see if it would be of interest.