stemangiola / tidySingleCellExperiment

Brings SingleCellExperiment objects to the tidyverse
https://stemangiola.github.io/tidySingleCellExperiment/index.html
34 stars 7 forks source link

group_split #97

Closed B0ydT closed 6 months ago

B0ydT commented 8 months ago

I've got a group_split implementation working for row and column data. I can work on group_by if you're happy with it. If not, let me know where it needs work.

I was a bit stuck overthinking some decisions but just decided to go for it, so I'm very happy to change things up. For example, I wasn't sure if it would be preferable to specify grouping by row/column or autodetecting like I ended up doing.

Issue #71

stemangiola commented 8 months ago

Does the function works with an arbitrary set of variables?

B0ydT commented 8 months ago

Does the function works with an arbitrary set of variables?

It does now!

B0ydT commented 8 months ago

I refactored the code in an attempt to fix the error. The function works when I run it myself and during unit tests but fails when I rebuild and run R CMD Check. It throws an error saying unite doesn't like tidySCE objects, but I have explicitly converted the object to a tibble, so I don't understand why that is still happening.

stemangiola commented 7 months ago

Hello @B0ydT, any news about this PR? Let me know if you need help/more explanation.

stemangiola commented 7 months ago

ping

B0ydT commented 6 months ago

Thanks for your feedback. It was very clear and I addressed most of it. I'm not 100% sure where you want to use select, though.

I'm still getting an R CMD Check error for the call to unite. The function works just fine, but generates an error when checked.

Edit: My example was pbmc_small |> group_split(pbmc_small, groups) 🤦‍♂️

B0ydT commented 6 months ago

I'm not sure it's exactly the method you had in mind, but I think I've fixed it. I totally forgot those logical statements in group_by exist, as I never really use them myself.

I saw a lot of your other methods make use of the original dplyr functions. I had given up on group_split, because I didn't see how a list of tbls would help me split the SCE object, but I just came across group_rows, which allows you to extract the indices from a grouped tbl!

It does not add those "PC_1>0" type columns with logical values yet, but I should be able to add those shortly.

B0ydT commented 6 months ago

Have added the tests.

I am trying to sidestep all of the name corrections so that the names of new columns are consistent with what you'd get from the dplyr functions, i.e. groups=="g1". This is also necessary so that they can be dropped when .keep = FALSE.

The closest I've gotten is

  colData(.tbl) <- .tbl |> 
    colData() |> 
    as_tibble() |> 
    dplyr::mutate(!!!var_list) |> 
    DataFrame(check.names = FALSE)

I can use colData(.tbl) to see that the names made it in unchanged, but any of the dplyr methods for SCEs 'fix' the names. I could, of course, add the columns after the split, but I don't think anyone wants to see groups.....g1. in their column names, and I'd still need to rethink the .keep method. Of course, if the underlying package heavily relies on the assumption that names will all be correct, then my ideal solution may not be viable.

stemangiola commented 6 months ago

Have added the tests.

I am trying to sidestep all of the name corrections so that the names of new columns are consistent with what you'd get from the dplyr functions, i.e. groups=="g1". This is also necessary so that they can be dropped when .keep = FALSE.

The closest I've gotten is

  colData(.tbl) <- .tbl |> 
    colData() |> 
    as_tibble() |> 
    dplyr::mutate(!!!var_list) |> 
    DataFrame(check.names = FALSE)

I can use colData(.tbl) to see that the names made it in unchanged, but any of the dplyr methods for SCEs 'fix' the names. I could, of course, add the columns after the split, but I don't think anyone wants to see groups.....g1. in their column names, and I'd still need to rethink the .keep method. Of course, if the underlying package heavily relies on the assumption that names will all be correct, then my ideal solution may not be viable.

don't see the problem. this looks good to me

 pbmc_small |> group_split(PC_1>0 & groups == "g2") %>% .[[1]] |> select(groups)
tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 75 × 1
   groups
   <chr> 
 1 g2    
 2 g1    
 3 g2    
 4 g2    
 5 g2    
 6 g1    
 7 g1    
 8 g1    
 9 g1    
10 g1    

If it behaves well enough for the vast majority of use cases, I would say let's go with this, and we can improve it in the future.

It would be good to translate this to tidyseurat, and the more complicated tidySummarizedExperiment.

stemangiola commented 6 months ago

Congrats @B0ydT !

Let me know what you think about repurposing your PR.