r-lib / vctrs

Generic programming with typed R vectors
https://vctrs.r-lib.org

Error in `vctrs::vec_locate_matches()`: when trying to join multiple tables #1871

Closed Marlinski95 closed 1 year ago

Marlinski95 commented 1 year ago

Hello, I am trying to join multiple tables (n=35) by the column named "Peptide". The data looks as follows:

head(df1)
                 Peptide                  Glycan        Site      Area
1    AAFNAQNNGSNFQLEEISR Hex(5)HexNAc(4)NeuAc(2) P02765@176;   5598058
2 AALAAFNAQNNGSNFQLEEISR   Hex(5)HexNAc(4)Fuc(2) P02765@176;   1464526
3 AALAAFNAQNNGSNFQLEEISR Hex(4)HexNAc(3)NeuAc(1) P02765@176;   3515959
4 AALAAFNAQNNGSNFQLEEISR Hex(5)HexNAc(4)NeuAc(1) P02765@176;  69934316
5 AALAAFNAQNNGSNFQLEEISR Hex(6)HexNAc(4)NeuAc(1) P02765@176;   1677173
6 AALAAFNAQNNGSNFQLEEISR Hex(5)HexNAc(4)NeuAc(2) P02765@176; 702918502

I created a list for the dfs I want to join:

df_list <- list(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16, df17, df18, df19, df20, df21, df22, df23, df24, df25, df26, df27, df28, df29, df30, df31, df32, df33, df34, df35)

There will most likely be multiple matches or missing data for some dfs, depending on the presence or absence of a given peptide. I ran the following command and received an error, along with a prompt to report it:

df_list %>% reduce(full_join, by='Peptide')
Error in vctrs::vec_locate_matches():
! Match procedure results in an allocation larger than 2^31-1 elements. Attempted allocation size was 2978190961.
ℹ In file match.c at line 2658.
ℹ This is an internal error that was detected in the vctrs package.
  Please report it at https://github.com/r-lib/vctrs/issues with a reprex and the full backtrace.
Backtrace:
    ▆

  1. ├─df_list %>% reduce(full_join, by = "Peptide")
  2. ├─purrr::reduce(., full_join, by = "Peptide")
  3. │ └─purrr:::reduce_impl(.x, .f, ..., .init = .init, .dir = .dir)
  4. │ ├─dplyr (local) fn(out, elt, ...)
  5. │ └─dplyr:::full_join.data.frame(out, elt, ...)
  6. │ └─dplyr:::join_mutate(...)
  7. │ └─dplyr:::join_rows(...)
  8. │ └─dplyr:::dplyr_locate_matches(...)
  9. │ ├─base::withCallingHandlers(...)
  10. │ └─vctrs::vec_locate_matches(...)
  11. └─rlang:::stop_internal_c_lib(...)
  12. └─rlang::abort(message, call = call, .internal = TRUE, .frame = frame)

Can you assist me with this? Cheers, Marlene

DavisVaughan commented 1 year ago

Your join would result in over 2.9 billion rows (2978190961). You should check your join keys carefully, as you are likely missing one, or you are joining on something that produces an explosive number of multiple matches.
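
To see where a row count like this comes from, it can help to estimate the size of a single join before running it. The sketch below is a minimal base R illustration, not part of vctrs or dplyr: `count_join_rows()` and the toy tables `x` and `y` are hypothetical names made up here. For each key value, a join produces (rows in `x`) times (rows in `y`) output rows, so a peptide repeated in many tables multiplies quickly when joins are chained.

```r
# Hypothetical toy tables standing in for two of the real data frames
x <- data.frame(Peptide = c("A", "A", "B"), Area = 1:3)
y <- data.frame(Peptide = c("A", "A", "A", "C"), Area = 4:7)

# Estimate the number of rows a full join on `key` would produce.
# Note: table() drops NA keys, so NA-to-NA matches are not counted here.
count_join_rows <- function(x, y, key) {
  nx <- table(x[[key]])
  ny <- table(y[[key]])
  shared <- intersect(names(nx), names(ny))
  # Each shared key contributes (count in x) * (count in y) rows;
  # as.numeric() avoids integer overflow for very large products
  matched <- sum(as.numeric(nx[shared]) * as.numeric(ny[shared]))
  # Keys present in only one table are kept once per row by a full join
  only_x <- sum(nx[setdiff(names(nx), shared)])
  only_y <- sum(ny[setdiff(names(ny), shared)])
  matched + only_x + only_y
}

count_join_rows(x, y, "Peptide")
```

Running such a check on the intermediate result after each join in the chain would show at which step the row count starts to grow out of control.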

Marlinski95 commented 1 year ago

Hi, thanks for getting back to me so quickly! There will be quite a lot of multiple matches, but I might need to reconsider that setting, I guess. Is there a way to exclude multiple matches but still include matches that have been found in one df but not in another?

Best,
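
One possible reading of this question is: keep every peptide that appears in any table, but avoid the row multiplication caused by repeated peptides. A minimal sketch of that idea follows, assuming it is acceptable to collapse each table to one row per Peptide before joining. Summing Area is purely an illustrative assumption, as are the toy tables and the sample names `s1` and `s2`.

```r
library(dplyr)
library(purrr)

# Hypothetical toy tables; the real ones also have Glycan and Site columns
df1 <- tibble(Peptide = c("AAA", "AAA", "BBB"), Area = c(1, 2, 3))
df2 <- tibble(Peptide = c("AAA", "CCC", "CCC"), Area = c(10, 20, 30))
df_list <- list(s1 = df1, s2 = df2)

# Collapse each table to one row per Peptide, then rename the value
# column after the sample so columns do not clash in the join
collapsed <- imap(df_list, function(df, name) {
  out <- df %>%
    group_by(Peptide) %>%
    summarise(Area = sum(Area), .groups = "drop")
  names(out)[names(out) == "Area"] <- paste0("Area_", name)
  out
})

# With unique keys in every table, the full join keeps all peptides
# without multiplying rows; peptides absent from a table get NA
joined <- reduce(collapsed, full_join, by = "Peptide")
joined
```

If the rows are really distinguished by Glycan and Site as well, joining on all identifying columns, for example `by = c("Peptide", "Glycan", "Site")`, may be the more appropriate fix. Which approach applies depends on the data.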

DavisVaughan commented 1 year ago

I'm not really sure what you are asking for. You'd need to provide a full reprex for us to help you further.

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page.

You can install reprex by running (you may already have it, though, if you have the tidyverse package installed):

install.packages("reprex")

Thanks

Marlinski95 commented 1 year ago

Ok, thank you! I will look into that