Open stemangiola opened 9 months ago
This error suggests that you have multiple rows in your input metadata with the same cell ID (i.e. the `cell_`
column). Can you double-check that your query isn't producing duplicates? For example, is it possible that your `pmap`
is binding multiple rows with the same cell ID?
Sorry, I should have tested it first; you were right.
If it's not too annoying, we could capture this error with a more informative message.
CuratedAtlasQueryR says: ...... Please check that your input metadata does not include duplicated elements in the `cell_` column. For example, execute `<your input metadata> |> count(cell_, name = "number_of_cell_id_instances") |> filter(number_of_cell_id_instances > 1)`
Would you rather I test the input data frame for duplicates (big performance implications), or just catch errors resulting from the code where I try to set the row names, and throw a better error message?
Would
`input |> pull(cell_) |> duplicated() |> any()`
take long for 100M rows?
Or there are faster methods here:
https://stackoverflow.com/questions/37148567/fastest-way-to-remove-all-duplicates-in-r
Or, just to check whether duplicates exist, `anyDuplicated()`.
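For reference, a minimal sketch of the `anyDuplicated` approach mentioned above. It assumes the metadata is already in memory as a data frame or tibble (it would not run inside the database for a lazy dbplyr table), and `has_duplicated_cells` is a hypothetical helper name, not something in CuratedAtlasQueryR.

```r
# Hypothetical helper: not part of CuratedAtlasQueryR.
# Returns TRUE if the cell ID column contains any repeated value.
has_duplicated_cells <- function(metadata, column = "cell_") {
  # anyDuplicated() returns the index of the first duplicate, or 0 if there
  # is none, so no full logical vector has to be built as with duplicated().
  anyDuplicated(metadata[[column]]) > 0
}

# Toy metadata standing in for a real query result
unique_cells <- data.frame(cell_ = c("a", "b", "c"))
repeated_cells <- data.frame(cell_ = c("a", "b", "a"))

has_duplicated_cells(unique_cells)   # FALSE
has_duplicated_cells(repeated_cells) # TRUE
```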
...
But maybe catching the error is actually the right thing to do, as that is exactly what we are doing: replacing one error with another.
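A rough sketch of what catching and re-throwing could look like, under some assumptions: the failing step is taken to be setting row names on a plain base R `data.frame` (the actual package code may do this differently), and `set_cell_rownames` is a hypothetical name used only for illustration.

```r
# Hypothetical helper: the real package code that sets row names may differ.
set_cell_rownames <- function(metadata, column = "cell_") {
  tryCatch(
    {
      # Base R signals an error when a data.frame is given duplicated row names
      rownames(metadata) <- metadata[[column]]
      metadata
    },
    error = function(e) {
      # Replace the low-level error with a hint about the likely cause,
      # keeping the original message for debugging
      stop(
        "Could not use the `", column, "` column as row names. ",
        "Please check that your input metadata does not include duplicated ",
        "elements in that column. Original error: ", conditionMessage(e),
        call. = FALSE
      )
    }
  )
}

# Toy input with unique cell IDs: row names are set as expected
ok <- set_cell_rownames(data.frame(cell_ = c("a", "b"), x = 1:2))
rownames(ok)
```

With duplicated IDs, e.g. `data.frame(cell_ = c("a", "a"), x = 1:2)`, the same call stops with the more informative message instead of the raw row-name error. This adds no extra pass over the data, which is the appeal compared with testing the input up front.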
Yeah, the performance hit probably won't be too bad compared to the time it takes to actually download and process the data. I think the best function to use to detect duplicates would be one that dbplyr
supports, so that it can be run in the database instead of purely in R.
Up to you though.
The input could easily be a tibble, in case you manipulate it first.
I think catching the error is the most transparent thing we can do.
Hi @multimeric,
I get this error for this query