maqin2001 / qubic-r-package

cell and gene names #8

Closed PegasusAM closed 6 years ago

PegasusAM commented 6 years ago

We may want to first add a duplication check for genes and cells. If identical names (genes or cells) exist, issue a warning that lists the duplicated names and their positions.

Considering how important correct cell names are for ARI, let's create a hashset to replace the cell names at the very beginning. After reading the expression data, assign the numbers 1 to n to the cell names and keep using these serial numbers throughout the pipeline (headers are still needed; just replace the real names with the numbers). This way we can drop all cell name comparisons and only check for missing numbers. The same applies to the gene list, but its hashset needs to be kept distinct from the cell hashset.

These two functions could be added before discretization, either built into the current function or as a new function called "sc_nc" (name control).
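A minimal sketch of what such a name-control step might look like, assuming the expression data is a matrix with genes as row names and cells as column names. Only the function name `sc_nc` comes from the proposal; its body, the helper `report_duplicates`, and the shape of the returned list are hypothetical, and two plain lookup tables stand in for the "hashsets".

```r
# Hypothetical name-control step; only the name "sc_nc" is from the proposal.
# Assumes `expr` is a matrix with gene names as rownames and cell names as colnames.
sc_nc <- function(expr) {
  # Warn once per duplicated name, listing every position where it occurs.
  report_duplicates <- function(names, kind) {
    for (nm in unique(names[duplicated(names)])) {
      warning(sprintf("duplicated %s name '%s' at positions %s",
                      kind, nm, paste(which(names == nm), collapse = ", ")),
              call. = FALSE)
    }
  }
  report_duplicates(rownames(expr), "gene")
  report_duplicates(colnames(expr), "cell")

  # Separate lookup tables so gene ids and cell ids are never mixed up.
  genes <- data.frame(id = seq_len(nrow(expr)), name = rownames(expr),
                      stringsAsFactors = FALSE)
  cells <- data.frame(id = seq_len(ncol(expr)), name = colnames(expr),
                      stringsAsFactors = FALSE)

  # Keep the headers, but replace the real names with serial numbers.
  rownames(expr) <- as.character(genes$id)
  colnames(expr) <- as.character(cells$id)

  list(expr = expr, genes = genes, cells = cells)
}

# Toy input with one duplicated cell name ("a").
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneX", "geneY"), c("a", "b", "a")))
res <- sc_nc(m)   # warns about cell name 'a' at positions 1, 3
colnames(res$expr)
res$cells$name[res$cells$id == 3]   # look up the original name of cell 3
```

The original names stay available through `res$genes` and `res$cells`, so results can be mapped back after clustering.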

zy26 commented 6 years ago

@maqin2001, @PegasusAM, I do not recommend duplication checking in our function. This will slow down our program. The users of our program should be responsible for the correctness of their cell and gene names.

One way to handle this situation is to use the function make.names with the unique = TRUE option in R; see ?make.names.
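As a small illustration of that suggestion, here is how make.names behaves on a made-up vector of cell names (the example names are hypothetical). Note that besides making names unique it also rewrites them into syntactically valid R names, which may matter if the names are later matched against external labels; base R's make.unique only deduplicates.

```r
# Made-up cell names: one duplicate, one with a hyphen, one starting with a digit.
cells <- c("cell A", "cell A", "T-cell", "1st")

# Invalid characters become ".", a leading digit gets an "X" prefix,
# and duplicates receive a numeric suffix.
fixed <- make.names(cells, unique = TRUE)
print(fixed)
# "cell.A" "cell.A.1" "T.cell" "X1st"

# make.unique leaves the names as they are and only adds suffixes to duplicates.
deduped <- make.unique(c("a", "b", "a"))
print(deduped)
# "a" "b" "a.1"
```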

PegasusAM commented 6 years ago

@zy26 What about the hashset? Is that also not recommended?