Closed emma-sumner closed 2 years ago
The code in question is simply `duplicated`. Is it not doing what it's supposed to do on your data?
Please provide an SSCCE (http://sscce.org/); otherwise I can't help.
Hi,
Thanks for your response. My data doesn't have duplicates, so I'm not sure how or why destiny is picking them up. I'm losing about 20% of my cells when creating the DiffusionMap. I've run the code below and attached my binary gene activation CSV file.
BinaryRegAct_200122 <- read.csv("~/IL36 Stim Data/SCENIC - ClassMono/BinaryRegAct_200122.csv", row.names = 1)
View(BinaryRegAct_200122)
df <- as.matrix(BinaryRegAct_200122)
dm <- DiffusionMap(data = df, k = 1000)

Warning messages:
1: In dataset_extract_doublematrix(data, vars) :
  Duplicate rows removed from data. Consider explicitly using df[!duplicated(df), ]
2: In (function (data, k, ..., query = NULL, distance = c("euclidean",  :
  find_knn does not yet support sparse matrices, converting data to a dense matrix

BinaryRegAct_200122.csv
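For context, `duplicated()` compares whole rows by their values; row names (cell barcodes) are ignored, so two cells with identical activation patterns count as duplicates even though their barcodes differ. A minimal base-R sketch on toy data:

```r
# Toy binary activation matrix: 4 cells x 3 regulons.
# Rows "cellA" and "cellC" share the same pattern under different barcodes.
m <- matrix(c(0, 1, 1,
              1, 0, 0,
              0, 1, 1,
              1, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("cellA", "cellB", "cellC", "cellD"), NULL))

duplicated(m)       # FALSE FALSE TRUE FALSE - barcodes play no role
sum(duplicated(m))  # 1
```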
Your data has this many duplicated rows:
> BinaryRegAct_200122 |> duplicated() |> sum()
[1] 753
See e.g. these two:
> BinaryRegAct_200122[c('my.data.1_ACCTACCCATCGCTCT', 'my.data.1_ACGGTCGCATTCACAG'), ]
CEBPB EGR1_extended EOMES ETV7_extended FOS_extended FOSB_extended FOXP1 IRF1 IRF7 IRF8
my.data.1_ACCTACCCATCGCTCT 0 1 0 0 1 1 0 0 0 0
my.data.1_ACGGTCGCATTCACAG 0 1 0 0 1 1 0 0 0 0
JUN_extended JUNB_extended JUND MAF MAFB_extended MEF2A MSC_extended NFE2L2 NFKB1 NFKB2 NR1H3
my.data.1_ACCTACCCATCGCTCT 1 1 1 0 0 0 0 0 0 0 0
my.data.1_ACGGTCGCATTCACAG 1 1 1 0 0 0 0 0 0 0 0
POU2F2 REL RUNX3 SPI1 STAT1 STAT2 USF2 ZMIZ1_extended
my.data.1_ACCTACCCATCGCTCT 1 0 0 1 0 0 0 0
my.data.1_ACGGTCGCATTCACAG 1 0 0 1 0 0 0 0
The warning message suggests removing them; maybe you want to do that.
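If dropping the repeated patterns were acceptable, the removal the warning suggests is a one-liner; a minimal sketch on toy data:

```r
# Toy matrix where rows "c1" and "c3" share a pattern.
df <- matrix(c(0, 1,
               1, 0,
               0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("c1", "c2", "c3"), c("g1", "g2")))

# Keep only the first occurrence of each row pattern, as the warning advises.
dedup <- df[!duplicated(df), ]
nrow(dedup)  # 2
```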
Is there any way to bypass the 'dupes' section of the code? I need to include these duplicated rows, as they are different cells that show the same gene expression pattern.
No, the algorithm can’t handle distances that are 0.
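The reason is easy to see on toy data: identical rows sit at Euclidean distance 0 from each other, and any distance-based kernel or nearest-neighbour step degenerates on zero distances.

```r
# Two identical binary rows have pairwise Euclidean distance 0.
x <- rbind(a = c(0, 1, 1, 0),
           b = c(0, 1, 1, 0),
           c = c(1, 0, 0, 1))

d <- dist(x)            # pairwise Euclidean distances
as.matrix(d)["a", "b"]  # 0
```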
What you can do is:

1. Deduplicate while keeping all barcodes as a list column:

library(dplyr)
library(tibble)

# convert to double data type:
BinaryRegAct_200122 <- as.matrix(BinaryRegAct_200122)
mode(BinaryRegAct_200122) <- 'numeric'
BinaryRegAct_200122 <- as.data.frame(BinaryRegAct_200122)

# deduplicate while keeping all names
deduplicated <- BinaryRegAct_200122 |>
  rownames_to_column('cell') |>
  group_by(across(!cell)) |>
  summarise(cell = list(cell))

2. Create the DiffusionMap:

dm <- DiffusionMap(deduplicated, k = ...)

3. Restore the original shape:

library(tidyr)
dm |> as.data.frame() |> unnest(cell)
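The same deduplicate-then-restore idea can also be sketched in base R on toy data. All names below are hypothetical, and the `DiffusionMap` call itself is elided; `expanded` stands in for propagating each embedding row back to every barcode that shares its pattern:

```r
# Toy binary matrix: 4 cells, rows "c1" and "c3" share a pattern.
mat <- matrix(c(0, 1,
                1, 0,
                0, 1,
                1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("c", 1:4), c("g1", "g2")))

# Group cells by pattern, keeping every barcode for each pattern.
key   <- apply(mat, 1, paste, collapse = "")
cells <- split(rownames(mat), key)             # pattern -> barcodes
dedup <- mat[!duplicated(key), , drop = FALSE]

# ... run DiffusionMap on `dedup` here, then copy each result row
# back out to all barcodes sharing its pattern:
expanded <- dedup[match(key, key[!duplicated(key)]), , drop = FALSE]
rownames(expanded) <- rownames(mat)

nrow(dedup)     # 3 unique patterns
nrow(expanded)  # 4 cells, original shape restored
```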
I'm trying to create a diffusion map from a binary activation matrix (contents 0 or 1) made up of 30 genes and approx. 4000 cells. All cells have unique barcode identifiers that are set as row names in the matrix.
When I run the line to create a DiffusionMap as indicated in the vignette (where df is my matrix), I get the following warning message:
RStudio is deleting approximately 700 cells from the matrix as it considers them duplicates. I've gone back to the original CSV file that the matrix is created from and cannot find any duplicate rows. I've tried loading in far fewer cells (n = 50) to see if it was a size issue, but the problem persists.
Has anyone experienced this issue or knows where it might originate from?
Thanks, Emma