theislab / destiny

R package for single cell and other data analysis using diffusion maps
https://theislab.github.io/destiny/
GNU General Public License v3.0

Matrix rows deleted when creating DiffusionMap #56

Closed emma-sumner closed 2 years ago

emma-sumner commented 2 years ago

I'm trying to create a diffusion map from a binary activation matrix (every entry is 0 or 1) made up of 30 genes and approximately 4,000 cells. All cells have unique barcode identifiers that are set as row names in the matrix.

When I run the line to create a DiffusionMap as indicated in the vignette (where df is my matrix), I get the following warning message:

dm <- DiffusionMap(data = df, k = 1000)
In dataset_extract_doublematrix(data, vars) :
  Duplicate rows removed from data. Consider explicitly using df[!duplicated(df), ]

RStudio is deleting approximately 700 cells from the matrix because it considers them duplicates. I've gone back to the original csv file that the matrix is created from and cannot find any duplicate rows. I've tried loading far fewer cells (n = 50) to see whether it was a size issue, but the problem persists.

Has anyone experienced this issue, or does anyone know where it might originate from?

Thanks, Emma

flying-sheep commented 2 years ago

The code in question is simply this:

https://github.com/theislab/destiny/blob/28307e9d5dd755a79a84c2f9049cdd4a2112eacb/R/dataset-helpers.r#L21-L22

Is duplicated() not doing what it’s supposed to on your data?

Please provide an SSCCE (http://sscce.org/); otherwise I can’t help.
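As a small illustration of the behaviour in question, the sketch below applies duplicated() to a toy binary matrix. The cell and gene names are made up, and this is not destiny's own code; it only shows that unique row names do not stop two rows from counting as duplicates, because only the row contents are compared.

```r
# Toy binary matrix: three cells, two genes (names are hypothetical).
m <- matrix(
  c(0, 1,
    0, 1,
    1, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("cell_A", "cell_B", "cell_C"), c("gene1", "gene2"))
)

# The row names are all unique ...
anyDuplicated(rownames(m))  # 0

# ... but duplicated() looks at row contents only, so cell_B is flagged
# as a repeat of cell_A.
dup <- duplicated(m)
dup

# The filter suggested by the warning keeps the first cell of each pattern.
m_unique <- m[!dup, , drop = FALSE]
rownames(m_unique)
```

With only 0/1 entries, different cells can easily share exactly the same pattern even though their barcodes differ.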

emma-sumner commented 2 years ago

Hi,

Thanks for your response. My data doesn't have duplicates, so I'm not sure how or why destiny is picking them up. I'm losing about 20% of my cells when creating the DiffusionMap. I've run the code below and attached my binary gene activation csv file:

BinaryRegAct_200122 <- read.csv("~/IL36 Stim Data/SCENIC - ClassMono/BinaryRegAct_200122.csv", row.names=1)
View(BinaryRegAct_200122)
df <- as.matrix(BinaryRegAct_200122)
dm <- DiffusionMap(data = df, k = 1000)
Warning messages:
1: In dataset_extract_doublematrix(data, vars) :
  Duplicate rows removed from data. Consider explicitly using df[!duplicated(df), ]
2: In (function (data, k, ..., query = NULL, distance = c("euclidean", :
  find_knn does not yet support sparse matrices, converting data to a dense matrix

BinaryRegAct_200122.csv

flying-sheep commented 2 years ago

Your data has this amount of duplicated rows:

> BinaryRegAct_200122 |> duplicated() |> sum()
[1] 753

see e.g. these ones:

> BinaryRegAct_200122[c('my.data.1_ACCTACCCATCGCTCT', 'my.data.1_ACGGTCGCATTCACAG'), ]
                           CEBPB EGR1_extended EOMES ETV7_extended FOS_extended FOSB_extended FOXP1 IRF1 IRF7 IRF8
my.data.1_ACCTACCCATCGCTCT     0             1     0             0            1             1     0    0    0    0
my.data.1_ACGGTCGCATTCACAG     0             1     0             0            1             1     0    0    0    0
                           JUN_extended JUNB_extended JUND MAF MAFB_extended MEF2A MSC_extended NFE2L2 NFKB1 NFKB2 NR1H3
my.data.1_ACCTACCCATCGCTCT            1             1    1   0             0     0            0      0     0     0     0
my.data.1_ACGGTCGCATTCACAG            1             1    1   0             0     0            0      0     0     0     0
                           POU2F2 REL RUNX3 SPI1 STAT1 STAT2 USF2 ZMIZ1_extended
my.data.1_ACCTACCCATCGCTCT      1   0     0    1     0     0    0              0
my.data.1_ACGGTCGCATTCACAG      1   0     0    1     0     0    0              0

The warning message suggests removing them; maybe you want to do that.
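To see which cells collide, it can help to list every member of each duplicated pattern, not just the later copies. The following is a minimal sketch on a made-up data frame (the cell and gene names are assumptions standing in for the real data), using only base R:

```r
# Toy stand-in for the real data frame; cell and gene names are made up.
df <- data.frame(
  gene1 = c(0, 0, 1, 0),
  gene2 = c(1, 1, 0, 1),
  row.names = c("cell_A", "cell_B", "cell_C", "cell_D")
)

# duplicated() flags only the second and later copies of a pattern.
# Scanning from both ends also flags the first copy, so every cell that
# shares its pattern with another cell is selected.
in_dup_group <- duplicated(df) | duplicated(df, fromLast = TRUE)
rownames(df)[in_dup_group]

# Group the cell names by pattern to see which cells belong together.
pattern <- do.call(paste, c(df, sep = "_"))
groups <- split(rownames(df), pattern)
groups
```

Searching the csv by barcode will not reveal these groups, since the barcodes differ; only the gene columns match.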

emma-sumner commented 2 years ago

Is there any way to bypass the 'dupes' section of the code? I need to include these duplicated rows, as they are different cells that show the same gene expression pattern.

flying-sheep commented 2 years ago

No, the algorithm can’t handle distances that are 0.
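For illustration, the sketch below shows where the zero distances come from. It uses base R's dist() on a toy matrix with made-up names, not destiny's internal distance code: two cells with the same pattern are at Euclidean distance 0 from each other, and removing duplicated rows leaves only strictly positive pairwise distances.

```r
# Toy binary matrix; cell_A and cell_B have the same pattern (names are hypothetical).
m <- matrix(
  c(0, 1, 1,
    0, 1, 1,
    1, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("cell_A", "cell_B", "cell_C"), c("gene1", "gene2", "gene3"))
)

# Pairwise Euclidean distances (the default method of dist()).
d <- as.matrix(dist(m))
d["cell_A", "cell_B"]  # 0: identical rows

# After removing duplicated rows, all remaining pairs are a positive distance apart.
m_unique <- m[!duplicated(m), , drop = FALSE]
d_unique <- as.matrix(dist(m_unique))
min_dist <- min(d_unique[upper.tri(d_unique)])
min_dist > 0
```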

What you can do is:

  1. Deduplicate while keeping all names as a list column:

    library(dplyr)   # group_by, across, summarise
    library(tibble)  # rownames_to_column
    library(tidyr)   # unnest
    # change to double data type:
    BinaryRegAct_200122 <- as.matrix(BinaryRegAct_200122)
    mode(BinaryRegAct_200122) <- 'numeric'
    BinaryRegAct_200122 <- as.data.frame(BinaryRegAct_200122)
    # deduplicate while keeping all names
    deduplicated <- BinaryRegAct_200122 |>
      rownames_to_column('cell') |>
      group_by(across(!cell)) |>
      summarise(cell = list(cell), .groups = 'drop')

  2. Create the DiffusionMap:

    dm <- DiffusionMap(deduplicated, k = ...)

  3. Restore the original shape, with one row per cell:

    dm |> as.data.frame() |> unnest(cell)
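The same deduplicate-then-restore idea can also be sketched in base R without a list column. This is an alternative illustration under stated assumptions, not the approach from the comment above: the data are made up, and rowSums() is only a placeholder for whatever per-row result (for example, the diffusion components) is computed on the deduplicated matrix.

```r
# Toy binary matrix; cell_1 and cell_3 share a pattern (names are hypothetical).
m <- matrix(
  c(0, 1,
    1, 0,
    0, 1,
    1, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("cell_", 1:4), c("gene1", "gene2"))
)

# One row per distinct pattern (the first occurrence is kept).
keep <- !duplicated(m)
m_unique <- m[keep, , drop = FALSE]

# For every original cell, the index of its pattern's row in m_unique.
pattern <- apply(m, 1, paste, collapse = "_")
idx <- match(pattern, pattern[keep])

# Placeholder for a per-row result computed on m_unique.
result_unique <- data.frame(score = rowSums(m_unique))

# Expand back to one row per original cell; cells that share a pattern
# receive the same values.
result_all <- result_unique[idx, , drop = FALSE]
rownames(result_all) <- rownames(m)
result_all
```

Either way, cells with identical patterns end up at identical coordinates, which matches the fact that the input gives no way to tell them apart.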