swolock / scrublet

Detect doublets in single-cell RNA-seq data
MIT License

Export scrublet results #5

Open hamishking opened 5 years ago

hamishking commented 5 years ago

Hi there, Thanks for a great tool! I have a very basic question that comes from my complete lack of python know-how (I normally work in R). I just want to export the scrublet results so I can integrate them with Seurat in R. What I would like to generate is a table with the Doublet score and Predicted doublet status for each cell barcode and just can't figure out how to interact with the scrublet object to do this. Apologies again for such a basic question! Hamish

swolock commented 5 years ago

Hi @hamishking, Assuming your Scrublet object is called scrub (as in the example notebooks), the scores and doublet predictions (True/False mask) are stored in scrub.doublet_scores_obs_ and scrub.predicted_doublets_, respectively.

You have a couple straightforward options for exporting these two arrays.

  1. Write two separate files:

    import numpy as np
    np.savetxt('predicted_doublet_mask.txt', scrub.predicted_doublets_, fmt='%s')
    np.savetxt('doublet_scores.txt', scrub.doublet_scores_obs_, fmt='%.4f')

  2. Use the pandas library (good to know about, especially if you're more familiar with R) to write a single table (.csv/.tsv):

    import pandas as pd
    df = pd.DataFrame({
        'doublet_score': scrub.doublet_scores_obs_,
        'predicted_doublet': scrub.predicted_doublets_
    })
    df.to_csv('scrublet_output_table.csv', index=False)

swolock commented 5 years ago

Also, I should add some more visible documentation, but in general you can find an object or function's docs using the help function. For example, assuming the package was imported with `import scrublet as scr`, help(scr.Scrublet) lists the various attributes of the Scrublet object, including predicted_doublets_ and doublet_scores_obs_.

hamishking commented 5 years ago

Thank you for the help with this @swolock. One final question: will the order of the doublet scores output this way be exactly the same as the order of the barcodes in the counts_matrix? If so, as I am using 10X data, I guess I can combine the doublet scores with the cell barcodes from barcodes.tsv just by putting the columns side by side.

swolock commented 5 years ago

Yes, the scores will exactly match the original rows (=barcodes) of the counts matrix supplied to Scrublet.
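Since the order is preserved, the barcodes can be added as a column of the exported table. The following is a minimal sketch, not part of Scrublet itself: the function name `scrublet_results_table` and the file names in the usage comments are made up for illustration, and it assumes the barcodes come from the same barcodes.tsv that accompanies the counts matrix passed to Scrublet.

```python
import numpy as np
import pandas as pd


def scrublet_results_table(barcodes, doublet_scores, predicted_doublets):
    """Combine barcodes with Scrublet outputs into one table.

    Hypothetical helper: relies on all three inputs being in the same
    order as the rows of the counts matrix given to Scrublet.
    """
    barcodes = np.asarray(barcodes)
    # A length mismatch usually means the barcodes file belongs to a
    # different (or differently filtered) matrix.
    if not (len(barcodes) == len(doublet_scores) == len(predicted_doublets)):
        raise ValueError('barcodes and Scrublet outputs differ in length')
    return pd.DataFrame({
        'barcode': barcodes,
        'doublet_score': doublet_scores,
        'predicted_doublet': predicted_doublets,
    })


# Usage (placeholder paths; `scrub` is the Scrublet object from above):
# barcodes = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0].to_numpy()
# table = scrublet_results_table(
#     barcodes, scrub.doublet_scores_obs_, scrub.predicted_doublets_)
# table.to_csv('scrublet_output_table.csv', index=False)
```

The length check is there because a silent mismatch would pair scores with the wrong cells once the table is loaded in R.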

DanSchnell commented 5 years ago

Related to this post about exporting results, is there a way to access the estimated standard errors of the doublet scores without going in and modifying the scrub_doublets() function?

swolock commented 5 years ago

Hi @DanSchnell

After running scrub_doublets(), the standard errors (and a bunch of other stuff) are stored in the attributes of the Scrublet object (e.g., scrub.doublet_errors_obs_ for the errors for the observed transcriptomes). See here for all of the attributes.
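As a rough sketch of how those attributes could be collected into one table: the helper name `scores_with_errors` is made up here, and the stand-in object with invented numbers only exists so the snippet runs without Scrublet installed. With a real analysis you would pass your own `scrub` object after calling scrub_doublets().

```python
import types

import numpy as np
import pandas as pd


def scores_with_errors(scrub):
    """Tabulate observed doublet scores with their standard errors.

    Hypothetical helper; it only reads the two attributes named above.
    """
    return pd.DataFrame({
        'doublet_score': scrub.doublet_scores_obs_,
        'doublet_score_stderr': scrub.doublet_errors_obs_,
    })


# Stand-in for a real Scrublet object; the values are made up.
fake_scrub = types.SimpleNamespace(
    doublet_scores_obs_=np.array([0.05, 0.40]),
    doublet_errors_obs_=np.array([0.01, 0.08]),
)
table = scores_with_errors(fake_scrub)
print(table)
```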

fangling0913 commented 5 years ago

@swolock I have a problem with the type of the input data. My data is not from 10X. I have a matrix like this:

    gene    cell    count
    0610005C13Rik   AATCCGTCAGCAGGAAGAATCTGA    1
    0610005C13Rik   ACACAGAAGAGCTGAAACTATGCA    1
    0610005C13Rik   CAAGACTAATGCCTAAACAAGCTA    1
    0610005C13Rik   CCGTGAGAAGTGGTCAACGCTCGA    1
    0610005C13Rik   GCGAGTAACAAGACTAATTGAGGA    1
    0610006L08Rik   AAGACGGAACATTGGCACATTGGC    1
    0610006L08Rik   AAGACGGACTGTAGCCCAATGGAA    1
    0610006L08Rik   AAGGACACACACAGAAAAGACGGA    1
    0610006L08Rik   AAGGACACGCGAGTAAAGTCACTA    1

How can I use Scrublet with this, or do I have to convert it to mtx format? The file is too large to convert.

swolock commented 5 years ago

Hi @fangling0913

I haven't seen this format before, and I don't know of an existing python function to read it. The following code should do the job, but be warned that it's untested and not at all optimized. After running it, you'll have a counts matrix that can be fed to Scrublet, plus lists of gene names and barcodes in case you need them for other purposes.

import numpy as np
import scipy.sparse as ssp

# specify input file
input_filename = 'input.txt'

""" Format of input.txt: 
0610005C13Rik   AATCCGTCAGCAGGAAGAATCTGA    1
0610005C13Rik   ACACAGAAGAGCTGAAACTATGCA    1
0610005C13Rik   CAAGACTAATGCCTAAACAAGCTA    1
0610005C13Rik   CCGTGAGAAGTGGTCAACGCTCGA    1
0610005C13Rik   GCGAGTAACAAGACTAATTGAGGA    1
0610006L08Rik   AAGACGGAACATTGGCACATTGGC    1
0610006L08Rik   AAGACGGACTGTAGCCCAATGGAA    1
0610006L08Rik   AAGGACACACACAGAAAAGACGGA    1
0610006L08Rik   AAGGACACGCGAGTAAAGTCACTA    1
"""

# Initialize dicts for keeping track 
# of gene and barcode indices
barcode_ix_dict = {}
gene_ix_dict = {}

# Initialize lists of row names (barcodes)
# and column names (genes)
barcode_list = []
gene_list = []

# Initialize lists of data/indices to be 
# used to create scipy sparse coo_matrix
data = []
row_ix = []
col_ix = []

# Read in the data
with open(input_filename, 'r') as f:
    for iL, line in enumerate(f):
        # skip the first header line
        if iL == 0:
            continue
        gene, barcode, umi = line.rstrip('\n').split('\t')
        # Test membership against the dicts (constant time) rather than
        # the lists (linear time); this matters a lot for large files
        if gene not in gene_ix_dict:
            gene_ix_dict[gene] = len(gene_list)
            gene_list.append(gene)
        if barcode not in barcode_ix_dict:
            barcode_ix_dict[barcode] = len(barcode_list)
            barcode_list.append(barcode)
        col_ix.append(gene_ix_dict[gene])
        row_ix.append(barcode_ix_dict[barcode])
        data.append(int(umi))

# Create the following:
#  `counts_matrix`: scipy.sparse.csc_matrix, to be used as the
#                   counts matrix (cells x genes) for scrublet, 
#                   scanpy, etc.
#  `barcode_list` : numpy array of cell barcode IDs
#  `gene_list`    : numpy array of gene names

counts_matrix = ssp.coo_matrix(
    (data, (row_ix, col_ix)), 
    shape=(len(barcode_list), len(gene_list))
).tocsc()
barcode_list = np.array(barcode_list)
gene_list = np.array(gene_list)
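
If the whole table fits in memory, the same conversion can probably be written more compactly with pandas. This is only a sketch under that assumption: the function name `read_long_counts` is made up, and it expects one header line followed by tab-separated gene, barcode, count rows as in the example above.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp


def read_long_counts(path):
    """Read a gene/barcode/count table into a cells x genes sparse matrix.

    Hypothetical helper; assumes the file has a single header line.
    """
    table = pd.read_csv(
        path, sep='\t', header=0, names=['gene', 'barcode', 'count'],
        dtype={'gene': str, 'barcode': str, 'count': int},
        keep_default_na=False,  # keep gene names such as "NA" as text
    )
    # factorize assigns integer codes in order of first appearance
    col_ix, genes = pd.factorize(table['gene'])
    row_ix, barcodes = pd.factorize(table['barcode'])
    counts = ssp.coo_matrix(
        (table['count'].to_numpy(), (row_ix, col_ix)),
        shape=(len(barcodes), len(genes)),
    ).tocsc()
    return counts, np.asarray(barcodes), np.asarray(genes)


# Usage (placeholder file name):
# counts_matrix, barcode_list, gene_list = read_long_counts('input.txt')
```

Rows are barcodes and columns are genes, matching the cells x genes layout described in the comments above.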

mmarchin commented 4 years ago

> Yes, the scores will exactly match the original rows (=barcodes) of the counts matrix supplied to Scrublet.

I'm having trouble figuring out the cell barcodes for each row. They don't seem to match up with my Seurat data in terms of number of rows... Is there any way to include them in the exported file as an additional column? Sorry, not a python expert.

Wenfei-Sun commented 4 years ago

It worked for me. How about removing the filtering arguments (min.cells = 3, min.features = 100, etc.) when calling CreateSeuratObject?

mmarchin commented 4 years ago

Thanks. Turns out I mixed up some samples... : P

yeroslaviz commented 4 years ago

Hope it is not too late to add a question here.

Is it possible to export the df (CSV file) with the row names (cell names) included in the same file?

yingyonghui commented 4 years ago

Hi all, I wonder whether I can first filter cells and genes with Seurat and then import the filtered raw.data matrix into Scrublet, rather than using the .mtx file from Cell Ranger. In my opinion, arguments such as min.cells = 3 and min.features = 100 would help remove noise from the data, and doublet detection would then be more accurate. Is that true or not?

Any suggestions would be appreciated! Many thanks!

Best, Yonghui

ColeKeenum commented 3 years ago

I know that this is basically dead, but I wanted to add that the rows (cells) of the matrix used by Scrublet in Python should correspond to the columns (cells) of the matrix returned by the Read10X function in R with Seurat. If you don't use any cutoffs, the number of cells in that matrix and the number of rows in the CSV output generated as in swolock's comment above should be the same.
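
When cutoffs have been applied on the Seurat side, one way to check the correspondence is to match by barcode instead of by position. The sketch below assumes a Scrublet table with a 'barcode' column and a plain list of the barcodes kept in Seurat (for example, written out from R into a text file). The function name `align_scrublet_table` and the file names are made up for illustration.

```python
import pandas as pd


def align_scrublet_table(scrublet_table, seurat_barcodes):
    """Reorder Scrublet results to follow the Seurat barcode order.

    Hypothetical helper. Returns the aligned table and the list of
    Seurat barcodes that have no Scrublet result. Barcodes in the
    Scrublet table must be unique, otherwise reindex raises an error.
    """
    indexed = scrublet_table.set_index('barcode')
    missing = [bc for bc in seurat_barcodes if bc not in indexed.index]
    # Missing barcodes are kept as rows filled with NaN
    aligned = indexed.reindex(seurat_barcodes)
    return aligned, missing


# Usage (placeholder file names):
# scrublet_table = pd.read_csv('scrublet_output_table.csv')
# seurat_barcodes = pd.read_csv('seurat_barcodes.txt', header=None)[0].tolist()
# aligned, missing = align_scrublet_table(scrublet_table, seurat_barcodes)
# print(len(missing), 'Seurat barcodes have no Scrublet score')
```

A non-empty `missing` list usually points to mixed-up samples or to barcodes formatted differently on the two sides.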