Open hamishking opened 5 years ago
Hi @hamishking,
Assuming your Scrublet object is called scrub (as in the example notebooks), the scores and doublet predictions (True/False mask) are stored in scrub.doublet_scores_obs_ and scrub.predicted_doublets_, respectively.
You have a couple of straightforward options for exporting these two arrays.
Write two separate files:
import numpy as np
np.savetxt('predicted_doublet_mask.txt', scrub.predicted_doublets_, fmt='%s')
np.savetxt('doublet_scores.txt', scrub.doublet_scores_obs_, fmt='%.4f')
Use the pandas library (good to know about, especially if you're more familiar with R) to write a single table (.csv/.tsv):
import pandas as pd
df = pd.DataFrame({
    'doublet_score': scrub.doublet_scores_obs_,
    'predicted_doublet': scrub.predicted_doublets_
})
df.to_csv('scrublet_output_table.csv', index=False)
Also, I should add some more visible documentation, but in general you can find an object or function's docs using the help function. For example, help(scr.Scrublet) lists the various attributes of the Scrublet object, including predicted_doublets_ and doublet_scores_obs_.
Thank you for the help with this @swolock. One final question: will the order of the doublet scores output this way be exactly the same as the order of the barcodes in the counts_matrix? If so, as I am using 10X data, I guess I can combine the doublet scores with the cell barcodes from barcodes.tsv just by putting the columns side by side.
Yes, the scores will exactly match the original rows (=barcodes) of the counts matrix supplied to Scrublet.
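Since the row order is preserved, attaching barcodes is just a matter of adding one more column. The following is a minimal sketch, not part of Scrublet itself: build_doublet_table is a hypothetical helper, the barcodes shown are stand-in values, and the barcodes.tsv path in the comments is an assumption about where your 10X files live.

```python
import numpy as np
import pandas as pd


def build_doublet_table(barcodes, doublet_scores, predicted_doublets):
    """Pair each barcode with its doublet score and call, row by row.

    Hypothetical helper: it relies only on the three inputs being in the
    same order as the rows of the counts matrix given to Scrublet.
    """
    barcodes = np.asarray(barcodes)
    doublet_scores = np.asarray(doublet_scores)
    predicted_doublets = np.asarray(predicted_doublets)
    # A length mismatch usually means the barcode file does not belong
    # to the counts matrix that was scored.
    if not (len(barcodes) == len(doublet_scores) == len(predicted_doublets)):
        raise ValueError('barcodes, scores and predictions differ in length')
    return pd.DataFrame({
        'barcode': barcodes,
        'doublet_score': doublet_scores,
        'predicted_doublet': predicted_doublets,
    })


# Stand-in values so the sketch runs without a Scrublet object.
example_table = build_doublet_table(
    ['AAACCTGAGAAACCAT-1', 'AAACCTGAGAAACCGC-1', 'AAACCTGAGAAACCTA-1'],
    [0.05, 0.62, 0.11],
    [False, True, False],
)

# With real data (file paths are assumptions):
# barcodes = pd.read_csv('barcodes.tsv', sep='\t', header=None)[0].values
# table = build_doublet_table(barcodes,
#                             scrub.doublet_scores_obs_,
#                             scrub.predicted_doublets_)
# table.to_csv('scrublet_output_table.csv', index=False)
```

The resulting CSV can then be read in R and matched to cells by the barcode column rather than by row position.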
Related to this post about exporting results, is there a way to access the estimated standard errors of the doublet scores without going in and modifying the scrub_doublets() function?
Hi @DanSchnell
After running scrub_doublets(), the standard errors (and a bunch of other stuff) are stored in the attributes of the Scrublet object (e.g., scrub.doublet_errors_obs_ for the errors for the observed transcriptomes). See here for all of the attributes.
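As a rough illustration, the errors can be exported alongside the scores in the same way as the other arrays. In this sketch, scores_with_errors is a hypothetical helper and fake_scrub is a stand-in object carrying only the two attributes mentioned in this thread, so the example runs without Scrublet installed.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def scores_with_errors(scrub):
    """Collect doublet scores and their standard errors into one table.

    Hypothetical helper; it only reads the doublet_scores_obs_ and
    doublet_errors_obs_ attributes of the object passed in.
    """
    return pd.DataFrame({
        'doublet_score': np.asarray(scrub.doublet_scores_obs_),
        'doublet_score_stderr': np.asarray(scrub.doublet_errors_obs_),
    })


# Stand-in for a Scrublet object after scrub_doublets() has been run.
fake_scrub = SimpleNamespace(
    doublet_scores_obs_=np.array([0.05, 0.62]),
    doublet_errors_obs_=np.array([0.01, 0.08]),
)
error_table = scores_with_errors(fake_scrub)

# With a real object:
# scores_with_errors(scrub).to_csv('doublet_scores_with_errors.csv', index=False)
```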
@swolock I have a problem with the type of the input data. My data is not from 10X. I have a matrix like this:
gene cell count
0610005C13Rik AATCCGTCAGCAGGAAGAATCTGA 1
0610005C13Rik ACACAGAAGAGCTGAAACTATGCA 1
0610005C13Rik CAAGACTAATGCCTAAACAAGCTA 1
0610005C13Rik CCGTGAGAAGTGGTCAACGCTCGA 1
0610005C13Rik GCGAGTAACAAGACTAATTGAGGA 1
0610006L08Rik AAGACGGAACATTGGCACATTGGC 1
0610006L08Rik AAGACGGACTGTAGCCCAATGGAA 1
0610006L08Rik AAGGACACACACAGAAAAGACGGA 1
0610006L08Rik AAGGACACGCGAGTAAAGTCACTA 1
How can I use Scrublet with this, or do I have to convert it to mtx format? The file is too large to convert.
Hi @fangling0913
I haven't seen this format before, and I don't know of an existing python function to read it. The following code should do the job, but be warned that it's untested and not at all optimized. After running it, you'll have a counts matrix that can be fed to Scrublet, plus lists of gene names and barcodes in case you need them for other purposes.
import numpy as np
import scipy.sparse as ssp
# specify input file
input_filename = 'input.txt'
""" Format of input.txt:
0610005C13Rik AATCCGTCAGCAGGAAGAATCTGA 1
0610005C13Rik ACACAGAAGAGCTGAAACTATGCA 1
0610005C13Rik CAAGACTAATGCCTAAACAAGCTA 1
0610005C13Rik CCGTGAGAAGTGGTCAACGCTCGA 1
0610005C13Rik GCGAGTAACAAGACTAATTGAGGA 1
0610006L08Rik AAGACGGAACATTGGCACATTGGC 1
0610006L08Rik AAGACGGACTGTAGCCCAATGGAA 1
0610006L08Rik AAGGACACACACAGAAAAGACGGA 1
0610006L08Rik AAGGACACGCGAGTAAAGTCACTA 1
"""
# Initialize dicts for keeping track
# of gene and barcode indices
barcode_ix_dict = {}
gene_ix_dict = {}
# Initialize lists of row names (barcodes)
# and column names (genes)
barcode_list = []
gene_list = []
# Initialize lists of data/indices to be
# used to create scipy sparse coo_matrix
data = []
row_ix = []
col_ix = []
# Read in the data
with open(input_filename, 'r') as f:
    for iL, line in enumerate(f):
        # skip the first header line
        if iL == 0:
            continue
        gene, barcode, umi = line.strip('\n').split('\t')
        # check the dicts (not the lists): dict lookups stay fast
        # even with many genes and barcodes
        if gene not in gene_ix_dict:
            gene_ix_dict[gene] = len(gene_list)
            gene_list.append(gene)
        if barcode not in barcode_ix_dict:
            barcode_ix_dict[barcode] = len(barcode_list)
            barcode_list.append(barcode)
        col_ix.append(gene_ix_dict[gene])
        row_ix.append(barcode_ix_dict[barcode])
        data.append(int(umi))
# Create the following:
# `counts_matrix`: scipy.sparse.csc_matrix, to be used as the
# counts matrix (cells x genes) for scrublet,
# scanpy, etc.
# `barcode_list` : numpy array of cell barcode IDs
# `gene_list` : numpy array of gene names
counts_matrix = ssp.coo_matrix(
    (data, (row_ix, col_ix)),
    shape=(len(barcode_list), len(gene_list))
).tocsc()
barcode_list = np.array(barcode_list)
gene_list = np.array(gene_list)
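If pandas is available, the same conversion can be sketched more compactly. This is an alternative under stated assumptions, not the approach above: long_table_to_counts is a hypothetical helper, the file is assumed to be tab-separated with one header line, and the whole table is assumed to fit in memory. pd.factorize assigns each barcode and gene an integer index in order of first appearance.

```python
import io

import numpy as np
import pandas as pd
import scipy.sparse as ssp


def long_table_to_counts(path_or_buffer):
    """Convert a gene/barcode/count table into a cells x genes matrix.

    Hypothetical helper; assumes tab-separated input with a header line.
    """
    table = pd.read_csv(
        path_or_buffer, sep='\t', header=0,
        names=['gene', 'barcode', 'count'],
        dtype={'gene': str, 'barcode': str},
    )
    # Integer codes in order of first appearance, plus the unique labels.
    row_ix, barcodes = pd.factorize(table['barcode'])
    col_ix, genes = pd.factorize(table['gene'])
    counts = ssp.coo_matrix(
        (table['count'].to_numpy(), (row_ix, col_ix)),
        shape=(len(barcodes), len(genes)),
    ).tocsc()
    return counts, np.asarray(barcodes), np.asarray(genes)


# Tiny made-up table to show the shape of the result.
demo_text = (
    'gene\tcell\tcount\n'
    'GeneA\tBC1\t1\n'
    'GeneA\tBC2\t2\n'
    'GeneB\tBC1\t3\n'
)
counts, barcodes, genes = long_table_to_counts(io.StringIO(demo_text))
```

For a real file, pass its path instead of the StringIO buffer; rows of the result are barcodes and columns are genes, matching the orientation Scrublet expects.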
Yes, the scores will exactly match the original rows (=barcodes) of the counts matrix supplied to Scrublet.
I'm having trouble figuring out the cell barcodes for each row. They don't seem to match up with my Seurat data in terms of number of rows... Is there any way to include them in the exported file as an additional column? Sorry, not a python expert.
It worked for me. How about removing the arguments (min.cells = 3, min.features = 100, etc.) when calling CreateSeuratObject?
Thanks. Turns out I mixed up some samples... : P
I hope it is not too late to add a question here: is it possible to export the df (CSV file) with the row names (cell names) in one file?
Hi all, I wonder whether I can first filter cells and genes with Seurat and then import the raw.data matrix into Scrublet, rather than the .mtx file from cellranger. In my opinion, the arguments (min.cells = 3, min.features = 100, etc.) would help to remove noise in the data, and doublet detection would then be more accurate. Is that true or not?
Any suggestions would be appreciated! Many thanks!
Best, Yonghui
I know that this is basically dead, but I wanted to add that the cells (rows) of the counts matrix in Scrublet using Python should correspond to the cells (columns) of the matrix generated with the Read10X function in R with Seurat. If you don't use any cutoffs, the number of cells in the matrix and the number of rows in a CSV output generated as in swolock's comment above should be equal.
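When cutoffs have been applied on the Seurat side, matching by barcode rather than by row position avoids the mismatch. The sketch below assumes a table that already has a barcode column and a list of barcodes kept by Seurat (for instance, exported from R to a text file); align_to_kept_barcodes and all values shown are hypothetical.

```python
import pandas as pd


def align_to_kept_barcodes(scrublet_table, kept_barcodes):
    """Return the rows for kept_barcodes, in that order.

    Hypothetical helper; scrublet_table must have a 'barcode' column.
    """
    indexed = scrublet_table.set_index('barcode')
    missing = [b for b in kept_barcodes if b not in indexed.index]
    if missing:
        # Barcodes absent from the Scrublet table suggest mixed-up samples.
        raise KeyError(f'{len(missing)} barcodes not found, e.g. {missing[0]}')
    return indexed.loc[list(kept_barcodes)].reset_index()


# Stand-in data: three scored cells, two of which survived filtering.
full_table = pd.DataFrame({
    'barcode': ['BC1', 'BC2', 'BC3'],
    'doublet_score': [0.05, 0.62, 0.11],
    'predicted_doublet': [False, True, False],
})
aligned = align_to_kept_barcodes(full_table, ['BC3', 'BC1'])
```

The error raised for unknown barcodes is a deliberate choice in this sketch: silently dropping them would hide exactly the kind of sample mix-up described earlier in the thread.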
Hi there, Thanks for a great tool! I have a very basic question that comes from my complete lack of python know-how (I normally work in R). I just want to export the scrublet results so I can integrate them with Seurat in R. What I would like to generate is a table with the Doublet score and Predicted doublet status for each cell barcode and just can't figure out how to interact with the scrublet object to do this. Apologies again for such a basic question! Hamish