Ambiguous type for find_nonambient_barcodes input

nh3 / emptydrops

Python implementation of emptydrops() in CellRanger v3.0.2

MIT License

Ambiguous type for find_nonambient_barcodes input #1

Open CowanCS1 opened 4 years ago

CowanCS1 commented 4 years ago

Hi,

The function find_nonambient_barcodes lists the input requirement:

orig_cell_bcs (iterable of str): Strings of initially-called cell barcodes.

However, because the default meaning of "str" changed between Python 2 (bytes) and Python 3 (unicode), this breaks under Python 3. Worse, with unicode input none of the barcodes in the original list are matched, and the code hits an uninformative "return None".

    # Choose candidate cell barcodes
    orig_cell_bc_set = set(orig_cell_bcs)
    orig_cells = np.flatnonzero(np.fromiter((bc in orig_cell_bc_set for bc in matrix.bcs),
                                            count=len(matrix.bcs), dtype=bool))

    # No good incoming cell calls
    if orig_cells.sum() == 0:
        return None

Suggestions for fixing it: 1) cast each string to bytes before building the set, e.g. orig_cell_bcs = tuple(i.encode('ascii') for i in orig_cell_bcs); 2) raise an informative exception instead of returning None.
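Both suggestions can be sketched together in a few lines. This is only an illustration, not the package's code: the helper names to_bytes_barcodes and require_overlap are hypothetical, and it assumes barcodes are plain ASCII.

```python
def to_bytes_barcodes(barcodes, encoding="ascii"):
    """Return barcodes as a tuple of bytes, encoding any str entries.

    Hypothetical helper: assumes every barcode is either str or bytes-like.
    """
    return tuple(
        bc.encode(encoding) if isinstance(bc, str) else bytes(bc)
        for bc in barcodes
    )


def require_overlap(orig_cell_bcs, matrix_bcs):
    """Return the shared barcodes, raising if there are none.

    Hypothetical helper: replaces a silent "return None" with an error
    that names the likely cause.
    """
    orig = set(to_bytes_barcodes(orig_cell_bcs))
    known = set(to_bytes_barcodes(matrix_bcs))
    shared = orig & known
    if not shared:
        raise ValueError(
            "None of the supplied orig_cell_bcs were found among the matrix "
            "barcodes; check for a str/bytes mismatch or a different "
            "barcode suffix."
        )
    return shared
```

Normalizing both sides to bytes before comparing means the caller can pass either str or bytes without needing to know how the matrix stores its barcodes.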

Thanks for making this code more accessible!

nh3 commented 4 years ago

We've been running this code in production under Python 3 for months and haven't run into this issue. Could you provide a minimal data example so that we can reproduce it?

redst4r commented 3 years ago

Ran into the exact same issue today. Loading the matrix via matrix = CountMatrix.from_anndata(adata) stores the cell barcodes as bytes in matrix.bcs (like b'AAACCTGAGAAACCAT-1'). Since I wasn't aware of that, I supplied the orig_cell_bcs argument as a list of strings (e.g. 'AAACCTGAGAAACCAT-1'), which just returns None. Changing the argument to bytes works; it's just odd/unexpected behavior.

The problem is coming from here due to the bytes-dtype:

bcs = np.array(bcs, dtype='S', copy=False)
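The silent failure follows from how Python 3 compares bytes and str: they never compare equal, so the set membership test in the snippet quoted earlier matches nothing. A minimal sketch, using a plain list of bytes as a stand-in for the dtype='S' array (the barcodes and the count_matches helper are made up for illustration):

```python
# Stand-in for matrix.bcs: barcodes stored as bytes, as dtype='S' would give.
matrix_bcs = [b"AAACCTGAGAAACCAT-1", b"AAACCTGAGAAACCGC-1"]

# The same barcode supplied as str and as bytes.
orig_str = ["AAACCTGAGAAACCAT-1"]
orig_bytes = [bc.encode("ascii") for bc in orig_str]


def count_matches(orig_cell_bcs, bcs):
    """Count how many entries of bcs appear in orig_cell_bcs."""
    wanted = set(orig_cell_bcs)
    return sum(bc in wanted for bc in bcs)


n_str = count_matches(orig_str, matrix_bcs)      # 0: str never equals bytes
n_bytes = count_matches(orig_bytes, matrix_bcs)  # 1: types agree, so it matches
print(n_str, n_bytes)
```

No exception or warning is raised in the str case under default interpreter settings, which is why the mismatch only shows up as an empty result.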

Thanks for putting this package together :)

auesro commented 3 years ago

Dear @redst4r,

I am right now in the same situation as you, but probably with less knowledge. Since I am using anndata data, I am directing the question to you but of course @nh3 feel free to help me out here. So far I have:

matrix = CountMatrix.from_anndata(adata)
barcodes = adata.obs_names.values.astype(bytes)

And then:

a = find_nonambient_barcodes(
    matrix,          # Full expression matrix
    orig_cell_bcs=barcodes,   # (iterable of str): Strings of initially-called cell barcodes
    min_umi_frac_of_median=0.01,
    min_umis_nonambient=500,
    max_adj_pvalue=0.01
    )

But all I get is:

Median UMIs of initial cell calls: 1.0
Min UMIs: 500

And a is a NoneType object. Any clues would be very helpful.

Thanks

redst4r commented 3 years ago

Odd, that's pretty much what I did, i.e. converting orig_cell_bcs to bytes. Can you check whether the barcodes stored in that matrix object actually overlap with what you supply via orig_cell_bcs?
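One way to run that check is a small report of how many barcodes are shared and which types each side holds. This is a sketch only: overlap_report is a hypothetical helper, not part of the package, and it assumes both inputs are iterables of hashable barcodes.

```python
from collections import Counter


def overlap_report(orig_cell_bcs, matrix_bcs):
    """Summarize the overlap between supplied and stored barcodes."""
    orig = set(orig_cell_bcs)
    known = set(matrix_bcs)
    return {
        "n_supplied": len(orig),
        "n_in_matrix": len(known),
        "n_shared": len(orig & known),
        # Type names reveal a str/bytes mismatch at a glance.
        "supplied_types": dict(Counter(type(bc).__name__ for bc in orig)),
        "matrix_types": dict(Counter(type(bc).__name__ for bc in known)),
    }


# Made-up barcodes; with real data this might be called as
# overlap_report(barcodes, matrix.bcs).
report = overlap_report(["A-1", "C-1"], [b"A-1", b"G-1"])
print(report)
```

If n_shared is zero while the type names differ, the str/bytes mismatch is the likely cause; if the types agree and n_shared is still zero, the barcodes themselves differ (for example in their suffix).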