single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0
73 stars 27 forks source link

Error when running vireo mode 2 with option -N 1 #83

Open calebthomas259 opened 1 year ago

calebthomas259 commented 1 year ago

Hello vireo developers,

I have a small issue to report, please see the details below:

Background We're performing pooled single nucleus RNA-seq experiments, and are planning to demultiplex the samples with vireo mode 2 using donor-specific SNPs (obtained with shallow WGS for each donor). For testing purposes, we also sequenced some of the samples as individual snRNA-seq libraries (i.e. no pooling).

As a bioinformatics sanity check, we wanted to try running vireo in mode 2 on one of the unpooled libraries, using a VCF file that has genotypes for multiple donors, while specifying -N 1 in the vireo command. E.g. the library has donor A nuclei only, but the VCF file has genotypes for donors A, B, and C.

We were hoping that vireo would correctly identify that the sample is from donor A, and assign all of the nuclei to donor A (or at least the majority of nuclei assigned to donor A, and maybe a few unassigned). But unfortunately we get some error messages instead.

How to reproduce the error When we run this command on our data:

vireo --randSeed=1234 --nproc=14 \
-d "$three_donor_genotypes_vcf" -c "$dir_cellsnp_pool" \
-N 1 \
-o 'temp_vireo'

It gives the following error message (I've shortened some of the filepaths):

[vireo] Loading cell folder ...
[vireo] Loading donor VCF file ...
[vireo] 158953 out 197549 variants matched to donor VCF
[vireo] Demultiplex 27219 cells to 1 donors with 158953 variants.
[vireo] lower bound ranges [-7665415.0, -7665415.0, -7665415.0]
[vireo] allelic rate mean and concentrations:
[[0.4   0.451 0.534]]
[[1981897.3 5929198.2 4157211.5]]
[vireo] donor size before removing doublets:
donor0
27219
Traceback (most recent call last):
  File ".../envs/cellsnp_vireo/bin/vireo", line 8, in <module>
    sys.exit(main())
  File ".../vireoSNP/vireo.py", line 204, in main
    res_vireo = vireo_wrap(cell_dat['AD'], cell_dat['DP'], n_donor=n_donor,
  File ".../vireoSNP/utils/vireo_wrap.py", line 152, in vireo_wrap
    doublet_prob, ID_prob, doublet_LLR = predict_doublet(modelCA, AD, DP)
  File ".../vireoSNP/utils/vireo_doublet.py", line 39, in predict_doublet
    GT_both = add_doublet_GT(vobj.GT_prob)
  File ".../vireoSNP/utils/vireo_doublet.py", line 118, in add_doublet_GT
    s_idx1 = sp_idx[:, 0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

As a workaround, I tried running with doublet detection disabled:

vireo --randSeed=1234 --nproc=14 \
-d "$three_donor_genotypes_vcf" -c "$dir_cellsnp_pool" \
-N 1 \
--noDoublet \
-o 'temp_vireo2'

But this results in a new error message:

[vireo] Loading cell folder ...
[vireo] Loading donor VCF file ...
[vireo] 158953 out 197549 variants matched to donor VCF
[vireo] Demultiplex 27219 cells to 1 donors with 158953 variants.
[vireo] lower bound ranges [-7665415.0, -7665415.0, -7665415.0]
[vireo] allelic rate mean and concentrations:
[[0.4   0.451 0.534]]
[[1981897.3 5929198.2 4157211.5]]
[vireo] donor size before removing doublets:
donor0
27219
Traceback (most recent call last):
  File ".../envs/cellsnp_vireo/bin/vireo", line 8, in <module>
    sys.exit(main())
  File ".../envs/cellsnp_vireo/lib/python3.9/site-packages/vireoSNP/vireo.py", line 217, in main
    write_donor_id(out_dir, donor_names, cell_dat['samples'], n_vars, res_vireo)
  File ".../envs/cellsnp_vireo/lib/python3.9/site-packages/vireoSNP/utils/io_utils.py", line 98, in write_donor_id
    prob_doublet_out = np.max(doublet_prob, axis=1)
  File "<__array_function__ internals>", line 5, in amax
  File ".../envs/cellsnp_vireo/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 2754, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File ".../envs/cellsnp_vireo/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

Expected result The command runs successfully without errors, and assigns the cells to donor A

Software versions (installed via conda/pip)

Is it possible to do this with vireo? I was hoping you could please advise of a potential solution or a workaround, if any exist?

Thank you very much for the help!

ghuls commented 1 year ago

Giving -N 1 doesn't make much sense as it would be the same as making no groups. Better use -N 3 (or omit it at all). If you have 1 sample only and you have 3 genotypes in your VCF, vireo should hopefully assign all/most cells to only one of your samples.

vireo --randSeed=1234 --nproc=14 \
-d "$three_donor_genotypes_vcf" -c "$dir_cellsnp_pool" \
-N 3 \
-o 'temp_vireo'