KeyError: 'Cluster'

IBDgenomics commented 1 year ago

1) We followed the discussion with hernandezvargash to tweak the installation on our cluster, and now the test data runs smoothly and reproduces the output provided here.

2) With real data (~10000 cells, 4 samples), it runs almost to completion, but I get an error at the very end (see below).

mixtools package, version 2.0.0, Released 2022-12-04
This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).

/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:62: UserWarning: flavor='seurat_v3' expects raw count data, but non-integers were found.
  warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2829: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2830: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]
/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob.py:118: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  Y_pred_prob.columnsumns = label_coding.keys()
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(

Traceback (most recent call last):
  File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Cluster'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob_info.py", line 26, in <module>
    count_labels = cc(true_labels['Cluster'].tolist())
  File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
    raise KeyError(key) from err
KeyError: 'Cluster'

3) The output is almost complete, but the cluster files seem to be truncated (only ~2,000 barcodes), and the probability files are missing:

-rw-r--r-- 1 2762250 Nov 9 16:29 mitoSplitter.hvf_meta.scanpy.txt
-rw-r--r-- 1 14122 Nov 9 16:29 mitoSplitter.hvf.scanpy.txt
-rw-r--r-- 1 522942 Nov 9 16:29 mitoSplitter.variant_selection.pdf
-rw-r--r-- 1 6441 Nov 9 16:29 mitoSplitter.hvf2.scanpy.txt
-rw-r--r-- 1 367372 Nov 9 16:29 mitoSplitter.clusters_corr.txt
-rw-r--r-- 1 11699 Nov 9 16:29 mitoSplitter.corr_hist.pdf
-rw-r--r-- 1 220604 Nov 9 16:29 mitoSplitter.clusters_prob.txt
-rw-r--r-- 1 117923 Nov 9 16:29 mitoSplitter.clusters.txt
-rw-r--r-- 1 421675 Nov 9 16:29 mitoSplitter.tSNE.pdf

I'm not sure how to figure this out, hope you can help.

Thanks in advance.

lnscan commented 1 year ago

1) Glad to see the updated environment works.

2) This error comes from the script mitoSplitter_prob_info.py, which is used to test the performance of mitoSplitter and only works when the origin of each cell is known. I will modify the script to skip this step when the -g parameter is not used. The error does not affect the demultiplexing results.
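
Since the traceback shows mitoSplitter_prob_info.py indexing true_labels['Cluster'], one quick check before rerunning with -g is whether the benchmark barcode file actually has a header column named exactly Cluster. The following is a minimal sketch, not part of mitoSplitter: it assumes the file is tab-separated with a header row, and the helper names header_columns and has_column are hypothetical.

```python
import csv


def header_columns(path, delimiter="\t"):
    """Return the fields of the first row of a delimited text file."""
    with open(path, newline="") as handle:
        # next(..., []) yields an empty header for an empty file.
        return next(csv.reader(handle, delimiter=delimiter), [])


def has_column(path, column="Cluster", delimiter="\t"):
    """True only if `column` appears verbatim (case-sensitive) in the header."""
    return column in header_columns(path, delimiter)
```

The match is deliberately exact, because pandas column lookup is also case- and whitespace-sensitive: a header such as cluster or "Cluster " would still raise the same KeyError.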

3) The number of lines in mitoSplitter.clusters.txt should be the same as the number of columns in whole_all.af.txt. You can check:

3.1 How many cells are included in your whole_all.af.txt, and whether the variants (one per line) are distributed throughout the entire mitochondrial genome.

3.2 The base and alignment quality in the single-cell remapping BAM file, since the -q and -a parameters in mitoSplitter_pipeline.sh filter out low-quality reads from the single-cell BAM file.

In addition, the probability files (mitoSplitter_filter_prob0.9999999999.sample_ROC.pdf, mitoSplitter_filter_prob0.9999999999_each_perform.txt) and mitoSplitter_filter_prob_lineplot.txt are generated by mitoSplitter_prob_info.py only for performance validation, so don't worry about them.
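
The comparison in point 3 can be sketched as follows. This is only an illustration under assumptions: it treats the first line of whole_all.af.txt as a whitespace-delimited header, and the helper names are hypothetical. Depending on whether the files carry a header line or a row-name column, the two counts may legitimately differ by one.

```python
def count_header_fields(path, delimiter=None):
    """Count fields on the first line only, so a multi-gigabyte matrix is not read in full."""
    with open(path) as handle:
        first_line = handle.readline()
    # delimiter=None splits on any run of whitespace (tabs or spaces).
    return len(first_line.rstrip("\n").split(delimiter))


def count_lines(path):
    """Count lines by streaming the file."""
    with open(path) as handle:
        return sum(1 for _ in handle)
```

For example, count_header_fields("whole_all.af.txt") can be compared against count_lines("mitoSplitter/mitoSplitter.clusters.txt"); the paths here simply follow the file names mentioned in this thread.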

IBDgenomics commented 1 year ago

Thanks. However, I'm using a benchmark barcode file precisely so that this step can run. And as I said, I was able to run the example without issues.

I can try without the benchmark barcode file to test whether I get a complete output rather than a truncated one. I'll post the results later.

Thanks,

lnscan commented 1 year ago

I accidentally posted a message that was half written. Please make sure you see a reply with 3 main points.

IBDgenomics commented 1 year ago

Thanks for the advice. I'm still getting truncated results even without -g.

whole_all.af.txt has close to 11,000 columns, but mitoSplitter.clusters.txt has only 2867 lines.
I'm working with 4 samples; the BAM file contains ~40M alignments, ~30M of which have alignment quality > 20.
The variants in whole_all.af.txt cover the whole MT genome (~10K bp are represented for each of A, G, C and T).

The tail of the error file, without -g:

[E::idx_find_and_load] Could not retrieve index file for 'DevkotaData_queue_numpyupdate_scikitmisc_nobench/bam_split_by_bc1/TTTGTTGGTATGCTAC-1.bam'
mixtools package, version 2.0.0, Released 2022-12-04
This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).

/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:62: UserWarning: flavor='seurat_v3' expects raw count data, but non-integers were found.
  warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2829: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2830: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]
/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob.py:118: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  Y_pred_prob.columnsumns = label_coding.keys()
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
Traceback (most recent call last):
  File "/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob_info.py", line 22, in <module>
    prefix = sys.argv[2]
IndexError: list index out of range

lnscan commented 1 year ago

I updated mitoSplitter_pipeline.sh so that it should skip mitoSplitter_prob_info.py when -g is not given. May I also ask whether you used the -p or -d parameter to filter doublets?
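
The IndexError in the log above is raised when the script reads sys.argv[2] although fewer than three entries were passed. Besides skipping the call in the pipeline, a script can also guard the lookup itself. The sketch below only illustrates that pattern; get_prefix is a hypothetical helper and is not the change made in the repository.

```python
import sys


def get_prefix(argv):
    """Return argv[2] when it exists, otherwise None so the caller can skip validation."""
    return argv[2] if len(argv) > 2 else None


def main(argv):
    prefix = get_prefix(argv)
    if prefix is None:
        # No prefix was supplied: report it and exit cleanly instead of raising IndexError.
        print("No prefix argument given; skipping performance validation.", file=sys.stderr)
        return 0
    print(f"Would validate outputs with prefix {prefix}")
    return 0
```

In a real script this would be invoked as main(sys.argv).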

IBDgenomics commented 1 year ago

I'm using -d all the time. The final output folder looks like this:

2762127 Nov 13 11:41 mitoSplitter.hvf_meta.scanpy.txt
14122 Nov 13 11:41 mitoSplitter.hvf.scanpy.txt
522970 Nov 13 11:41 mitoSplitter.variant_selection.pdf
6441 Nov 13 11:41 mitoSplitter.hvf2.scanpy.txt
367372 Nov 13 11:41 mitoSplitter.clusters_corr.txt
11699 Nov 13 11:41 mitoSplitter.corr_hist.pdf
220601 Nov 13 11:41 mitoSplitter.clusters_prob.txt
117923 Nov 13 11:41 mitoSplitter.clusters.txt
422071 Nov 13 11:41 mitoSplitter.tSNE.pdf

The parent folder:

320 Nov 13 10:54 bulk_af.list
2484300 Nov 13 10:54 bulk_all.af.txt
258 Nov 13 10:54 bulk_all.af.cor.txt
2833606416 Nov 13 11:00 minitagged.bam.sorted_CB
97 Nov 13 11:31 whole_all.A.alt
97 Nov 13 11:31 whole_all.C.alt
97 Nov 13 11:31 whole_all.G.alt
97 Nov 13 11:31 whole_all.T.alt
97 Nov 13 11:31 whole_all.coverage.A.alt
97 Nov 13 11:31 whole_all.coverage.C.alt
97 Nov 13 11:31 whole_all.coverage.G.alt
97 Nov 13 11:31 whole_all.coverage.T.alt
1881949017 Nov 13 11:37 whole_all.af.txt
54606 Nov 13 11:40 Favg_Gaussian_fitted_distribution.pdf
54454 Nov 13 11:40 Favg_Gaussian_fitted_singlet.list
153881 Nov 13 11:40 Favg_Gaussian_fitted_doublet.list
415 Nov 13 11:41 mitoSplitter

lnscan commented 1 year ago

You can try using the -p parameter instead, which uses scrublet to remove doublets based on the gene expression profiles.

IBDgenomics commented 1 year ago

So the problem turned out to be a syntax issue with the -g option. We can now run on real (simulated pool) data with good final results, including with the -g option.

The truncated output was not actually a problem. What happened is that the Favg Gaussian fitted model detects a lot of doublets in this simulated pool (reasons unknown, it might be dataset-dependent), and only singlets are fed into the demultiplexing step. So we will check with other libraries.
We have not tried scrublet at this point, because we generated this pool in silico. For -p we would also need to fabricate a filtered file set that mimics cellranger's output.
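
The file sizes in the parent-folder listing above are consistent with this explanation. A rough check, assuming each line of the singlet and doublet lists is a single 18-character barcode such as TTTGTTGGTATGCTAC-1 followed by a newline (the list format is an assumption; only the byte sizes come from the listing):

```python
BARCODE = "TTTGTTGGTATGCTAC-1"      # example barcode taken from the error log
BYTES_PER_LINE = len(BARCODE) + 1   # 18 characters plus a newline = 19 bytes

SINGLET_LIST_BYTES = 54454          # Favg_Gaussian_fitted_singlet.list
DOUBLET_LIST_BYTES = 153881         # Favg_Gaussian_fitted_doublet.list


def estimate_lines(size_bytes, bytes_per_line=BYTES_PER_LINE):
    """Return (line count, leftover bytes) for a file of fixed-width lines."""
    return divmod(size_bytes, bytes_per_line)


singlets, singlet_rest = estimate_lines(SINGLET_LIST_BYTES)
doublets, doublet_rest = estimate_lines(DOUBLET_LIST_BYTES)
print(singlets, doublets, singlets + doublets)  # 2866 8099 10965
```

Both sizes divide evenly by 19, giving about 2,866 singlets and 8,099 doublets, 10,965 barcodes in total. That lines up with the close to 11,000 columns reported for whole_all.af.txt and with the 2867 reported for mitoSplitter.clusters.txt, which is one more than 2,866, as a header line would give.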

Thanks for your help

lnscan / mitoSplitter

KeyError: 'Cluster' #2