Open IBDgenomics opened 1 year ago
1) Glad to see the updated environment works.
2) This is an error from the script mitoSplitter_prob_info.py. It is used to test the performance of mitoSplitter and only works when the origin of each cell is known. I will modify the script to skip this step when the -g parameter is not used. The error does not affect the demultiplexing results.
3) The number of lines in mitoSplitter.clusters.txt should be the same as the number of columns in whole_all.af.txt. You can check:
3.1 How many cells are included in your whole_all.af.txt and whether the variants (each line) are distributed throughout the entire mitochondrial genome.
3.2 The base and alignment quality in the single-cell remapping bam file, since the parameters -q and -a in mitoSplitter_pipeline.sh filter out low-quality reads in the single-cell bam file.
In addition, the probability files (mitoSplitter_filter_prob0.9999999999.sample_ROC.pdf, mitoSplitter_filter_prob0.9999999999_each_perform.txt) and mitoSplitter_filter_prob_lineplot.txt are generated by mitoSplitter_prob_info.py only for performance validation, so don't worry about them.
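The consistency check in point 3 can be sketched as follows. This is a minimal sketch, not part of mitoSplitter: the helper names are hypothetical, and it assumes both files are tab-delimited text. Whether either file carries a header line or a leading row-name column is not stated here, so read the two counts with that in mind.

```python
def count_lines(path):
    """Count non-empty lines in a text file without loading it into memory."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.strip())


def count_columns(path, sep="\t"):
    """Count fields on the first line only (the matrix file may be very large)."""
    with open(path, "r") as handle:
        first = handle.readline().rstrip("\n")
    return len(first.split(sep)) if first else 0


def report(clusters_path, af_path, sep="\t"):
    """Return (lines in the clusters file, columns in the AF matrix)."""
    return count_lines(clusters_path), count_columns(af_path, sep)
```

Calling report() on mitoSplitter.clusters.txt and whole_all.af.txt (adjust the paths to your output folder) gives the two numbers to compare. A large gap between them would point to cells being dropped before clustering.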
Thanks, however I'm using a benchmark barcode file precisely to run that step. And as I said, I was able to run the example without issues.
I can try without the benchmark barcode file to test if I can get a complete output, not a truncated one. I'll post the results later.
Thanks,
I accidentally posted a message that was half written. Please make sure you see a reply with 3 main points.
Thanks for the advice. I'm still getting truncated results even without -g.
The tail of the error file, without -g:
[E::idx_find_and_load] Could not retrieve index file for 'DevkotaData_queue_numpyupdate_scikitmisc_nobench/bam_split_by_bc1/TTTGTTGGTATGCTAC-1.bam'
mixtools package, version 2.0.0, Released 2022-12-04
This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:62: UserWarning: flavor='seurat_v3' expects raw count data, but non-integers were found.
warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2829: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[:, None]
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2830: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[None, :]
/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob.py:118: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
Y_pred_prob.columnsumns = label_coding.keys()
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
warnings.warn(
Traceback (most recent call last):
File "/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob_info.py", line 22, in
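The first line of this log, "[E::idx_find_and_load] Could not retrieve index file", is the message htslib prints when it cannot find an index next to a BAM file. A minimal sketch for listing such files in the split directory follows. The function name is hypothetical, and the index suffixes checked (.bam.bai, .bai, .bam.csi) are an assumption about how the indexes would be named. Whether a missing index is related to the truncated output is not established in this thread.

```python
from pathlib import Path

# Assumed index naming conventions; adjust if your indexes are named differently.
INDEX_SUFFIXES = (".bam.bai", ".bai", ".bam.csi")


def bams_without_index(directory):
    """Return the names of *.bam files in `directory` that have no index beside them."""
    missing = []
    for bam in sorted(Path(directory).glob("*.bam")):
        stem = bam.name[: -len(".bam")]
        candidates = [bam.with_name(stem + suffix) for suffix in INDEX_SUFFIXES]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam.name)
    return missing
```

For example, bams_without_index("bam_split_by_bc1") returns a list whose length shows how many per-barcode BAM files lack an index.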
I updated mitoSplitter_pipeline.sh so that it should now skip running mitoSplitter_prob_info.py when -g is not used.
And may I ask if you have used the -p or -d parameters to filter doublets?
I'm using -d all the time. The final output folder looks like:
2762127 Nov 13 11:41 mitoSplitter.hvf_meta.scanpy.txt
14122 Nov 13 11:41 mitoSplitter.hvf.scanpy.txt
522970 Nov 13 11:41 mitoSplitter.variant_selection.pdf
6441 Nov 13 11:41 mitoSplitter.hvf2.scanpy.txt
367372 Nov 13 11:41 mitoSplitter.clusters_corr.txt
11699 Nov 13 11:41 mitoSplitter.corr_hist.pdf
220601 Nov 13 11:41 mitoSplitter.clusters_prob.txt
117923 Nov 13 11:41 mitoSplitter.clusters.txt
422071 Nov 13 11:41 mitoSplitter.tSNE.pdf
The parent folder:
320 Nov 13 10:54 bulk_af.list
2484300 Nov 13 10:54 bulk_all.af.txt
258 Nov 13 10:54 bulk_all.af.cor.txt
2833606416 Nov 13 11:00 minitagged.bam.sorted_CB
97 Nov 13 11:31 whole_all.A.alt
97 Nov 13 11:31 whole_all.C.alt
97 Nov 13 11:31 whole_all.G.alt
97 Nov 13 11:31 whole_all.T.alt
97 Nov 13 11:31 whole_all.coverage.A.alt
97 Nov 13 11:31 whole_all.coverage.C.alt
97 Nov 13 11:31 whole_all.coverage.G.alt
97 Nov 13 11:31 whole_all.coverage.T.alt
1881949017 Nov 13 11:37 whole_all.af.txt
54606 Nov 13 11:40 Favg_Gaussian_fitted_distribution.pdf
54454 Nov 13 11:40 Favg_Gaussian_fitted_singlet.list
153881 Nov 13 11:40 Favg_Gaussian_fitted_doublet.list
415 Nov 13 11:41 mitoSplitter
You can try using the -p parameter instead, which uses scrublet to remove doublets based on the gene expression profiles.
So the problem seemed to be a syntax thing (with the -g option). We can now run on real (simulated pool) data with good final results, including the -g option.
Thanks for your help
1) We followed the discussion with hernandezvargash to tweak the installation on our cluster, and now the test data runs smoothly and reproduces the output provided here.
2) With real data (~10000 cells, 4 samples), it runs almost entirely but I'm getting an error at the very end, see below.
mixtools package, version 2.0.0, Released 2022-12-04
This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:62: UserWarning: flavor='seurat_v3' expects raw count data, but non-integers were found.
warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2829: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[:, None]
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/numpy/lib/function_base.py:2830: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[None, :]
/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob.py:118: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
Y_pred_prob.columnsumns = label_coding.keys()
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
warnings.warn(
/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
warnings.warn(
Traceback (most recent call last):
File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Cluster'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/hpc/apps/mitosplitter/20230422/scripts/mitoSplitter_prob_info.py", line 26, in
count_labels = cc(true_labels['Cluster'].tolist())
File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/hpc/apps/mitosplitter/20230422/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
raise KeyError(key) from err
KeyError: 'Cluster'
3) The output is almost complete, but the cluster files seem to be truncated (~2,000 barcodes only), and the probability files are missing:
-rw-r--r-- 1 2762250 Nov 9 16:29 mitoSplitter.hvf_meta.scanpy.txt
-rw-r--r-- 1 14122 Nov 9 16:29 mitoSplitter.hvf.scanpy.txt
-rw-r--r-- 1 522942 Nov 9 16:29 mitoSplitter.variant_selection.pdf
-rw-r--r-- 1 6441 Nov 9 16:29 mitoSplitter.hvf2.scanpy.txt
-rw-r--r-- 1 367372 Nov 9 16:29 mitoSplitter.clusters_corr.txt
-rw-r--r-- 1 11699 Nov 9 16:29 mitoSplitter.corr_hist.pdf
-rw-r--r-- 1 220604 Nov 9 16:29 mitoSplitter.clusters_prob.txt
-rw-r--r-- 1 117923 Nov 9 16:29 mitoSplitter.clusters.txt
-rw-r--r-- 1 421675 Nov 9 16:29 mitoSplitter.tSNE.pdf
I'm not sure how to figure this out, hope you can help.
Thanks in advance.
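The traceback in point 2 shows mitoSplitter_prob_info.py indexing the benchmark labels with true_labels['Cluster'], so the KeyError means no column named Cluster was found in the table it loaded. A minimal sketch for checking the header of a benchmark barcode file before running with -g follows. The helper name, the tab delimiter and the presence of a header line are assumptions for illustration; the format the script actually expects is not stated in this thread.

```python
import csv


def header_has_column(path, column="Cluster", delimiter="\t"):
    """Return True if the first line of `path` contains a field named `column`."""
    with open(path, newline="") as handle:
        # Read only the header row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return column in [name.strip() for name in header]
```

If this returns False for the file passed to -g, the header (or the delimiter) is the first thing to compare against the example benchmark file that runs without issues.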