jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License
39 stars 9 forks source link

Problem building allele count matrices #7

Closed ccruizm closed 4 years ago

ccruizm commented 4 years ago

Good day!

I have given a try to your tools but have found some issues. When I run the command to build the allele count matrix after obtaining the SNV from freebayes, I have a wrong output and some errors.

At the beginning of the pipeline:

Num Pos: 13814 , Num barcodes: 5303
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var

and at the end:

/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/cluster/k_means_.py:972: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
  return_n_iter=True)
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/cluster/k_means_.py:972: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
  return_n_iter=True)
/home/cruiz/anaconda3/envs/scSplit/lib/python3.7/site-packages/scSplit/scSplit:264: FutureWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  alt_or_ref = alt_or_ref.ix[[x for x in alt_or_ref.index if x[0] not in ['X','Y','MT']]]

It generates empty files (alt and ref_filtered) and other matrices. These files are from libraries that were shallowed sequenced (no more than 30M reads and about 6K cells per lib). Do you think that might be also an issue for the algorithm to work properly?

Thanks in advance for your help!

jon-xu commented 4 years ago

Hi Cristian,

Thanks for your interest in our tool!

Reads per cell = 5K should be ok for scSplit - we’ve tested on even shallower data.

As to your problem, could you please show me the commands you ran? It seems you ran “scSplit run” as well.

For “scSplit count”, one possible reason that the matrices are not built properly could be the barcode tags in your BAM file - the default is CB:Z. Could you please have a check?

Otherwise, please send me the scSplit.log and also a glimpse into your vcf file as well as BAM file would be helpful for me to understand what’s going on.

Cheers

ccruizm commented 4 years ago

Thanks for your quick reply @jon-xu

This is the bash script I am running:

#!/bin/bash

export PATH=/home/cruiz/10x_scRNAseq/Cell-ranger-3.0.2/subset-bam-1.0-x86_64-linux:$PATH

subset-bam --bam /home/cruiz/possorted_genome_bam.bam \
           --cell-barcodes /home/cruiz/barcodes.tsv \ # from cellreanger output (filtered)
           --out-bam filtered.bam

samtools markdup filtered.bam markdup.bam
samtools index markdup.bam

freebayes -f /home/cruiz/10x_scRNAseq/Cell-ranger-3.0.2/refdata-cellranger-hg19-3.0.0/fasta/genome.fa -iXu -C 2 -q 1 markdup.bam > snv.vcf

/home/cruiz/anaconda3/envs/scSplit/lib/python3.7/site-packages/scSplit/scSplit count -v snv.vcf \
              -i markdup.bam \
              -b /home/cruiz/barcodes.tsv \
              -c /home/cruiz/10x_scRNAseq/markers-and-databases/common_snvs_hg19 \
              -r ref_filtered.csv \
              -a alt_filtered.csv

/home/cruiz/anaconda3/envs/scSplit/lib/python3.7/site-packages/scSplit/scSplit run -r ref_filtered.csv \
            -a alt_filtered.csv \
            -n 2 # I have only two different samples in one lib

The BAM file contains the CB:Z tag

NS500173:544:HNLWYBGXC:2:12205:15060:6104   272 1   12030   0   56M *   0   0   GCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAG    EEEEEEEEEEEEEEEEE/EEEEEE/EEEEEEAEEEEAEEEE/EEEEEEEAEAAAAA    NH:i:6  HI:i:2  AS:i:55 nM:i:0  RE:A:I  li:i:0  BC:Z:CAGCCACT   QT:Z:AAAAA6EECR:Z:TACCCGTGTTGCTTGA  CY:Z:AAAAAEEEEEEEEEEE   CB:Z:TACCCGTGTTGCTTGA-1 UR:Z:AGAAGTAAATGC   UY:Z:EEEEEEEEEEEE   UB:Z:AGAAGTAAATGC   RG:Z:A1-10x_2366-2069:0:1:HNLWYBGXC:2
NS500173:544:HNLWYBGXC:2:21211:15528:13523  272 1   12035   0   56M *   0   0   CCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCC    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA    NH:i:6  HI:i:5  AS:i:53 nM:i:1  RE:A:I  li:i:0  BC:Z:TGATTCTA   QT:Z:AAAAAEEECR:Z:TACCCGTGTTAGGGAC  CY:Z:AAAAAEEEEEEEEEEE   CB:Z:TACCCGTGTTAGGGAC-1 UR:Z:GCGTAATTATAC   UY:Z:EEEEEEEEAEEE   UB:Z:GCGTAATTATAC   RG:Z:A1-10x_2366-2069:0:1:HNLWYBGXC:2

I do not know where the error is coming from. I hope we can find it.

Thanks!

snv_overview.txt

scSplit.log

jon-xu commented 4 years ago

Hi Cristian,

Thanks for providing the detail information!

Your bash script and VCF look fine to me.

The only thing I can notice here is the BAM reads, I am not sure whether all your reads have a FLAG > 256 - in the example you have FLAG=272. ScSplit filtered out reads with FLAG > 256 so that we only focus on the valid ones.

But based on the log file you sent me the matrices should be built and the problem happened during the initialisation of the model in "scSplit run".

Would it be possible to send me your ref_filtered.csv and alt_filtered.csv, please?

Cheers

jon-xu commented 4 years ago

If you are using an old version (<= v0.8.2), empty matrices are allowed to be built. And if you have empty matrices, please check the BAM file according to my notes above. Otherwise, please send the matrices to me for a check. Cheers.

ccruizm commented 4 years ago

Hello @jon-xu

I checked and you were right. Many reads in the BAM file were >256. I included in the script, after subset-bam some of the lines of code you have in the explanation of the tool:

samtools view -S -b -q 10 -F 3844 filtered_pre.bam > filtered.bam
samtools markdup filtered.bam markdup_pre.bam
samtools sort -o markdup.bam markdup_pre.bam
samtools index markdup.bam

After filtering, the BAM file is half the size and reads have FLAG <256 (mostly 16 and 0). Then I used the markdup.bam as input for freebayes using the same script I sent before. But still, I have the same problem. I am using the v1.0.0.

I had an additional message to the ones i shared before:

/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/cluster/k_means_.py:972: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
  return_n_iter=True)

Attached the output files (using scSplit run -n 0 or 2). Thanks again for your help! matrices_n2.zip matrices_n0.zip

jon-xu commented 4 years ago

Hi Cristian,

Thanks for the information!

This looks a bit mysterious to me: First, scSplit wouldn’t generate the matrices if they are empty, it should raise a value error. Second, I don’t know why the matrices are empty, if the tag is correct.

Is there any possibility that you can share your BAM and vcf with me, please? I would be able to look into it then.

Cheers

On 19 11 2019, at 01:15, Cristian notifications@github.com<mailto:notifications@github.com> wrote:

Hello @jon-xuhttps://github.com/jon-xu

I checked and you were right. Many reads in the BAM file were >256. I included in the script, after subset-bam some of the lines of code you have in the explanation of the tool:

samtools view -S -b -q 10 -F 3844 filtered_pre.bam > filtered.bam samtools markdup filtered.bam markdup_pre.bam samtools sort -o markdup.bam markdup_pre.bam samtools index markdup.bam

After filtering, the BAM file is half the size and reads have FLAG <256 (mostly 16 and 0). Then I used the markdup.bam as input for freebayes using the same script I sent before. But still, I have the same problem. I am using the v1.0.0.

I had an additional message to the ones i shared before:

/home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide explained_varianceratio = explainedvariance / total_var /home/cruiz/anaconda3/envs/scSplit/lib/python3.6/site-packages/sklearn/cluster/kmeans.py:972: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X. return_n_iter=True)

Attached the output files (using scSplit run -n 0 or 2). Thanks again for your help! matrices_n2.ziphttps://github.com/jon-xu/scSplit/files/3859254/matrices_n2.zip matrices_n0.ziphttps://github.com/jon-xu/scSplit/files/3859255/matrices_n0.zip

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/jon-xu/scSplit/issues/7?email_source=notifications&email_token=AJXMHDAB3HLITTOAWHWDMWTQUKWP3A5CNFSM4JNJCYCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEKYRDQ#issuecomment-555059342, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJXMHDHFFSIIDR63B3K66MDQUKWP3ANCNFSM4JNJCYCA.

jon-xu commented 4 years ago

Another thing you might want to check is that the barcodes in the file you specify for scSplit count should exist in the BAM reads, otherwise no reads will be recorded, as the barcode file is a white list to make sure we don't get those reads with sequencing error on barcodes.

However, this still doesn't explain why the empty matrices will be output, which is disabled in scSplit. Please send me your bam, vcf and barcodes and I'll help to check.

Cheers.

ccruizm commented 4 years ago

That would be great @jon-xu !

How may I share the files with you? Thanks!

jon-xu commented 4 years ago

Hi Cristian,

Please let me know your email address and I can setup something for you.

Cheers

ccruizm commented 4 years ago

Sure! ccruizm.mv@gmail.com

Thanks!

jon-xu commented 4 years ago

Hi Cristian,

Thanks for sharing your data files with me.

I had a look and got below results for your reference:

1) I ran "python3 scSplit count -i markdup.bam -b barcodes.tsv -v snv.vcf -r ref_filtered.csv -a alt_filtered.csv" and got the ref/alt count matrices, and they are not empty, although very low in counts, e.g., only 96536 sum of counts in ref_filtered.csv. 2) I checked back to markdup.bam and found the valid reads (according to -F 3844) are very limited, just 1000+ reads/cell. 3) I ran "python3 scSplit run -r ref_filtered.csv -a alt_filtered.csv -n 2 -d 0", and I got demultiplexing results (as attached).

results.zip

I am not sure what caused the different results for matrices building and demultiplexing between us, but it would be good if you could cleanse your working directory, download the latest scSplit release and try it again.

Furthermore, considering the fact that your BAM contains a lot of low quality reads (FLAG > 256) and the total valid read count is too low (normally between 3000 to 30000+ reads/cell), I suggest you try to find a way to improve your read quality.

Hope it helps.

Cheers.

ccruizm commented 4 years ago

Hello @jon-xu

Thank you for your help. I have created a fresh conda env with the last version of scSplit (last time I had v1.0.0). Below the packages in the conda env:

# packages in environment at /home/cruiz/anaconda3/envs/sc-split:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    conda-forge
bzip2                     1.0.8                h516909a_1    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
certifi                   2019.11.28               py37_0    conda-forge
curl                      7.65.3               hf8cf82a_0    conda-forge
freebayes                 1.3.1            py37h56106d0_0    bioconda
htslib                    1.9                  h244ad75_9    bioconda
joblib                    0.14.0                     py_0    conda-forge
krb5                      1.16.3            h05b26f9_1001    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libcurl                   7.65.3               hda55be3_0    conda-forge
libdeflate                1.3                  h516909a_0    conda-forge
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.2.0                hdf63c60_0    conda-forge
libgfortran-ng            7.3.0                hdf63c60_2    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h6e990d7_3    conda-forge
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.17.3           py37h95a1406_0    conda-forge
openssl                   1.1.1d               h516909a_0    conda-forge
pandas                    0.25.3           py37hb3f55d8_0    conda-forge
parallel                  20160622                      1    bioconda
perl                      5.26.2            h516909a_1006    conda-forge
perl-threaded             5.26.0                        0    bioconda
pip                       19.3.1                   py37_0    conda-forge
pysam                     0.15.3           py37hbcae180_3    bioconda
python                    3.7.3                h33d41f4_1    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
pytz                      2019.3                     py_0    conda-forge
pyvcf                     0.6.8                 py37_1000    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
samtools                  1.9                 h10a08f8_12    bioconda
scikit-learn              0.21.3           py37hcdab131_0    conda-forge
scipy                     1.3.2            py37h921218d_0    conda-forge
scsplit                   1.0.2                    pypi_0    pypi
setuptools                42.0.2                   py37_0    conda-forge
six                       1.13.0                   py37_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
wheel                     0.33.6                   py37_0    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

I ran the pipeline again and now I am having a different issue:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 135007858) > this->size() (which is 135006516)
./scsplit.sh: line 15: 51990 Aborted                 freebayes -f /home/cruiz/10x_scRNAseq/Cell-ranger-3.0.2/refdata-cellranger-hg19-3.0.0/fasta/genome.fa -iXu -C 2 -q 1 markdup.bam > snv.vcf
Num Pos: 34 , Num barcodes: 5569
/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/sklearn/decomposition/pca.py:447: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ratio_ = explained_variance_ / total_var
/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/sklearn/cluster/k_means_.py:972: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (2). Possibly due to duplicate points in X.
  return_n_iter=True)
Traceback (most recent call last):
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/scSplit/scSplit", line 642, in <module>
    scSplit()
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/scSplit/scSplit", line 354, in __init__
    getattr(self, args.command)()
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/scSplit/scSplit", line 525, in run
    model.distinguishing_alleles(pos)
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/scSplit/scSplit", line 278, in distinguishing_alleles
    alt_or_ref = ((N_alt_mtx >= 10) * 1 - ((N_ref_mtx >= 10) & (N_alt_mtx == 0)) * 1).drop(self.doublet, axis=1).astype(np.int8)
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/pandas/core/frame.py", line 4117, in drop
    errors=errors,
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/pandas/core/generic.py", line 3914, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/pandas/core/generic.py", line 3946, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/home/cruiz/anaconda3/envs/sc-split/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 5340, in drop
    raise KeyError("{} not found in axis".format(labels[mask]))
KeyError: '[-1] not found in axis'

I used a library that was sequenced deeper and would expect a higher number of valid read counts as you suggested before. May I send you the files I used this time, so you can have a look at them, please?

Thanks again

jon-xu commented 4 years ago

Hi Cristian,

Thanks for sharing the information!

scSplit does not require a very deep sequencing read - we tested on 3,000 to 30,000 rpc, which should be sufficient for most normal sequencing.

The first error was from your freebayes call? Maybe you wanna check your reference genome, rebuild your FASTQ index, or check for other workarounds, e.g. github.com › ekg › freebayes › issues › 205improper handling of Ns in alignment · Issue #205 · ekg …https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjV5dXe0JfmAhWFF3IKHRT8De4QFjAAegQIARAH&url=https%3A%2F%2Fgithub.com%2Fekg%2Ffreebayes%2Fissues%2F205&usg=AOvVaw1pfkRva7-kdi1KkGzsHnks

I see you’ve only got 34 variants from your freebayes call, this might be too less for a proper scSplit run unless they are distinguishing variants - usually you are looking for something between 10,000 to 50,000 SNP variants.

The last error looks weird to me, it seems you are running "scSplit run" with "-d 0” maybe with a version with a bug which was fixed in v1.0.2? Could you please share with me your currently using scSplit file (in your package folder) and the command you ran?

Cheers, Jon