statgen / popscle

A suite of population scale analysis tools for single-cell genomics data including implementation of Demuxlet / Freemuxlet methods and auxilary tools
https://github.com/statgen/popscle/wiki
Apache License 2.0
43 stars 16 forks source link

missing barcodes in freemuxlet output #40

Open lb15 opened 3 years ago

lb15 commented 3 years ago

Hi,

I'm using dsc-pileup and freemuxlet to identify donors/doublets of a two-donor mixture. I pre-filter the bam file using the popscle_helper_tools (https://github.com/aertslab/popscle_helper_tools) linked in other github issues. I find that for some of my samples, the output from freemuxlet (.best file) does not contain all the barcodes in the original barcodes.tsv file. I can see in the log file that all the barcodes are read into dsc-pileup, but slightly fewer are read into freemuxlet. I think they are being filtered in dsc-pileup, but can't find any info about why. Is this expected behavior and is there a way to track why these cells are removed?

Thanks!

Here are some relevant snippets of the log file:

The following parameters are available. Ones with "[]" are in effect:
           Options for input SAM/BAM/CRAM : --sam [A_4dpi_MOCK_1_possorted_genome_filtered_bam.bam],
                                            --tag-group [CB], --tag-UMI [UB]
                Options for input VCF/BCF : --vcf [ucsc.hg38.liftover.out.nochr.c1_22.nohbb_renamed.vcf.gz],
                                            --sm, --sm-list
                           Output Options : --out [A_4dpi_MOCK_1_pileup_filteredbam],
                                            --sam-verbose [1000000],
                                            --vcf-verbose [10000], --skip-umi
   SNP-overlapping Read filtering Options : --cap-BQ [40], --min-BQ [13],
                                            --min-MQ [20], --min-TD,
                                            --excl-flag [3844]
           Cell/droplet filtering options : --group-list [barcodes.tsv],
                                            --min-total, --min-uniq, --min-snp

Run with --help for more detailed help messages of each argument.

NOTICE [2021/02/08 17:29:21] - Finished loading 16695 droplet/cell barcodes to consider
NOTICE [2021/02/08 17:29:21] - Initializing BCF reader..
NOTICE [2021/02/08 17:29:21] - Finished identifying 0 samples to load from VCF/BCF
NOTICE [2021/02/08 17:29:21] - Initializing SAM reader..
NOTICE [2021/02/08 17:29:21] - Reading 0 reads at chr1:1 and skipping 0
NOTICE [2021/02/08 17:30:36] - Reading 1000000 reads at chr1:6197681 and skipping 482659
NOTICE [2021/02/08 17:31:19] - WARNING: Cannot find UMI tag UB from 1981277-th read A00111:625:HT7HLDSXY:3:2666:19361:10160 at chr1:15734312-1573443
3. Treating all of them as a single UMI

NOTICE [2021/02/08 22:45:32] - Finished reading 216797 markers from the VCF file
NOTICE [2021/02/08 22:45:32] - Total number input reads : 395101280
NOTICE [2021/02/08 22:45:32] - Total number of read-QC-passed reads : 206083344 
NOTICE [2021/02/08 22:45:32] - Total number of skipped reads with ignored barcodes : 0
NOTICE [2021/02/08 22:45:32] - Total number of non-skipped reads with considered barcodes : 206083344
NOTICE [2021/02/08 22:45:32] - Total number of gapped/noninformative reads : 106047105
NOTICE [2021/02/08 22:45:32] - Total number of base-QC-failed reads : 2583351
NOTICE [2021/02/08 22:45:32] - Total number of redundant reads : 6992853
NOTICE [2021/02/08 22:45:32] - Total number of pass-filtered reads : 90460035
NOTICE [2021/02/08 22:45:32] - Total number of pass-filtered reads overlapping with multiple SNPs : 15008085
NOTICE [2021/02/08 22:45:32] - Starting to prune out cells with too few reads...
NOTICE [2021/02/08 22:45:32] - Finishing pruning out 0 cells with too few reads...
NOTICE [2021/02/08 22:45:32] - Writing cell information
NOTICE [2021/02/08 22:45:32] - Finished writing cell information
NOTICE [2021/02/08 22:45:32] - Build indices for sparse matrix
NOTICE [2021/02/08 23:04:51] - Finished builing indices for sparse matrix
Finished dsc-pileup

and reading into freemuxlet. Can see that it is processing 16,616 droplets instead of the 16,695 that go into dsc-pileup.

The following parameters are available. Ones with "[]" are in effect:
            Options for input pileup : --plp [A_4dpi_MOCK_1_pileup_filteredbam],
                                       --init-cluster
                      Output Options : --out [A_4dpi_MOCK_1_freemuxlet],
                                       --nsample [2], --aux-files,
                                       --verbose [100]
   Options for statistical inference : --doublet-prior [0.50],
                                       --geno-error [0.10], --bf-thres [5.41],
                                       --frac-init-clust [1.00],
                                       --iter-init [10], --keep-init-missing,
                                       --randomize-singlet-score, --seed
              Read filtering Options : --cap-BQ [20], --min-BQ [13]
      Cell/droplet filtering options : --group-list, --min-total, --min-umi,
                                       --min-snp

Run with --help for more detailed help messages of each argument.

NOTICE [2021/02/08 23:13:21] - Loading pileup information with prefix A_4dpi_MOCK_1_pileup_filteredbam
NOTICE [2021/02/08 23:13:21] - Reading barcode information from A_4dpi_MOCK_1_pileup_filteredbam.cel.gz..
NOTICE [2021/02/08 23:13:21] - Finished loading 16616 droplets, skipping 0
NOTICE [2021/02/08 23:13:21] - Reading variant information from A_4dpi_MOCK_1_pileup_filteredbam.var.gz..
NOTICE [2021/02/08 23:13:21] - Finished loading 216797 variants..
NOTICE [2021/02/08 23:13:21] - Reading pileup information from A_4dpi_MOCK_1_pileup_filteredbam.plp.gz..
NOTICE [2021/02/08 23:15:31] - Finished loading 111986528 UMIs in total..
ShaiberAlon commented 1 year ago

I am running into the same issue.

@lb15 , did you ever get to the bottom of this?

@Griffan , @hyunminkang , thank you for developing this tool! Do you have any insight on this issue?

hyunminkang commented 1 year ago

Maybe the droplets have no reads observed overlapping with SNPs?

valentinaOpazo commented 2 weeks ago

@ShaiberAlon @lb15 Did you find a way to solve it? I'm into the same trouble