kharchenkolab / numbat

Haplotype-aware CNV analysis from single-cell RNA-seq
https://kharchenkolab.github.io/numbat/
Other
163 stars 23 forks source link

Problem with pileup_and_phase; cM values NaN for all chromosomes but Chr1 and Chr2 #78

Open dejonggr opened 1 year ago

dejonggr commented 1 year ago

Hi there,

I've been struggling to fix an error with your tool RE: missed pHF data. It seems numbat only pHF data for chromosomes 1-2 and filters all other SNPs - probably due to all cM values for others chromosomes having NaN values. The allele counts tsv itself has SNP data for all expected genes and chromosomes, but it seems cM values are not calculated after chromosome 2.

Note:

It seems like this is an incompatibility between the phasing df and the gmap dataframe but I'm not entirely sure why or when this is breaking down.

Looking to the relevant code for pileup_and_phase, I think the NaNs stem from the left join with the gmap:

 # annotate SNPs by genetic map
    marker_map = GenomicRanges::findOverlaps(
            vcf_phased %>% {GenomicRanges::GRanges(
                seqnames = .$CHROM,
                IRanges::IRanges(start = .$POS,
                    end = .$POS)
            )},
            gmap %>% {GenomicRanges::GRanges(
                seqnames = .$CHROM,
                IRanges::IRanges(start = .$start,
                    end = .$end)
            )}
        ) %>%
        as.data.frame() %>%
        setNames(c('marker_index', 'map_index')) %>%
        left_join(
            vcf_phased %>% mutate(marker_index = 1:n()) %>%
                select(marker_index, snp_id),
            by = c('marker_index')
        ) %>%
        left_join(
            gmap %>% mutate(map_index = 1:n()),
            by = c('map_index')
        ) %>%
        arrange(marker_index, -start) %>%
        distinct(marker_index, `.keep_all` = TRUE) %>%
        select(snp_id, cM)

But I don't exactly understand why they aren't merging since the VCF chromosome notation is the same between chr2/3 and chr3 isn't merging.

For context, heres my initial command:

Rscript /numbat/inst/bin/pileup_and_phase.R \
    --label $sample \
    --samples $sample \
    --bams bam_files/$sample/gex_possorted_bam.bam \
    --barcodes $sample/filtered_feature_bc_matrix/barcodes.tsv.gz \
    --outdir numbat/$block_id/$sample \
    --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
    --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir /data/1000G_hg38 \
    --ncores 6

Output from pileup and phase:

Using genome version: hg38
[I::main] start time: 2022-11-24 15:23:39
[W::check_args] Max depth set to maximum value (2147483647)
[W::check_args] Max pileup set to maximum value (2147483647)
[I::main] global settings after checking:
        num of input files = 1
        out_dir = cellranger-arc201_count_3870f30ab56fc96cb0b9cd6fa1687f7e/pileup/cellranger-arc201_count_3870f30ab56fc96cb0b9cd6fa1687f7e
        is_out_zip = 0, is_genotype = 0
        is_target = 0, num_of_pos = 0
        num_of_barcodes = 5505, num_of_samples = 0
        refseq file = (null)
        0 chroms: 
        cell-tag = CB, umi-tag = UB
        nthreads = 6, tp_max_open = 131072
        mthreads = 6, tp_errno = 0, tp_ntry = 0
        min_count = 2, min_maf = 0.00, double_gl = 0
        min_len = 30, min_mapq = 20
        rflag_filter = 772, rflag_require = 0
        max_depth = 2147483647, max_pileup = 2147483647, no_orphan = 1
[I::main] All Done!
[I::main] Version: 1.2.2 (htslib 1.16)
[I::main] CMD: cellsnp-lite -s gex_possorted_bam.bam -b barcodes.tsv.gz -O out -p 6 --minMAF 0 --minCOUNT 2 --UMItag Auto --cellTAG CB
[I::main] end time: 2022-11-24 18:32:21
[I::main] time spent: 11322 seconds.
              --> WARNING: Check REF/ALT agreement between target and ref? <--
There were 15 warnings (use warnings() to see them)
teng-gao commented 1 year ago

Interesting .. I'm not sure why this is happening but I would try to read in the VCF files and step through the genotyping function until the marker_map part:

https://github.com/kharchenkolab/numbat/blob/main/inst/bin/pileup_and_phase.R#L214-L249 https://github.com/kharchenkolab/numbat/blob/main/R/genotyping.R#L153-L293

If you are still having trouble let me know.

dejonggr commented 1 year ago

So I've run the function preprocess_allele from genotyping.R line by line using the output from pileup and phasing directories and marker_map contains the expected output (i.e. cM info for all SNPs). I'm assuming every step before this completed successfully as the results are as expected.

I'm not sure what's going on but maybe it's a problem with the gmap accessible in Singularity? The only difference is that I ran pileup_and_phase.R via singularity run and instead of using the gmap in the container, I copied it locally.