theislab / hadge

Comprehensive pipeline for donor demultiplexing in single cell
https://hadge.readthedocs.io/en/latest/
MIT License
24 stars 6 forks source link

Freemuxlet fails during popscle dsc-pileup due to different order of bam and vcf. #25

Closed bednarsky closed 1 year ago

bednarsky commented 1 year ago

Sorry for having so many questions and thank you in advance for your help!

Freemuxlet fails with this error:

NOTICE [2023/11/05 16:04:12] - Finished loading 6249 droplet/cell barcodes to consider NOTICE [2023/11/05 16:04:12] - Initializing BCF reader.. NOTICE [2023/11/05 16:04:12] - Finished identifying 0 samples to load from VCF/BCF NOTICE [2023/11/05 16:04:12] - Initializing SAM reader..

FATAL ERROR - [E:int32_t cmdCramDigitalPileup(int32_t, char**)] Your VCF/BCF files and SAM/BAM/CRAM files have di fferent ordering of chromosomes. SAM/BAM/CRAM file has GL000225.1 before GL000194.1, but VCF/BCF file has GL000225.1 after GL000194.1

terminate called after throwing an instance of 'pException' what(): Exception was thrown .command.sh: line 36: 1042790 Aborted (core dumped) popscle dsc-pileup --sam sorted .bam --tag-group CB --tag-UMI UB --vcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf --sam-verbose 10000 00 --vcf-verbose 10000 --cap-BQ 40 --min-BQ 13 --min-MQ 20 --min-TD 0 --excl-flag 3844 --group-list b arcodes.tsv.gz --min-total 0 --min-uniq 0 --min-snp 0 --out freemuxlet_391/plp/freemuxlet_out

Work dir: /projects/project_dir/resources/demultiplexing/hadge/work/6b/502461a7f29a e39aff076e44409a7d

Do you know this issue? What is your recommended solution? I tried popscle_helper_tools to reorder the vcf file, but it is currently still failing, see issue on that repo.

Also note this line Finished identifying 0 samples to load from VCF/BCF → is this usual, or does this point to an earlier error?

This issue occured for both hg38 (which matches the ref genome used for demultiplexing) and hg19 (which I used first by accident).

Really appreciate all your work!

wxicu commented 1 year ago

Hey here are my suggestions:

  1. https://github.com/statgen/popscle/issues/25#issuecomment-591292450 Maybe try to remove the config completely?
  2. I dont think Finished identifying 0 samples to load from VCF/BCF is really a problem. bcftools is trying to identify the samples from a vcf of common SNPs which don't have any sample lists. So it can not identify any sample.
  3. Could you please try to run the command
    popscle dsc-pileup --sam sorted
    .bam --tag-group CB --tag-UMI UB --vcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf --sam-verbose 10000
    00 --vcf-verbose 10000 --cap-BQ 40 --min-BQ 13 --min-MQ 20 --min-TD 0 --excl-flag 3844 --group-list b
    arcodes.tsv.gz --min-total 0 --min-uniq 0 --min-snp 0 --out freemuxlet_391/plp/freemuxlet_out

    outside the nextflow pipeline? Maybe tabix indeed solves the problem but due to the reason that hadge doesnt mount the tbi into the working directory, freemuxlet can not find this file and fails again. That could also be possible.

bednarsky commented 1 year ago

Thanks again for the fast response, amazing support! I will let you know what worked.

Zethson commented 1 year ago

Keep them coming @bednarsky ! Thank you for playing with hadge. We're happy to provide additional assistance if necessary.

wxicu commented 1 year ago

Please feel free to reopen this issue if you are still running into issues and we would be happy to help troubleshoot further.

bednarsky commented 1 year ago

I had an issue with the header and the table of vcf files using different formatting. I changed all formatting to the one in the bam files, namely ch1, chr2 etc, and then the reordering with popscle_helper_tools worked, and the process continued.

These are my commands to download and prepare the files.

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=13aebUpEKrtjliyT9rYzRijtkNJVUk5F_' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=13aebUpEKrtjliyT9rYzRijtkNJVUk5F_" -O common_variants_grch38.vcf && rm -rf /tmp/cookies.txt

# rename chromosomes to match bams, need to change header back first so that then bcftools works
sed 's/ID=chr/ID=/g' common_variants_grch38.vcf > common_variants_grch38.no_chr.vcf

# create file for renaming, then do renaming
for i in {1..22} X Y; do echo -e "$i\tchr$i"; done > rename_chrs.txt
bcftools annotate --rename-chrs rename_chrs.txt -o common_variants_grch38.vcf common_variants_grch38.no_chr.vcf

# adapt for scSplit like done in example from repo
bcftools query -f '%CHROM:%POS\n' common_variants_grch38.vcf > common_variants_grch38_list.vcf

# freemuxlet and cellsnp
wget https://master.dl.sourceforge.net/project/cellsnp/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz
gunzip genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz

# do the renaming
bcftools annotate --rename-chrs rename_chrs.txt -o genome1K.phase3.SNP_AF5e2.chr1toX.hg38.chr.vcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf

# reorder
popscle_helper_tools/sort_vcf_same_as_bam.sh \
    tmp_dir_example_bam/example_bam.bam \
     genome1K.phase3.SNP_AF5e2.chr1toX.hg38.chr.vcf > genome1K.phase3.SNP_AF5e2.chr1toX.hg38.chr.reordered.vcf

Removing the contig lines, as proposed here, https://github.com/statgen/popscle/issues/25#issuecomment-591292450, led to different downstream errors.