statgen / popscle

A suite of population-scale analysis tools for single-cell genomics data, including implementations of the Demuxlet / Freemuxlet methods and auxiliary tools
https://github.com/statgen/popscle/wiki
Apache License 2.0

dsc-pileup running progressively slower for a single sample #22

Open xAZx opened 4 years ago

xAZx commented 4 years ago

Hi All,

Happy New Year! I am having an issue similar to the one in the other active thread about dsc-pileup running slowly. It starts off fine, then for some reason gets gradually slower:

https://files.slack.com/files-pri/T02SU7LHA-FS739HWTE/image_from_ios.jpg

As you can see, the time between successive log lines keeps getting longer. This example has only been running for a short while, but when I tried it last week it ran for about 5 days without finishing. Sometimes there are about 5 hours between lines! I'm not sure why it gets progressively slower as the program runs. Just as an FYI, I have run demuxlet on the same Linux server on all the same samples in the past with no issues. I am trying to run dsc-pileup now so that I can run freemuxlet afterwards. Many thanks in advance for all your help.

xAZx commented 4 years ago

Some additional info: I am running on a Linux server with 64 GB RAM, and this is my command:

~/popscle/bin/popscle dsc-pileup --sam /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/possorted_genome_bam.bam --vcf ~/ALL_SingleCell_data/Tocilizumab/080919-TCZ-genotyping/ucsc.hg38.liftover.out.nochr.vcf --group-list /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/filtered_feature_bc_matrix/barcodes.tsv --out /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/pileup

hyunminkang commented 4 years ago

I think it is hitting the memory limit, and thrashing seems to be happening. I cannot see the image in the link to confirm. Can you send it?

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu
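
One way to check the memory hypothesis above is to watch the resident memory of the running process over time. The following is a minimal sketch, assuming a Linux system with `/proc` mounted; `rss_kib` is a hypothetical helper name and is not part of popscle.

```shell
# Hypothetical helper: print the resident set size (in kB) of a process.
# Assumes Linux, where /proc/<pid>/status contains a "VmRSS:" line.
rss_kib() {
    # $1: process ID
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Example: resident memory of the current shell.
rss_kib "$$"
```

If the value reported for the dsc-pileup process climbs towards the 64 GB of physical RAM while the log lines slow down, swapping is a plausible explanation for the slowdown.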



xAZx commented 4 years ago

For some reason I am having trouble getting a link to work, so I will send it to your email.

ghuls commented 4 years ago

@xAZx I might have a solution for your problem.

In my case dsc-pileup was very slow (it took 200 hours for some test samples).

So I made the following: https://github.com/aertslab/popscle_helper_tools, which lets me run dsc-pileup on the filtered BAM file in only 20 minutes.

$ ./filter_bam_file_for_popscle_dsc_pileup.sh
Usage:   filter_bam_file_for_popscle_dsc_pileup input_bam_filename barcodes_tsv_filename vcf_filename output_bam_filename

Purpose: Filter BAM file for usage with dsc-pileup of popscle by keeping reads that:
           - overlap with SNPs in the VCF file
           - and have a cell barcode contained in the cell barcode list
         Keeping only relevant reads for dsc-pileup can speedup it up several hunderd times.

So for your sample, the following should work.

# Create filtered BAM with only the reads dsc-pileup needs.
./filter_bam_file_for_popscle_dsc_pileup.sh \
    /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/possorted_genome_bam.bam \
    /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/filtered_feature_bc_matrix/barcodes.tsv \
    ~/ALL_SingleCell_data/Tocilizumab/080919-TCZ-genotyping/ucsc.hg38.liftover.out.nochr.vcf \
    /tmp/MS-TCZ-1_pool_1-1.filter_bam_file_for_popscle_dsc_pileup.bam

# Use filtered BAM file for dsc-pileup.
~/popscle/bin/popscle dsc-pileup \
    --sam /tmp/MS-TCZ-1_pool_1-1.filter_bam_file_for_popscle_dsc_pileup.bam \
    --vcf ~/ALL_SingleCell_data/Tocilizumab/080919-TCZ-genotyping/ucsc.hg38.liftover.out.nochr.vcf \
    --group-list /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/filtered_feature_bc_matrix/barcodes.tsv \
    --out /mnt/usb-storage/ihg-client.ucsf.edu/yej/190627_A00269_0205_BHJ7HFDMXX_fastqs_analysis/MS-TCZ-1_pool_1-1/pileup
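
To illustrate the second filtering criterion from the usage text above (keeping reads whose cell barcode is in the barcode list), here is a minimal sketch that works on SAM text. It is not the helper script's actual implementation: `filter_sam_by_barcode` is a hypothetical name, and the sketch assumes the barcode is stored in the `CB:Z:` tag, that the barcode list is an uncompressed, non-empty file with one barcode per line, and that filtering by SNP position is done separately.

```shell
# Hypothetical helper: keep only SAM records whose CB:Z: tag is in a barcode list.
# Assumes a non-empty, uncompressed barcode file with one barcode per line.
filter_sam_by_barcode() {
    # $1: barcode list; SAM text is read from stdin.
    awk -F '\t' '
        NR == FNR { keep[$1] = 1; next }   # first input: load the barcodes
        /^@/      { print; next }          # pass SAM header lines through
        {
            for (i = 12; i <= NF; i++) {   # optional tags start at column 12
                if (substr($i, 1, 5) == "CB:Z:" && (substr($i, 6) in keep)) {
                    print
                    break
                }
            }
        }
    ' "$1" -
}
```

With samtools installed, this could be used as `samtools view -h in.bam | filter_sam_by_barcode barcodes.tsv | samtools view -b -o out.bam -`, where `in.bam` and `out.bam` are placeholder file names.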

josemovi commented 1 year ago

This is a really cool tool! Thanks @ghuls. It works well with a VCF file containing SNPs from the 1000GP, but it doesn't work with another VCF that comes from microarray data. The error message is:

Error: Sorted input specified, but the file out.hg38.vcf has the following out of order record chr1 121275027 JHU_1.120748309 G . . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

Any ideas of what could be happening?
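
The error text says a record was found out of order, so a first diagnostic step is to locate where the VCF stops being coordinate-sorted. The following is a minimal sketch, assuming an uncompressed, tab-separated VCF; `first_unsorted_vcf_record` is a hypothetical helper name. It only checks that each chromosome forms one contiguous block with non-decreasing positions, not the particular chromosome order a given tool may expect.

```shell
# Hypothetical helper: report the first VCF record that breaks coordinate order.
# Prints nothing if every chromosome is contiguous and positions never decrease.
first_unsorted_vcf_record() {
    # $1: uncompressed VCF file
    awk -F '\t' '
        /^#/ { next }                      # skip header lines
        $1 != chrom {
            if ($1 in seen) { print "chromosome reappears: " $1 "\t" $2; exit }
            seen[$1] = 1; chrom = $1; pos = 0
        }
        ($2 + 0) < pos { print "position decreases: " $1 "\t" $2; exit }
        { pos = $2 + 0 }
    ' "$1"
}
```

Running `first_unsorted_vcf_record out.hg38.vcf` on the file named in the error message would show whether the problem is a position going backwards or a chromosome appearing in more than one block.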