natsuhiko / rasqual

Robust Allele Specific Quantification and quality controL

ASVCF Memory Usage #40

Open SaideepGona opened 4 years ago

SaideepGona commented 4 years ago

Are there any guidelines on the memory consumption of createASVCF.sh? I've run out of memory even when allocating 150GB. I can't tell whether this is normal or whether I'm doing something wrong. I think there should be at least some usage guidelines on this.

SaideepGona commented 4 years ago

In addition, ASEReadCounter doesn't output bam files but rather count tables. Is this format accepted automatically? How do we link these together?

natsuhiko commented 4 years ago

Hi,

How many samples do you have?

You have to combine the ASEReadCounter output with the VCF file manually.
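As a rough illustration of what that manual step could look like, here is a minimal Python sketch. It assumes the ASEReadCounter tables are tab-delimited with `contig`, `position`, `refCount` and `altCount` columns, and that the counts should end up as an `AS` subfield (`ref,alt`) appended to each sample's genotype field. The function names `load_ase_counts` and `add_as_field` are hypothetical and not part of RASQUAL or GATK.

```python
import csv
import io


def load_ase_counts(table_text):
    """Map (contig, position) -> (refCount, altCount) for one sample's table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {
        (row["contig"], int(row["position"])): (int(row["refCount"]), int(row["altCount"]))
        for row in reader
    }


def add_as_field(vcf_line, per_sample_counts):
    """Append an AS subfield (ref,alt) to every sample column of one VCF record.

    per_sample_counts is a list of dicts from load_ase_counts, given in the
    same order as the sample columns of the VCF (an assumption of this sketch).
    Sites that are missing from a sample's table get 0,0.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    key = (fields[0], int(fields[1]))   # CHROM, POS
    fields[8] += ":AS"                  # FORMAT column
    for i, counts in enumerate(per_sample_counts):
        ref, alt = counts.get(key, (0, 0))
        fields[9 + i] += f":{ref},{alt}"
    return "\t".join(fields)
```

Header lines (starting with `#`) would need to be passed through unchanged, and contig names must match between the count tables and the VCF (e.g. `chr1` vs `1`).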

Best regards, Natsuhiko

SaideepGona commented 4 years ago

I have 35 samples in this run.

I see. It's somewhat simpler on my end to just run a single memory-heavy job to do the work, but filtering and manual assignment would be the more distributed option.

By the way, I made a fork at https://github.com/SaideepGona/rasqual and have been working on a SLURM-compatible luigi pipeline to help automate the entire process (currently for RNA-seq). As the primary author, you might find this interesting, and I would appreciate your feedback, as there are many moving parts.

SaideepGona commented 4 years ago

I found this: https://github.com/walaj/VariantBam

It filters a BAM file based on a VCF to create a smaller BAM file that can be used instead. I don't know how much of an improvement it will make in practice, but it should help.

SaideepGona commented 4 years ago

So I think the original issue here is solved. I just wanted to follow up and ask about the assay_type parameter. Is it fair to use "atac" mode for other peak-based data? If not, what differences should exist? Thanks!

natsuhiko commented 4 years ago

Sorry for the late reply. I was going to say that you have to split the master VCF into chunks (e.g., 10Mb each) to reduce memory usage.
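One way to generate such chunks is sketched below in Python, assuming you already know the chromosome lengths (for example from the VCF header or a `.fai` index). The function name `chunk_regions` and the example length are made up for illustration; only the 10Mb chunk size comes from the suggestion above.

```python
def chunk_regions(chrom_lengths, chunk_size=10_000_000):
    """Yield 1-based inclusive 'chrom:start-end' strings covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Hypothetical chromosome length, for illustration only.
    for region in chunk_regions({"chr1": 25_000_000}):
        print(region)
```

Each region string could then be used to extract one chunk from a bgzipped, tabix-indexed master VCF (e.g. `tabix -h master.vcf.gz chr1:1-10000000`, where `master.vcf.gz` is a placeholder file name), and each chunk processed as a separate, smaller job.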

You can use the 'atac' option for other peak-based data (such as ChIP-seq, DNase-seq, etc.). The difference between RNA-seq and ATAC-seq is the insert size threshold (paired-end RNA-seq reads easily span 10Kb or more if they are spliced).