nerettilab / RepEnrich2

RepEnrich2 is an updated method to estimate repetitive element enrichment using high-throughput sequencing data.

Estimating progress #10

Closed: re2srm closed this issue 5 years ago

re2srm commented 5 years ago

Hello,

I am running RepEnrich2 on some mouse samples and have a couple of questions. I have completed the RepEnrich2 setup and subset steps, and my setup folder contains ~1300 pseudogenomes. Each of my samples has around 25 million paired-end reads. The RepEnrich2 step has now been running for around 20 hours per sample. As you suggested in a previous answer, I am estimating progress by counting the text files (one per pseudogenome) in the pair_1 or pair_2 subdirectories of the output folder. Right now I have around 250 files, so at this rate each sample would take more than a hundred hours to complete. Am I estimating progress correctly, or did you mean something else?

Since I didn't expect the jobs to take this long, I will resubmit them with an updated time limit. Any tips on how much memory and how many CPUs to assign to each job would also be helpful.
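For reference, this is roughly how I am checking progress (the setup and results paths are placeholders for my actual ones, and I am assuming one .fa file per pseudogenome in the setup folder):

```bash
# Rough progress check: compare the per-pseudogenome count files written
# so far against the total number of pseudogenomes.
# RepEnrich2_setup_mm10 and results are placeholder paths; the
# one-.fa-per-pseudogenome assumption matches what the setup step
# left in my folder.
total=$(ls RepEnrich2_setup_mm10/*.fa 2>/dev/null | wc -l)   # ~1300 pseudogenomes
done_files=$(ls results/pair_1/*.txt 2>/dev/null | wc -l)    # files written so far
echo "progress: ${done_files} / ${total} pseudogenomes"
```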

Another point of confusion is the RepEnrich2 command itself. I am using the following command and have simultaneously submitted one job per sample for my 6 samples (with just the sample name changed in each job submission):

python RepEnrich2.py mm10.fa.out results sample1 sample1_multimap_R1.fastq --fastqfile2 sample1_multimap_R2.fastq sample1_unique.bam --cpus 16 --pairedend TRUE

When I check the 'results' folder I see six regionsorter.txt files, one per sample, but only two temporary subdirectories (pair_1 and pair_2). This makes me think the output of the different samples may be overwriting each other in the shared output folder. Should I specify a different output folder for each sample, or will the script make a subfolder for each sample anyway? The sketch below shows what I have in mind.
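For illustration, each per-sample job script would look something like this (SAMPLE changes per submission; RepEnrich2_setup_mm10 stands in for my actual setup folder, which I pass as a positional argument following the README example):

```bash
# Per-sample job script: giving each sample its own output folder so the
# temporary pair_1/pair_2 directories cannot collide between jobs.
# SAMPLE is changed for each submission; RepEnrich2_setup_mm10 is a
# placeholder for the setup folder produced by RepEnrich2_setup.
SAMPLE=sample1
python RepEnrich2.py mm10.fa.out results_${SAMPLE} ${SAMPLE} \
    RepEnrich2_setup_mm10 \
    ${SAMPLE}_multimap_R1.fastq --fastqfile2 ${SAMPLE}_multimap_R2.fastq \
    ${SAMPLE}_unique.bam --cpus 16 --pairedend TRUE
```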

Thank you!

re2srm commented 5 years ago

The slow runtime turned out to be a problem with my cluster not assigning enough resources. Once that was fixed, each sample completed in around 5 hours. Thanks