sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Can I skip filtering and restart a failed run from the mapping stage? #347

Closed Ni-Ar closed 1 year ago

Ni-Ar commented 1 year ago

Not really a bug, more of a general usage question. Apologies if this is described somewhere; I couldn't find details about it in the Wiki.

I'm running zUMIs on smart-seq3xpress PBMC data (E-MTAB-11452/combinedFCL.read{1,2}.fastq.gz).

The filtering step went fine, but STAR failed because:

Warning messages:
1: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead. 
2: In gsum(n) :
  The sum of an integer column for a group was more than type 'integer' can hold so the result has been coerced to 'numeric' automatically for convenience.

Fatal LIMIT error: the number of junctions to be inserted on the fly =2056910 is larger than the limitSjdbInsertNsj=1000000
SOLUTION: re-run with at least --limitSjdbInsertNsj 2056910

So I'm going to add the following line to the YAML file, which should hopefully solve the problem.

additional_STAR_params: --clip3pAdapterSeq CTGTCTCTTATACACATCT --limitOutSJcollapsed 3000000 --limitSjdbInsertNsj 3000000
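For context, this is roughly where that line sits in my config. A minimal sketch, assuming the layout of the zUMIs example YAML; hPBMCs_RunF.yaml and the reference paths are placeholders, not my real files:

# Show the reference block of the config; key names follow the zUMIs
# example YAML, paths are placeholders.
grep -A 4 "^reference:" hPBMCs_RunF.yaml
# reference:
#   STAR_index: /path/to/STAR_index
#   GTF_file: /path/to/annotation.gtf
#   additional_STAR_params: --clip3pAdapterSeq CTGTCTCTTATACACATCT --limitOutSJcollapsed 3000000 --limitSjdbInsertNsj 3000000
#   additional_files: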

The log output I got so far was:

Mon Jan 30 16:11:39 CET 2023
Filtering...
Tue Jan 31 17:34:58 CET 2023
[1] "13792 barcodes detected."
[1] "221517351 reads were assigned to barcodes that do not correspond to intact cells."
Mapping...
[1] "2023-01-31 17:36:47 CET"
Jan 31 17:36:53 ..... started STAR run
Jan 31 17:37:05 ..... loading genome
Jan 31 17:38:26 ..... processing annotations GTF
Jan 31 17:38:55 ..... inserting junctions into the genome indices
Jan 31 17:45:39 ..... started 1st pass mapping
Feb 01 16:32:43 ..... finished 1st pass mapping
Wed Feb  1 16:33:24 CET 2023

In order to save time and resources, if I keep my out_dir path the same and re-run with the same yaml parameters, can I skip the Filtering step and repeat only the Mapping?

I believe this can be done with the following setting, right?

which_Stage: Mapping
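In other words, something like the sketch below: a rough outline of the resume I have in mind, assuming the run-specific YAML that zUMIs wrote into out_dir is called hPBMCs_RunF.run.yaml and that the launcher takes a -y flag for the config (both assumptions on my side; the exact invocation may differ per installation).

# Switch the stage so the finished Filtering output in the same out_dir is
# reused and only Mapping (plus the downstream stages) is repeated.
sed -i 's|which_Stage:.*|which_Stage: Mapping|' hPBMCs_RunF.run.yaml

# Relaunch zUMIs with the edited run YAML (assumed launch command).
bash zUMIs.sh -y hPBMCs_RunF.run.yaml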

Versions:

zUMIs v2.9.7d
STAR v2.7.1a

CentOS release 7.2

Thanks a lot! Nicco

cziegenhain commented 1 year ago

Hi Nicco,

Sorry about the STAR failure! You can also set the limits even higher; it shouldn't hurt. Exactly, you can just set the stage to Mapping as you found.

Just make sure you edit the changes into the .run.yaml that zUMIs created when you first started the run; there are some things that get added by the pipeline in there.

Small side note: depending on how many resources (threads/memory) you make available to zUMIs, rerunning from Filtering can still be faster, because it is then possible to run several STAR jobs in parallel.
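To see what you gave the run, you could check the resource keys in the YAML. A quick sketch, assuming the standard top-level num_threads and mem_limit keys; the values shown are just examples, not a recommendation:

# The resource settings meant here are the top-level threads / memory keys
# in the YAML (example values only).
grep -E "num_threads|mem_limit" hPBMCs_RunF.run.yaml
# num_threads: 32
# mem_limit: 64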

Let me know if I can help otherwise! Best, Christoph

Ni-Ar commented 1 year ago

Oops, I deleted the .run.yaml file as I thought it might cause conflicts in the re-run 🤦‍♂️ I increased the limits anyway and asked for more resources; the re-run job is still queued. I'll see how things develop. Thanks for the help!

Ni-Ar commented 1 year ago

Hi again,

I restarted the Mapping stage without the .run.yaml file, but the resources allocated to the job ran out shortly after mapping ended and Counting started. I believe the STAR mapping itself went well, as I was watching the .Out.log file and it showed no errors.

So I restarted the Counting stage (this time with the .run.yaml file present) and got the following error:

[E::hts_open_format] Failed to open file "/path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam.tmp.NNNN.bam" : File exists

where "NNNN" applies for files 0000, 0015, 0004, 0011, 0013... (it looks random to me).

If I do the following:

ls -l /path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam.tmp* | wc -l
704

There are 704 temp files.

In addition, I get this in the error log:

[E::hts_open_format] Failed to open file "/path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam" : No such file or directory
samtools index: failed to open "/path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam": No such file or directory

Meaning something failed while creating the sorted output?

And the error log continues with:

Error in value[[3L]](cond) : 
  failed to open BamFile: file(s) do not exist:
  ‘/path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam’
Calls: reads2genes_new ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
Loading required package: yaml
Loading required package: Matrix
Warning message:
package ‘R6’ was built under R version 4.0.5 
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/path/to/HagemannJensen_2022/output_dir/zUMIs_output/expression/hPBMCs_RunF.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

and

Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/path/to/HagemannJensen_2022/output_dir/zUMIs_output/expression/hPBMCs_RunF.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

which I assume are downstream errors caused by the fact that the sorted BAM file was never generated.

Any solution, besides re-running everything from the beginning, which is going to take ~5 days?
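The only workaround I can think of, untested and just a sketch: if the "File exists" errors are simply samtools sort refusing to overwrite the stale temporary chunks left behind by the aborted run, removing those chunks before restarting the Counting stage might be enough.

# Untested idea: delete only the leftover sort chunks from the aborted run
# so the retry can recreate them, then restart from the Counting stage.
rm /path/to/HagemannJensen_2022/output_dir/hPBMCs_RunF.filtered.Aligned.GeneTagged.sorted.bam.tmp.*.bam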

Thanks a lot! Nicco

cziegenhain commented 1 year ago

Hi Nicco,

Sorry, I must have totally missed your message here! The tmp files you see are from samtools sort, so that must have aborted somehow. From your description it's a bit unclear whether the mapping output is really fine or not, so the safest route I would suggest is to bite the bullet and run it from the top. Make sure that you have sufficient open file handles, so that the samtools sort step can allocate as many tmp files as it needs; you can check and raise the limit with the ulimit -n command in your shell.
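For example, a rough sketch (the limit you actually need depends on how many tmp chunks samtools sort creates, so go well above the ~700 you saw):

# Check the current per-process limit on open file descriptors
ulimit -n

# Raise the soft limit for this shell session before launching zUMIs
# (cannot exceed the hard limit, which you can check with ulimit -Hn)
ulimit -n 4096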

Anyway, since I am so late to reply here, you have probably already managed to solve this, but let me know if I can help further!

best, Christoph

Ni-Ar commented 1 year ago

Hey, yes, I re-ran everything end-to-end and it worked fine. Thanks a lot, Nicco