sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
271 stars 67 forks source link

zUMIs with velocity #114

Closed hmassalha closed 5 years ago

hmassalha commented 5 years ago

Hi all, We are trying to run velocity using the BAM files that we got from the zUMIs pipeline. One of the requests of velocity is one BAM file the is including data for exons and introns. We are wondering if we can use the BAM files that are produced through the zUMIs pipeline. We have 3 BAM files:

  1. '.aligned.sorted.bam'
  2. '.aligned.sorted.bam.ex.featureCounts'
  3. '.aligned.sorted.bam.in.featureCounts'

What is the difference between the three files? How we can merge exons data with introns data in one BAM file? Any additional tip on how to use the zUMIs output for velocity is appreciated.

Thanks a lot, HM

cziegenhain commented 5 years ago

Hi,

I just implemented to run RNA velocity within zUMIs. Please set the option in the YAML file to "velocyto: yes" and rerun 👍 (This assumes that velocyto is installed in your system, otherwise zUMIs will still leave a velocyto-compatible bam file for you)

Let me know how it goes!

hmassalha commented 5 years ago

Thanks a lot, I will try it and update you. HM

hmassalha commented 5 years ago

Dear Christoph,

I have a few technical questions with regard to the YAML file.

  1. the field 'zUMIs_directory' in the YAML file is automatically updated to my working dir! while I have the zUMIs_dir in another directory on the server. Each time I call the YAML file I have to reupdate this field. Is that OK?. I keep getting the following error tee:/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180509_NB551168_0120_AHTKN2BGX5_human5_output/zUMIs_runlog.txt: No such file or directory, and I know that I don't have the zUMIs_runlog.txt file in my working dir. Is this related to the 'zUMIs_directory' field in the YAML file?
  2. The link for the zUMIs2.4.0 in the GitHub homepage is not working!
  3. My experiment was done using MARseq protocol. I have plate barcodes on top of the UMI and the well barcode. How I can make the YAML file to process this information? do I have to split the fastq files differently?
  4. I keep getting a strange character '\r' in the output message mkdir: cannot create directory ‘/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180509_NB551168_0120_AHTKN2BGX5_human5_output\r/zUMIs_output/’: No such file or directory. I rechecked all my command lines and I don't have such character. Why do I keep getting this? (I am using windows system)

Thanks a lot, HM

hmassalha commented 5 years ago

Hi, I have updates that might save you time answering the above questions. I managed to run the YAML file, and I am no longer facing problem 1,3,4. The \r is a hidden character from windows!

My new question is the following: In the lab we share sequencing runs, meaning my bcl files containing data that is not related to my experiment. could you help me please in who I should run my bcl2fastq command? and where I should add my barcodes in the YAML file? I mean: file1 is read 1 [BC(1-6), UMI(7-16)] file 2 is read 2 [cDNA(1-66)] I am missing the plate barcode which is supposed to be in read 1 if I am not mistaken!

Where I should add the sequences of the plate barcodes? and the well barcodes?

Thanks a lot, HM

cziegenhain commented 5 years ago

Hi HM,

sorry for the slow replies. Just a quick re to 1.: as you probably saw its the -d flag when calling zUMIs from the command line. It was necessary to pass this as a shell option unfortunately but zUMIs will always keep your YAML in sync with what you had passed in the shell call.

  1. Thanks for the hint, I fixed that now!

  2. I am not exactly certain from the top of my head how the plate barcode is in MARS-seq. In SCRB-seq, it is the i7 illumina index. To get that, you would run bcl2fastq with the following additional flag: --create-fastq-for-index-reads then the additional plate-BC fastq file would just be added as file3 in the YAML setup, eg with BC(1-8). If there are other data on your sequencing runs you could try to create a sample-sheet for the bcl2fastq to get rid of unncessary data that belongs to other people - but for that you would need to know your & their barcodes maybe?

Hope this helps, Christoph

hmassalha commented 5 years ago

Hi Christoph, Thanks for your reply. I got an error during the run and here is the output that I got:

`Mon Apr 1 11:58:20 IDT 2019 Filtering... Mon Apr 1 14:56:35 IDT 2019 Mapping...

EXITING: FATAL INPUT ERROR: unrecoginzed parameter name "readFilesType" in input "Command-Line-Initial" SOLUTION: use correct parameter name (check the manual)

Apr 01 14:56:36 ...... FATAL ERROR, exiting Mon Apr 1 14:56:36 IDT 2019 Counting... Error in file(filename, "r", encoding = encoding) : cannot open the connection Calls: source -> file In addition: Warning message: In file(filename, "r", encoding = encoding) : cannot open file '/runfeatureCountFUN.R': No such file or directory Execution halted Mon Apr 1 14:56:37 IDT 2019 [1] "2019-04-01 14:56:38 IDT" [1] "Preparing bam file for velocyto..." [E::hts_open_format] Failed to open file /home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.filtered.tagged.Aligned.out.bam samtools view: failed to open "/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.filtered.tagged.Aligned.out.bam" for reading: No such file or directory [1] "2019-04-01 14:56:38 IDT" Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, : File '/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/zUMIs_output/human4kept_barcodes.txt' does not exist. Include one or more spaces to consider the input a system command. Execution halted Descriptive statistics... [1] "I am loading useful packages for plotting..." [1] "2019-04-01 14:56:38 IDT" Error in file(filename, "r", encoding = encoding) : cannot open the connection Calls: source -> file In addition: Warning message: In file(filename, "r", encoding = encoding) : cannot open file '/statsFUN.R': No such file or directory Execution halted Mon Apr 1 14:56:40 IDT 2019`

I have another suggestion, would it be possible to write output to the '.zUMIs_runlog.txt' file during the run, such as '[1] "I am loading useful packages for plotting..."'? this is important for us as users to see that the zUMIs is working (making something) and don't have a problem with the server or job submission.

Thanks, HM

cziegenhain commented 5 years ago

hey,

this looks like there is still something odd with your path to the zUMIs folder. Please double check! eg: cannot open file '/runfeatureCountFUN.R': No such file or directory

Thanks for your suggestion to write all verbose into a file additionally, we might add that at a later time

hmassalha commented 5 years ago

Thanks for the quick reply,

I was missing the 'zUMIs_directory' line in the YAML file. Do you think this is the problem? The path for the zUMIs directory is OK since I got the correct parameters that I provided in the '.zUMIs_runlog.txt' file. I just rerun it and I will update you of what I will get,

Thanks, HM

hmassalha commented 5 years ago

Hi Christoph, Adding the 'zUMIs_directory' lines in the YAML file did help, however, I still facing another problem. Here is the output that I got:

EXITING: FATAL INPUT ERROR: unrecoginzed parameter name "readFilesType" in input "Command-Line-Initial"
SOLUTION: use correct parameter name (check the manual)

Apr 01 22:26:30 ...... FATAL ERROR, exiting
Mon Apr  1 22:26:30 IDT 2019
Counting...
[1] "2019-04-01 22:26:39 IDT"

Read 35.8% of 6164808 rows
Read 51.4% of 6164808 rows
Read 85.5% of 6164808 rows
Read 6164808 rows and 2 (of 2) columns from 0.082 GB file in 00:00:05
[1] "96520 barcodes detected."
Error in loadNamespace(name) : there is no package called ‘ggrastr’
Calls: cellBC ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
Mon Apr  1 22:28:23 IDT 2019
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2019-04-01 22:28:23 IDT"
Error in .guessPackageName(env) : 
  none of the objects in the source code could be found:  need to attach or specify the package
Calls: suppressMessages ... withCallingHandlers -> insertSource -> .guessPackageName
Execution halted
Mon Apr  1 22:28:24 IDT 2019

Any suggestion...

Thanks a lot, HM

cziegenhain commented 5 years ago

That reads like your star version is too old. Which one are you running? You are also lacking an R package further down, so please check carefully here: https://github.com/sdparekh/zUMIs/wiki/Installation

hmassalha commented 5 years ago

Hi, I am working with STAR 2.5.3a

I ran again the pipeline after updating the packages of R and I get the following errors:

` CPU time : 50748.70 sec. Max Memory : 1349 MB Average Memory : 195.26 MB Total Requested Memory : 72000.00 MB Delta Memory : 70651.00 MB Max Swap : - Max Processes : 48 Max Threads : 103 Run time : 5696 sec. Turnaround time : 5699 sec.

The output (if any) follows:


Good news! A newer version of zUMIs is available at https://github.com/sdparekh/zUMIs


You provided these parameters: YAML file: /home/labs/shalev/hassanm/.Documents/temp_OD/jobs/YAML_win.yaml zUMIs directory: /home/labs/shalev/hassanm/NGS/zUMI/zUMIs_2_4_0/zUMIs-master STAR executable STAR samtools executable samtools pigz executable pigz Rscript executable Rscript RAM limit: null zUMIs version 2.4.0e

Fri Apr 5 08:08:35 IDT 2019 Filtering... Fri Apr 5 09:41:50 IDT 2019 Mapping...

EXITING: FATAL INPUT ERROR: unrecoginzed parameter name "readFilesType" in input "Command-Line-Initial" SOLUTION: use correct parameter name (check the manual)

Apr 05 09:41:52 ...... FATAL ERROR, exiting Fri Apr 5 09:41:52 IDT 2019 Counting... [1] "2019-04-05 09:42:01 IDT" [1] "Warning! None of the annotated barcodes were detected." [1] "Continuing with top 100 barcodes instead..." [1] "238710457 reads were assigned to barcodes that do not correspond to intact cells." [1] "Inf Reads per chunk" [1] "Loading reference annotation from:" [1] "/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.final_annot.gtf" [1] "Annotation loaded!" [1] "Assigning reads to features (ex)"

ERROR: invalid parameter: '/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.filtered.tagged.Aligned.out.bam'

Error in file(file, "rt") : cannot open the connection Calls: .runFeatureCount -> -> read.delim -> read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file './.Rsubread_featureCounts_pid358782': No such file or directory Execution halted Fri Apr 5 09:43:25 IDT 2019 [1] "2019-04-05 09:43:25 IDT" [1] "Preparing bam file for velocyto..." [E::hts_open_format] Failed to open file /home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.filtered.tagged.Aligned.out.bam samtools view: failed to open "/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.filtered.tagged.Aligned.out.bam" for reading: No such file or directory [1] "2019-04-05 09:43:25 IDT" [1] "Attempting to run RNA velocity..." which: no velocyto in (/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/apps/RH7U2/gnu/R/3.4.3/bin:/apps/RH7U2/general/zUMIs/2018-03-04/:/apps/RH7U2/gnu/STAR/2.5.2b/bin:/apps/RH7U2/gnu/mysql/5.7.16/bin:/apps/RH7U2/intel/pigz/2.3.4/bin:/apps/RH7U2/gnu/samtools/1.9/bin:/apps/RH7U2/gnu/bzip2/1.0.6/bin:/apps/RH7U2/gnu/star/2.5.3a/bin:/apps/RH7U2/gnu/R/3.5.0/bin:/apps/RH7U2/general/zUMIs/2019-04-02/:/apps/RH7U2/gnu/bcl2fastq/2.20.0.422/bin:/apps/RH7U2/general/cuda/7.5/bin:/apps/RH7U2/general/matlab/R2018a/bin/glnxa64:/apps/RH7U2/general/matlab/R2018a/bin:/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/etc:/usr/lpp/mmfs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/lpp/mmfs/bin:/home/labs/shalev/hassanm/bin) [1] "No velocyto installation found in path. Please install it via pip." [1] "2019-04-05 09:43:25 IDT" Descriptive statistics... [1] "I am loading useful packages for plotting..." [1] "2019-04-05 09:43:26 IDT" Error in .guessPackageName(env) : none of the objects in the source code could be found: need to attach or specify the package Calls: suppressMessages ... withCallingHandlers -> insertSource -> .guessPackageName Execution halted Fri Apr 5 09:43:27 IDT 2019 `

I see a few things from this report:

  1. filtering was good
  2. mapping is problematic. I am using the STAR program and the index file that worked for me with the old version of zUMIs.
  3. the barcodes were not detected!. I am using MARSseq with demultiplexing. Read 1 contains plate barcode and cDNA [BC(6-9), cDNA(10-66)], and read 2 contains well barcode and UMI [BC(1-7), UMI(8-15)]. My question is the following: since I have two barcodes (one on each strand) and the YAML file accepting one barcode file. Should the YAML file contain the concatenated two barcodes of both reads? if yes, Dose that means that the YAML file should contain 384 well barcodes * 24 plate barcodes combinations?

Thanks a lot, HM

cziegenhain commented 5 years ago

Hi,

Your star version is simply too old. https://github.com/sdparekh/zUMIs/wiki/Installation

For the barcode error, yes the barcode strings you give (one BC per line) needs to be the concatenated plateBC+cellBC. https://github.com/sdparekh/zUMIs/wiki/Barcodes

hmassalha commented 5 years ago

Hi, I just got a full run. and here is the output

Successfully completed.

Resource usage summary:

CPU time :                                   121812.00 sec.
Max Memory :                                 70935 MB
Average Memory :                             18483.89 MB
Total Requested Memory :                     150000.00 MB
Delta Memory :                               79065.00 MB
Max Swap :                                   -
Max Processes :                              117
Max Threads :                                289
Run time :                                   10718 sec.
Turnaround time :                            10722 sec.

The output (if any) follows:


Good news! A newer version of zUMIs is available at https://github.com/sdparekh/zUMIs


You provided these parameters: YAML file: /home/labs/shalev/hassanm/.Documents/temp_OD/jobs/YAML_win.yaml zUMIs directory: /home/labs/shalev/hassanm/NGS/zUMI/zUMIs_2_4_0/zUMIs-master STAR executable /home/labs/shalev/hassanm/NGS/zUMI/STAR_2_7/STAR/bin/Linux_x86_64_static/STAR samtools executable samtools pigz executable pigz Rscript executable Rscript RAM limit: null zUMIs version 2.4.0e

Mon Apr 8 08:44:47 IDT 2019 Filtering... Mon Apr 8 09:39:06 IDT 2019 Mapping... Apr 08 09:39:08 ..... started STAR run Apr 08 09:39:09 ..... loading genome Apr 08 09:39:23 ..... processing annotations GTF Apr 08 09:39:35 ..... inserting junctions into the genome indices Apr 08 09:41:28 ..... started 1st pass mapping Apr 08 10:00:37 ..... finished 1st pass mapping Apr 08 10:00:38 ..... inserting junctions into the genome indices Apr 08 10:02:00 ..... started mapping Apr 08 10:21:41 ..... finished mapping Apr 08 10:21:41 ..... finished successfully Mon Apr 8 10:21:41 IDT 2019 Counting... [1] "2019-04-08 10:21:51 IDT" [1] "36399086 reads were assigned to barcodes that do not correspond to intact cells." [1] "Inf Reads per chunk" [1] "Loading reference annotation from:" [1] "/home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/human4.final_annot.gtf" [1] "Annotation loaded!" [1] "Assigning reads to features (ex)"

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   Rsubread 1.28.1
//========================== featureCounts setting ===========================\ Input files : 1 BAM file S /home/labs/shalev/hassanm/NGS/All_humans_d ...
Dir for temp files : .
Threads : 19
Level : meta-feature level
Paired-end : yes
Strand specific : stranded
Multimapping reads : primary only
Multi-overlapping reads : not counted
Min overlapping bases : 1
Chimeric reads : not counted
Both ends mapped : not required

\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\ Load annotation file ./.Rsubread_UserProvidedAnnotation_pid33992 ... Features : 576330 Meta-features : 49614 Chromosomes/contigs : 47
Process BAM file /home/labs/shalev/hassanm/NGS/All_humans_data_for_vel ...
Single-end reads are included.
Assign reads to features...
Total reads : 282130265
Successfully assigned reads : 132509211 (47.0%)
Running time : 9.06 minutes
Read assignment finished.

\===================== http://subread.sourceforge.net/ ======================//

[1] "Assigning reads to features (in)"

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   Rsubread 1.28.1
//========================== featureCounts setting ===========================\ Input files : 1 BAM file S /home/labs/shalev/hassanm/NGS/All_humans_d ...
Dir for temp files : .
Threads : 19
Level : meta-feature level
Paired-end : yes
Strand specific : stranded
Multimapping reads : primary only
Multi-overlapping reads : not counted
Min overlapping bases : 1
Chimeric reads : not counted
Both ends mapped : not required

\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\ Load annotation file ./.Rsubread_UserProvidedAnnotation_pid33992 ... Features : 275949 Meta-features : 28257 Chromosomes/contigs : 36
Process BAM file /home/labs/shalev/hassanm/NGS/All_humans_data_for_vel ...
Single-end reads are included.
Assign reads to features...
Total reads : 282130265
Successfully assigned reads : 27069971 (9.6%)
Running time : 9.02 minutes
Read assignment finished.

\===================== http://subread.sourceforge.net/ ======================//

[bam_sort_core] merging from 20 files and 10 in-memory blocks... [bam_sort_core] merging from 20 files and 10 in-memory blocks... [1] "Here are the detected subsampling options:" [1] "Automatic downsampling" [1] "Working on barcode chunk 1 out of 1" [1] "Processing 9020 barcodes in this chunk..." [1] "2019-04-08 11:31:50 IDT" [1] "I am done!! Look what I produced.../home/labs/shalev/hassanm/NGS/All_humans_data_for_velocity/180326_NB551168_0097_AHKHNKBGX5_human4_output/zUMIs_output/" used (Mb) gc trigger (Mb) max used (Mb) Ncells 4839517 258.5 9968622 532.4 8273852 441.9 Vcells 1005052183 7668.0 3101409291 23661.9 3876761302 29577.4 Mon Apr 8 11:31:52 IDT 2019 [1] "2019-04-08 11:31:52 IDT" [1] "Preparing bam file for velocyto..." [bam_sort_core] merging from 0 files and 19 in-memory blocks... [1] "2019-04-08 11:43:21 IDT" [1] "Attempting to run RNA velocity..." which: no velocyto in (/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/apps/RH7U2/gnu/R/3.4.3/bin:/apps/RH7U2/general/zUMIs/2018-03-04/:/apps/RH7U2/gnu/STAR/2.5.2b/bin:/apps/RH7U2/gnu/mysql/5.7.16/bin:/apps/RH7U2/intel/pigz/2.3.4/bin:/apps/RH7U2/gnu/samtools/1.9/bin:/apps/RH7U2/gnu/bzip2/1.0.6/bin:/apps/RH7U2/gnu/star/2.5.3a/bin:/apps/RH7U2/gnu/R/3.5.0/bin:/apps/RH7U2/general/zUMIs/2019-04-02/:/apps/RH7U2/gnu/bcl2fastq/2.20.0.422/bin:/apps/RH7U2/general/cuda/7.5/bin:/apps/RH7U2/general/matlab/R2018a/bin/glnxa64:/apps/RH7U2/general/matlab/R2018a/bin:/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/etc:/usr/lpp/mmfs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/lpp/mmfs/bin:/home/labs/shalev/hassanm/bin) [1] "No velocyto installation found in path. Please install it via pip." [1] "2019-04-08 11:43:22 IDT" Descriptive statistics... [1] "I am loading useful packages for plotting..." [1] "2019-04-08 11:43:22 IDT" Error in .guessPackageName(env) : none of the objects in the source code could be found: need to attach or specify the package Calls: suppressMessages ... withCallingHandlers -> insertSource -> .guessPackageName Execution halted Mon Apr 8 11:43:23 IDT 2019

It seems that the run went ok, thanks for the help. One small question regarding the barcodes. I have 2 barcodes (plate and cells barcodes) one on each read. Since the yaml file should have a path for merged barcodes file, Does the order of the file1 [BC(1-4)] and file2 [BC(1-7)] should be the same order of the concatenating barcodes file [BC(1-4) BC(1-7)]?

Thanks a lot, HM

cziegenhain commented 5 years ago

Looks like zUMIs is doing what it should so I am closing this issue! Yes, the order of the concatenated barcode strings and the order of barcode ranges given in the file setup matter and should match!

Best Christoph