sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Multiple fastq files (multiple samples) #49

Closed: gk-bioin4m8x closed this issue 6 years ago

gk-bioin4m8x commented 6 years ago

Hi @sdparekh @cziegenhain, I have followed https://github.com/sdparekh/zUMIs/wiki/Usage . However, if I have multiple fastq.gz files (multiple samples), how should I start? Can I pass the folder containing all the fastq files (both transcript and barcode fastq) to the bash script with *? Will it automatically detect from the names which two files belong together (e.g. that 3_R1.fastq.gz for barcode and 3_R2.fastq.gz for transcript both belong to sample 3)? Please guide.

Thanks.

sdparekh commented 6 years ago

Hi,

You can merge multiple samples and give the sample barcodes in a text file with the -b option. If you don't want to merge, you can run zUMIs in a for loop over the samples. It is more efficient to merge the samples and run them together; you can separate them later from the count tables.

Best, Swati
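
A minimal sketch of the merge route mentioned above, not taken from the zUMIs documentation. It assumes each sample has a <sample>_R1.fastq.gz (barcode) and a <sample>_R2.fastq.gz (cDNA) file, as in the question; merge_fastq and the output file names are made up for illustration. Gzip files can be concatenated as they are, so nothing needs to be decompressed, but both merged files have to be built in the same sample order so that the read pairs stay aligned.

```shell
# Sketch only: merge per-sample FASTQ files into one barcode file and one cDNA file.
# Assumption: files are named <sample>_R1.fastq.gz (barcode) and <sample>_R2.fastq.gz (cDNA).

# merge_fastq IN_DIR OUT_DIR  (hypothetical helper, not part of zUMIs)
merge_fastq() {
  local in_dir=$1 out_dir=$2 r1 r2
  mkdir -p "$out_dir"
  # Start from empty output files so a rerun does not append twice.
  : > "$out_dir/merged_barcode.fastq.gz"
  : > "$out_dir/merged_cDNA.fastq.gz"
  # Loop over the R1 files so both outputs are built in the same sample order.
  for r1 in "$in_dir"/*_R1.fastq.gz; do
    r2=${r1%_R1.fastq.gz}_R2.fastq.gz
    [ -f "$r2" ] || { echo "missing mate for $r1" >&2; return 1; }
    # Concatenated gzip members form a valid gzip stream.
    cat "$r1" >> "$out_dir/merged_barcode.fastq.gz"
    cat "$r2" >> "$out_dir/merged_cDNA.fastq.gz"
  done
}
```

The merged files would then be passed to -f and -r in a single zUMIs run, with the sample barcodes supplied through -b.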

gk-bioin4m8x commented 6 years ago

Thanks @sdparekh ! I will try that.

gk-bioin4m8x commented 6 years ago

Hi @sdparekh ,

I ran the zUMIs bash script on the merged files. It ran without any errors, but I got neither the .featureCounts files under the "out" dir nor anything under "out/zUMIs_output/stats/" and "out/zUMIs_output/expression/".

I re-ran the script with -w Counting added, but it replaced the filtered files with their shortcuts under "out/zUMIs_output/filtered_fastq/".

Please guide.

Thanks.

cziegenhain commented 6 years ago

Hey,

That's odd. The fastq files get moved when zUMIs thinks it has finished and should clean up the folder.

I think to troubleshoot this it would be great if you could attach

gk-bioin4m8x commented 6 years ago

@cziegenhain Thanks. Sorry, that was after the -w Summarizing option.

After the -w Counting option, I got an error on standard output related to the usage of perl fqcheck.pl.

I have already lost my filtered files, so I need to run zUMIs again.

Do you have any idea why I am not getting the .featureCounts files and the files under the stats folder, so that I can edit my script?

Here are the commands inside my bash script:

p=/path/to/zUMIs      
e=/path/to/project

bash $p/zUMIs-master.sh -f $e/merged_barcode.fastq.gz -r $e/merged_cDNA.fastq.gz -n project -g /path/to/star4zUMIs -a /path/to/.gtf  -c 1-6 -m 1-8 -l 75 -p 8 -b $e/my_zUMIs/barcode_samples.txt -o $e/my_zUMIs/out

OS: Linux 2014 x86_64
R version: 3.4.3

cziegenhain commented 6 years ago

Thanks for the feedback. I am a bit confused, to be honest.

What version of Rsubread are you using?

If you could rerun everything from the beginning and send the log files with standard output and errors, that would be very helpful. It would also be good to see the folder contents after the run, with file sizes (ls -sh).
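
One possible way to collect what is being asked for here, as a sketch rather than anything zUMIs provides: run_logged is a hypothetical wrapper that writes standard output and standard error of any command to separate log files and records the exit status. The log file names are arbitrary.

```shell
# run_logged LOG_PREFIX COMMAND [ARGS...]  (hypothetical helper, not part of zUMIs)
# Writes stdout to LOG_PREFIX.stdout.log and stderr to LOG_PREFIX.stderr.log.
run_logged() {
  local prefix=$1 status
  shift
  "$@" > "$prefix.stdout.log" 2> "$prefix.stderr.log"
  status=$?
  # Record the exit status next to the error messages for later inspection.
  echo "exit status: $status" >> "$prefix.stderr.log"
  return "$status"
}

# Usage with the command from this thread (paths are placeholders):
#   run_logged zUMIs_run bash "$p/zUMIs-master.sh" -f ... -r ... -o "$e/my_zUMIs/out"
#   ls -sh "$e/my_zUMIs/out" > zUMIs_run.folder_contents.txt
```

Keeping the two streams in separate files makes it easier to spot messages such as shell errors that would otherwise scroll past on the terminal.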

sdparekh commented 6 years ago

The issue is your path variable. You should not give "/path/to"; you should give the real path to that folder. For instance, if I stored zUMIs in a folder named projects under the /data directory, I would give the absolute path to zUMIs like this: /data/projects/zUMIs. Otherwise the program cannot identify the paths to the inputs, the output directory and zUMIs.

gk-bioin4m8x commented 6 years ago

@sdparekh I did give the real paths; that was only for demo purposes. :-)

sdparekh commented 6 years ago

Haha!! Okay, so we are still confused. I cannot tell without knowing the error causing your issue. Can you please do as Christoph said :)

gk-bioin4m8x commented 6 years ago

Yes, I am doing that. zUMIs is running from the beginning :-) I will let you know as soon as it has finished.

gk-bioin4m8x commented 6 years ago

@cziegenhain @sdparekh Still the same issues: 1) No ex.featureCounts, in.featureCounts and Rplots.pdf files under the "out" dir 2) No files under "out/zUMIs_output/stats/" and "out/zUMIs_output/expression/"

Rsubread version 1.28.1

Script:

p=/path/to/zUMIs      
e=/path/to/project

bash $p/zUMIs-master.sh -i $p -V /home/me/R-3.4.3/bin -f $e/merged_barcode.fastq.gz -r $e/merged_cDNA.fastq.gz -n project -g /path/to/star4zUMIs -a /path/to/.gtf  -c 1-6 -m 1-8 -l 75 -p 8 -b $e/my_zUMIs/barcode_samples.txt -o $e/my_zUMIs/out

Folder contents:

$ cd my_zUMIs/out
$ ls -sh
total 59G
 19G project.aligned.sorted.bam              28K project.Log.out            512 project._STARgenome
 41G project.barcodelist.filtered.sort.sam   17K project.Log.progress.out   512 project._STARpass1
2.0K project.Log.final.out                  7.9M project.SJ.out.tab         512 zUMIs_output

$ cd zUMIs_output
$ ls -sh
total 512
  0 expression  512 filtered_fastq    0 stats

$ cd filtered_fastq
$ ls -sh
total 17G
3.7G project.barcoderead.filtered.fastq.gz   13G project.cdnaread.filtered.fastq.gz

Please let me know further.

Thanks.

gk-bioin4m8x commented 6 years ago

@cziegenhain @sdparekh Any updates please?

cziegenhain commented 6 years ago

Hey, I don't see any obvious mistake so far, but you again did not post the verbose output of zUMIs, so we can't know for sure.

Can we also see the content of the STAR report? project.Log.final.out

gk-bioin4m8x commented 6 years ago

@cziegenhain Here it is:

zUMIs version 0.0.6c 

Raw reads: <some number> 
Filtered reads: <some number> 

Make sure you have approximately 71677 Mb RAM available
..... started STAR run
..... loading genome
..... processing annotations GTF
..... inserting junctions into the genome indices
..... started 1st pass mapping
..... finished 1st pass mapping
..... inserting junctions into the genome indices
..... started mapping
..... finished successfully
[bam_sort_core] merging from 32 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
/zUMIs/zUMIs-noslurm.sh: line 112: /home/me/R-3.4.3/bin: is a directory
/zUMIs/zUMIs-noslurm.sh: line 116: /home/me/R-3.4.3/bin: is a directory

I think I should have given the path to R like this: /home/me/R-3.4.3/bin/R

I don't want to start zUMIs from the beginning, so should I resume by adding the R path and -w Counting to the script above? Do I need to do anything else?

p=/path/to/zUMIs      
e=/path/to/project

bash $p/zUMIs-master.sh -i $p -V /home/me/R-3.4.3/bin/R -f $e/merged_barcode.fastq.gz -r $e/merged_cDNA.fastq.gz -n project -g /path/to/star4zUMIs -a /path/to/.gtf  -c 1-6 -m 1-8 -l 75 -p 8 -b $e/my_zUMIs/barcode_samples.txt -o $e/my_zUMIs/out -w Counting

Please guide.

Thanks.

cziegenhain commented 6 years ago

Yes, that seems to be the problem! It should work to resume the processing using the correct path and -w Counting!
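
The "is a directory" messages in the log above come from the shell trying to execute a directory. A small pre-flight check along these lines would catch that before a long run starts; check_executable is a hypothetical helper, not a zUMIs feature.

```shell
# check_executable PATH  (hypothetical pre-flight helper, not part of zUMIs)
# Succeeds only if PATH is a regular file that can be executed.
check_executable() {
  local path=$1
  if [ -d "$path" ]; then
    echo "error: $path is a directory, expected a program" >&2
    return 1
  fi
  if [ ! -f "$path" ] || [ ! -x "$path" ]; then
    echo "error: $path is missing or not executable" >&2
    return 1
  fi
}

# Usage (placeholder path): stop early instead of failing after mapping.
#   check_executable "$r_path" || exit 1
```

Note that this only confirms the path points at something runnable, not that it is the right program for the pipeline.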

gk-bioin4m8x commented 6 years ago

Ok, thanks.

gk-bioin4m8x commented 6 years ago

@cziegenhain I did that and it has been running since yesterday morning. How long should it take?

cziegenhain commented 6 years ago

zUMIs is usually very fast. However, it all depends on the number of reads and on your machine's configuration and load. Also note that Hamming distance operations are computationally costly, in case you are using those settings.

gk-bioin4m8x commented 6 years ago

Ok. I did not use the Hamming distance option (-H).

gk-bioin4m8x commented 6 years ago

@cziegenhain Just an update: the zUMIs run that I started 3 days ago with -w Counting is still running. It has created shortcuts for two files under the "out" folder (project.aligned.sorted.bam.in and project.aligned.sorted.bam.ex, 1 KB each), but there is still nothing under out/zUMIs_output/stats and out/zUMIs_output/expression. Below are the details from the log file, which has not been updated since then (although if I try to download the .in and .ex files above, they are quite big):

Your jobs will run on this machine. 

Make sure you have more than 31G RAM and 8 processors available. 

Your jobs will be started from counting. 

 You provided these parameters:
 SLURM workload manager:    no
 Summary Stats to produce:  yes
 Start the pipeline from:   counting
 A custom mapped BAM:       NA
 Custom filtered FASTQ:     no
 Barcode read:          $e/merged_barcode.fastq.gz
 cDNA read:         $e/merged_cDNA.fastq.gz
 Study/sample name:     project
 Output directory:      $e/my_zUMIs/out
 Cell/sample barcode range: 1-6
 UMI barcode range:     1-8
 Retain cell with >=N reads:    100
 Genome directory:      star4zUMIs
 GTF annotation file:       my.gtf
 Number of processors:      8
 Read length:           75
 Strandedness:          0
 Cell barcode Phred:        20
 UMI barcode Phred:     20
 # bases below phred in CellBC: 1
 # bases below phred in UMI:    1
 Hamming Distance (UMI):    0
 Hamming Distance (CellBC): 0
 Plate Barcode Read:        NA
 Plate Barcode range:       NA
 Barcodes:          $e/my_zUMIs/barcode_samples.txt
 zUMIs directory:       zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     /home/me/R-3.4.3/bin/R
 Additional STAR parameters:    
 STRT-seq data:         no
 InDrops data:          no
 Library read for InDrops:  NA
 Barcode read2(STRT-seq):   NA
 Barcode read2 range(STRT-seq): 0-0
 Bases(G) to trim(STRT-seq):    3
 Subsampling reads:     0 

 zUMIs version 0.0.6c 

ARGUMENT 'zUMIs/zUMIs-dge.R' __ignored__

WARNING: unknown option '--gtf'

ARGUMENT 'my.gtf' __ignored__

WARNING: unknown option '--abam'

ARGUMENT 'out/project.aligned.sorted.bam' __ignored__

WARNING: unknown option '--ubam'

ARGUMENT 'out/project.barcodelist.filtered.sort.sam' __ignored__

WARNING: unknown option '--barcodefile'

ARGUMENT 'barcode_samples.txt' __ignored__

WARNING: unknown option '--out'

ARGUMENT 'out' __ignored__

WARNING: unknown option '--sn'

ARGUMENT 'project' __ignored__

WARNING: unknown option '--cores'

ARGUMENT '8' __ignored__

WARNING: unknown option '--strandedness'

ARGUMENT '0' __ignored__

WARNING: unknown option '--bcstart'

ARGUMENT '1' __ignored__

WARNING: unknown option '--bcend'

ARGUMENT '6' __ignored__

WARNING: unknown option '--umistart'

ARGUMENT '1' __ignored__

WARNING: unknown option '--umiend'

ARGUMENT '8' __ignored__

WARNING: unknown option '--subsamp'

ARGUMENT '0' __ignored__

WARNING: unknown option '--nReadsBC'

ARGUMENT '100' __ignored__

WARNING: unknown option '--hamming'

ARGUMENT '0' __ignored__

WARNING: unknown option '--XCbin'

ARGUMENT '0' __ignored__

R version 3.4.3 (2017-11-30) -- "Kite-Eating Tree"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

Do you have any idea? Please guide. Thanks.

cziegenhain commented 6 years ago

Oh no, I think you need to give the path to the Rscript executable, like this: /home/me/R-3.4.3/bin/Rscript

Otherwise it will just start an interactive R instance.
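
As a sketch of how the right path could be derived rather than typed by hand: resolve_rscript is a hypothetical helper that accepts the R bin directory, the R binary, or Rscript itself and prints the Rscript next to it. It assumes Rscript sits in the same bin directory as R, which is the usual layout of an R installation but is not checked against zUMIs itself.

```shell
# resolve_rscript PATH  (hypothetical helper, not part of zUMIs)
# Accepts a bin directory, .../R, or .../Rscript and prints the Rscript path.
resolve_rscript() {
  local given=$1 dir candidate
  if [ -d "$given" ]; then
    dir=$given
  else
    dir=$(dirname "$given")
  fi
  # Assumption: Rscript is installed alongside R in the same directory.
  candidate=$dir/Rscript
  if [ -f "$candidate" ] && [ -x "$candidate" ]; then
    printf '%s\n' "$candidate"
  else
    echo "error: no executable Rscript found in $dir" >&2
    return 1
  fi
}

# Usage (placeholder path):
#   rscript=$(resolve_rscript /home/me/R-3.4.3/bin/R) || exit 1
```

The resulting path is what would be passed to -V in the command from this thread.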

gk-bioin4m8x commented 6 years ago

Ok, I will restart from -w Counting. Thanks.

MartaBenegas commented 2 years ago

@sdparekh following your first comment on this issue, could you provide an example of the -b option? It would be great if you could provide an example file as well (e.g. is it one sample barcode per line?)

$ zumis -h
/usr/bin/zUMIs-2.9.7/zumis: line 7: curl: command not found
------------- 

 Good news! A newer version of zUMIs is available at https://github.com/sdparekh/zUMIs 

-------------

  USAGE: /usr/bin/zUMIs-2.9.7/zumis [options]
    -h  Print the usage info.

## Required parameters ##

    -y  <YAML config file> : Path to the YAML config file. Required.

## Program path ##
    -d  <zUMIs-dir>      : Directory containing zUMIs scripts.  Default: path to this script.

## Miniconda environment

  -c : Use zUMIs dependencies in the preinstalled conda enviroment.

zUMIs version 2.9.7

Thanks!

cziegenhain commented 2 years ago

Hi Marta,

The issue you are replying to is from 2018 and the information is probably not so relevant any longer. You can check the documentation in the wiki to find out how to set up your run parameters:

https://github.com/sdparekh/zUMIs/wiki

MartaBenegas commented 2 years ago

Dear @cziegenhain,

thanks for your quick answer!

I read the documentation but I haven't found (or understood) how to set up the parameters for my analysis; that's how I ended up in this issue.

I have two 10x samples (g001 and g002), each with two fastq files for read 1 and read 2, and two fastq files for sample index 1 and 2: [image]

So I have four files per sample. This configuration is called "Dual Index". So, instead of having 4 barcodes per sample as in the wiki example, I only have two (one forward and one reverse). And these are not present in the reads; they are in the separate fastq files.

Thus, I was going to merge g001-R1 and g002-R1 into one file (and the same for R2), but then I'm not really sure how to specify the sample indexes.

Any help would be appreciated!

My files:

marta@cyanobacteria:/data/merged_fastq$ head UMH-MO-g001_S2_L001_R1_001.fastq 
@ST-E00129:1195:HHMK3CCX2:1:1101:1773:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NCCTTAGAGATGTTAGCTGCTCACTCAG
+
#AAFAJJFFF7FJFJJFJFJ<JJJAJA-
@ST-E00129:1195:HHMK3CCX2:1:1101:1813:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NGCTTGGTCTAAGCGTGTATCGTCGATC
+
#AAAA<FFJJJFFJFJJJJJJAJF7F<F
@ST-E00129:1195:HHMK3CCX2:1:1101:1834:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NAGCCAGAGGGCCTCTCGTCCTAAAAAT
marta@cyanobacteria:/data/merged_fastq$ head UMH-MO-g001_S2_L001_R2_001.fastq 
@ST-E00129:1195:HHMK3CCX2:1:1101:1773:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NAGCAACTGGCTCTGGCCCTGGCGGAGAAGTACCGCTAAACTGGAGATAAGCTACTAAACTGTCATCCGAGCATCAAGCCCTCACAGTAT
+
#A---F<F-<--77--A7<JAAJ--7-AJJJJFFAA7JA7AFF-F--<-<-<<-<<7JJ<---7A<A7---7-<-7A-7---7-7A---7
@ST-E00129:1195:HHMK3CCX2:1:1101:1813:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NGAGGATTAAACCCCAGAATTTCACCTGTCCGCGGACACTTTCCTGAAGCAACTGACATTAGCCGTCGAGGAAAAATACAGCTAAAAAGA
+
#----A----<<-FF-----<<-<--7--7-------7--<<--F--<<-<F-<-----7-----7---77F7--77FA----<FJ----
@ST-E00129:1195:HHMK3CCX2:1:1101:1834:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NCACCATGAAAGTCCATCATTGGACTCCAGTTCCTGCTCTGTTGTTATTACAATAAAATAAACAGGCAATGAATGATAGAAAAAAAAAAA
marta@cyanobacteria:/data/merged_fastq$ zcat UMH-MO-g001_S2_L001_I1_001.fastq.gz | head
@ST-E00129:1195:HHMK3CCX2:1:1101:1773:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NATCAGCCTA
+
#-A--A<<A-
@ST-E00129:1195:HHMK3CCX2:1:1101:1813:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NATCAGCCTA
+
#AA-FAFJJJ
@ST-E00129:1195:HHMK3CCX2:1:1101:1834:1397 1:N:0:NATCAGCCTA+NGGACGAAAC
NATCAGCCTA
marta@cyanobacteria:/data/merged_fastq$ zcat UMH-MO-g001_S2_L001_I2_001.fastq.gz | head
@ST-E00129:1195:HHMK3CCX2:1:1101:1773:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NGGACGAAAC
+
#A<AF<FJJJ
@ST-E00129:1195:HHMK3CCX2:1:1101:1813:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NGGACGAAAC
+
#AAFFJJJJJ
@ST-E00129:1195:HHMK3CCX2:1:1101:1834:1397 2:N:0:NATCAGCCTA+NGGACGAAAC
NGGACGAAAC

cziegenhain commented 2 years ago

Hi,

Yes you can just concatenate your fastq files for each R1, R2, I1, I2 (as long as you are sure that the same barcodes did not get reused in the 2nd library).

Best, Christoph
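
If "barcodes" in the reply above is read as the sample index pair recorded at the end of each read header (an interpretation; the thread does not say so explicitly), the pairs carried by each library can be tallied before concatenating. This is a sketch under that assumption: list_index_pairs is a hypothetical helper, and it expects gzipped FASTQ with Illumina-style headers like the ones posted above.

```shell
# list_index_pairs FASTQ_GZ  (hypothetical helper, not part of zUMIs)
# Prints "<read count> <index field>" for each distinct index field found in
# the header lines, most frequent first.
list_index_pairs() {
  gzip -dc "$1" |
    awk 'NR % 4 == 1 {              # header line of each 4-line FASTQ record
           n = split($NF, f, ":")   # last header field, e.g. 1:N:0:INDEX1+INDEX2
           count[f[n]]++
         }
         END { for (k in count) print count[k], k }' |
    sort -k1,1nr -k2,2
}

# Usage (file name from this thread):
#   list_index_pairs UMH-MO-g001_S2_L001_I1_001.fastq.gz | head
```

Running it on one file per library and comparing the top entries shows whether the two libraries share an index pair. Entries that differ only by an N, as in the first reads shown above, are counted separately.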