sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 67 forks source link

Implement zUMI for dual index protocol #20

Closed aemoor closed 6 years ago

aemoor commented 6 years ago

Hi, I have scRNAseq data with cell barcode and UMI in the Barcode read and and an additional plate barcode at the end of the cDNA read. I can extract the plate barcode from the cDNA read in a preprocessing step and use the STRT dual index input mode. I would like to get the results separately for each plate index. Can I use zUMI for processing such samples with these requirements or should I create a custom pipeline?

Thanks

cziegenhain commented 6 years ago

Hey,

thanks for your question! Unfortunately, there is no elegant solution, as in a mode for generating the output per plate barcode yet. But I think it's a good idea and we could implement it in the future!

As for now, I usually use a small perl script to append our plate barcodes (Illumina i7-Index) as a prefix to our cell barcode and just use it as a longer cell barcode (ie i7+cellBC). We can upload that script for you if you'd like, it's pretty fast! Doing this and if you know your barcode, you could then run the counting & stats steps of zUMIs (use '-w Counting' as an option) giving one set of barcodes (plate1BC+all cell barcodes, plate2BC+all cell barcodes, etc...) at a time. That'd at least save you from running the preprocessing and mapping steps several times.

Thanks for your input and let me know if we can assist further.

Best, Christoph

aemoor commented 6 years ago

Hi Christoph,

Thanks for your quick response. For now I will use Bcl2Fastq for splitting the plates based on their plate-barcode and will then run zUMI on each of them in parallel.

I tried to run your example data on my LSF cluster and ran into the following error (that I could not interpret so far), do you recognize by change what went wrong?

Thanks, Andreas

Your jobs will run on this machine.

Make sure you have more than 2.2G RAM and 8 processors available.

Your jobs will be started from filtering.

 You provided these parameters:
 SLURM workload manager:        no
 Summary Stats to produce:      yes
 Start the pipeline from:       filtering
 A custom mapped BAM:           NA
 Custom filtered FASTQ:         no
 Barcode read:                  barcoderead_HEK.1mio.fq.gz
 cDNA read:                     cDNAread_HEK.1mio.fq.gz
 Study/sample name:             test3
 Output directory:              /home/labs/shalev/amoor5/temp/zumi_test
 Cell/sample barcode range:     1-6
 UMI barcode range:             7-16
 Retain cell with >=N reads:    100
 Genome directory:              reference/hg38_chr22/
 GTF annotation file:           reference/GRCh38.84.chr22.gtf
 Number of processors:          8
 Read length:                   50
 Strandedness:                  0
 Cell barcode Phred:            20
 UMI barcode Phred:             20
 # bases below phred in CellBC: 1
 # bases below phred in UMI:    1
 Barcodes:                      NA
 zUMIs directory:               /apps/RH7U2/general/zUMIs/2017-12-24/
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 Additional STAR parameters:
 STRT-seq data:                 no
 InDrops data:                  no
 Library read for InDrops:      NA
 Barcode read2(STRT-seq):       NA
 Barcode read2 range(STRT-seq): 0-0
 Bases(G) to trim(STRT-seq):    3
 Subsampling reads:             0

Raw reads: 1000000
Filtered reads: 853296

Jan 28 14:53:52 ..... started STAR run
Jan 28 14:53:52 ..... loading genome
Jan 28 14:53:55 ..... processing annotations GTF
Jan 28 14:53:55 ..... inserting junctions into the genome indices
Jan 28 14:54:08 ..... started 1st pass mapping
Jan 28 14:55:53 ..... finished 1st pass mapping
Jan 28 14:55:54 ..... inserting junctions into the genome indices
Jan 28 14:56:08 ..... started mapping
Jan 28 14:57:51 ..... finished successfully
Loading required package: optparse
[1] "I am loading useful packages..."
[1] "2018-01-28 14:58:08 IST"
[1] "I am loading Rsubread 1.26.0..."
[1] "I am making annotations in SAF... This will take less than 3 minutes..."
[1] "2018-01-28 14:58:22 IST"
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
[1] "I am making count tables...This will take a while!!"
[1] "2018-01-28 14:58:30 IST"

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.26.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/labs/shalev/amoor5/temp/zumi_test/te ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid2475 ...        ||
||    Features : 5737                                                         ||
||    Meta-features : 698                                                     ||
||    Chromosomes/contigs : 3                                                 ||
||                                                                            ||
|| Process BAM file /home/labs/shalev/amoor5/temp/zumi_test/test3.aligned ... ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 853296                                                    ||
||    Successfully assigned reads : 46553 (5.5%)                              ||
||    Running time : 0.04 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.26.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/labs/shalev/amoor5/temp/zumi_test/te ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid2475 ...        ||
||    Features : 26131                                                        ||
||    Meta-features : 1190                                                    ||
||    Chromosomes/contigs : 4                                                 ||
||                                                                            ||
|| Process BAM file /home/labs/shalev/amoor5/temp/zumi_test/test3.aligned ... ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 853296                                                    ||
||    Successfully assigned reads : 21890 (2.6%)                              ||
||    Running time : 0.03 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Error in if (sigmasq > signif(.Machine$double.xmax, 6)) { :
  missing value where TRUE/FALSE needed
Calls: makeGEprofile ... eval -> mclustBIC -> mstep -> eval -> eval -> mstepE
Execution halted
[1] "I am loading useful packages for plotting..."
[1] "2018-01-28 14:58:46 IST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/labs/shalev/amoor5/temp/zumi_test/zUMIs_output/expression/test3.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

Based on an earlier error report I looked at the STAR alignment stats for my test run, they look exactly like your reported ones:

test3.Log.final.out
                                 Started job on |   Jan 28 14:53:52
                             Started mapping on |   Jan 28 14:56:08
                                    Finished on |   Jan 28 14:57:51
       Mapping speed, Million of reads per hour |   29.82

                          Number of input reads |   853296
                      Average input read length |   50
                                    UNIQUE READS:
                   Uniquely mapped reads number |   130719
                        Uniquely mapped reads % |   15.32%
                          Average mapped length |   48.41
                       Number of splices: Total |   8929
            Number of splices: Annotated (sjdb) |   4497
                       Number of splices: GT/AG |   8244
                       Number of splices: GC/AG |   185
                       Number of splices: AT/AC |   3
               Number of splices: Non-canonical |   497
                      Mismatch rate per base, % |   5.04%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.42
                        Insertion rate per base |   0.10%
                       Insertion average length |   2.07
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   78370
             % of reads mapped to multiple loci |   9.18%
        Number of reads mapped to too many loci |   83
             % of reads mapped to too many loci |   0.01%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   75.43%
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%
cziegenhain commented 6 years ago

Hey Andreas,

yes splitting with Bcl2Fastq and parallel runs would be a good workaround!

The error seems to be the barcode selection step. We've heard of issues with that before, so please check the following:

Thanks for using zUMIs!

aemoor commented 6 years ago

Thanks again for the quick response. You're right, it seems to happen in the barcode selection. For my real data I know the barcodes, so I submitted the -b flag with a list of barcodes for your example data. My R is 3.4.3 and we installed the last zUMI release. I now only get the following error (see beneath). Is this something you know?

Your jobs will be started from filtering.

 You provided these parameters:
 SLURM workload manager:        no
 Summary Stats to produce:      yes
 Start the pipeline from:       filtering
 A custom mapped BAM:           NA
 Custom filtered FASTQ:         no
 Barcode read:                  barcoderead_HEK.1mio.fq.gz
 cDNA read:                     cDNAread_HEK.1mio.fq.gz
 Study/sample name:             test3
 Output directory:              /home/labs/shalev/amoor5/temp/zumi_test
 Cell/sample barcode range:     1-6
 UMI barcode range:             7-16
 Retain cell with >=N reads:    100
 Genome directory:              reference/hg38_chr22/
 GTF annotation file:           reference/GRCh38.84.chr22.gtf
 Number of processors:          8
 Read length:                   50
 Strandedness:                  0
 Cell barcode Phred:            20
 UMI barcode Phred:             20
 # bases below phred in CellBC: 1
 # bases below phred in UMI:    1
 Barcodes:                      barcodefile.txt
 zUMIs directory:               /apps/RH7U2/general/zUMIs/2017-12-24/
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 Additional STAR parameters:
 STRT-seq data:                 no
 InDrops data:                  no
 Library read for InDrops:      NA
 Barcode read2(STRT-seq):       NA
 Barcode read2 range(STRT-seq): 0-0
 Bases(G) to trim(STRT-seq):    3
 Subsampling reads:             0

Raw reads: 1000000
Filtered reads: 853296

Jan 28 16:26:37 ..... started STAR run
Jan 28 16:26:37 ..... loading genome
Jan 28 16:26:59 ..... processing annotations GTF
Jan 28 16:27:00 ..... inserting junctions into the genome indices
Jan 28 16:27:15 ..... started 1st pass mapping
Jan 28 16:29:13 ..... finished 1st pass mapping
Jan 28 16:29:13 ..... inserting junctions into the genome indices
Jan 28 16:29:26 ..... started mapping
Jan 28 16:31:11 ..... finished successfully
Loading required package: optparse
[1] "I am loading useful packages..."
[1] "2018-01-28 16:31:34 IST"
[1] "I am loading Rsubread 1.26.0..."
[1] "The barcodes file does not exist!"
[1] "I am loading useful packages for plotting..."
[1] "2018-01-28 16:31:51 IST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/labs/shalev/amoor5/temp/zumi_test/zUMIs_output/expression/test3.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
cziegenhain commented 6 years ago

Cool, your R version is the newest so should be good. Here it says that barcodefile.txt could not be found or opened, please check the path! If the path was correct, this seems to confirm that it's not really a problem of barcode selection but permissions on your server.

aemoor commented 6 years ago

Ok, thanks, I missed this line, it was a simple path issue. Now the pipeline works nicely but I have yet another small questions:

There is some gene filtering in the final output (each sample has different numbers of genes in the gene to UMI table). Can I switch this off / include all genes of the index in the table? This will make the merging of my plates easier.

Great work, really appreciate zUMI!

cziegenhain commented 6 years ago

Great, glad to know that its running for you now! Currently, genes are not filtered but we omit those without any counts in any of the cells.

When merging several samples, I like to use plyr::join_all, but you need to write the GeneIDs from row.names into one column. plyr::join_all(list(df1,df2,df3), by='GENE_ID', type="full")

Since zUMIs is running for you, I'm closing this issue but feel free to reopen or contact us for further help.

Best, Christoph

aemoor commented 6 years ago

Hi Christoph,

Thanks for your help. I'm running zUMI successfully on our existing pipeline, now I wanted to test it for your mcSCRB-seq protocol. I get the following error that I don't understand:

------------------------------------------------------------
# LSBATCH: User input
bash zUMIs-master.sh -f /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/100pg_A_S1_R1_001.fastq.gz -r /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/100pg_A_S1_R2_001.fastq.gz -c 1-6 -m 7-16 -l 48 -g /home/labs/shalev/NGS/indexes/GRCm38.84_STAR_zumi/ -b /home/labs/shalev/NGS/indexes/marsseq_cell_barcodes.txt -a /home/labs/shalev/NGS/indexes/GRCm38.84/Mus_musculus.GRCm38.84.gtf -n 100pg_a -p 16 -o /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/zUMI -i /apps/RH7U2/general/zUMIs/2017-12-24/
------------------------------------------------------------

Successfully completed.

Resource usage summary:

   CPU time :                                   1303.00 sec.
   Max Memory :                                 31187 MB
   Average Memory :                             14433.36 MB
   Total Requested Memory :                     80000.00 MB
   Delta Memory :                               48813.00 MB
   Max Swap :                                   2 MB
   Max Processes :                              7
   Max Threads :                                38
   Run time :                                   574 sec.
   Turnaround time :                            575 sec.

The output (if any) follows:

Your jobs will run on this machine. 

Make sure you have more than 30G RAM and 16 processors available. 

Your jobs will be started from filtering. 

You provided these parameters:
SLURM workload manager: no
Summary Stats to produce:   yes
Start the pipeline from:    filtering
A custom mapped BAM:        NA
Custom filtered FASTQ:      no
Barcode read:           /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/100pg_A_S1_R1_001.fastq.gz
cDNA read:          /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/100pg_A_S1_R2_001.fastq.gz
Study/sample name:      100pg_a
Output directory:       /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/zUMI
Cell/sample barcode range:  1-6
UMI barcode range:      7-16
Retain cell with >=N reads: 100
Genome directory:       /home/labs/shalev/NGS/indexes/GRCm38.84_STAR_zumi/
GTF annotation file:        /home/labs/shalev/NGS/indexes/GRCm38.84/Mus_musculus.GRCm38.84.gtf
Number of processors:       16
Read length:            48
Strandedness:           0
Cell barcode Phred:     20
UMI barcode Phred:      20
# bases below phred in CellBC:  1
# bases below phred in UMI: 1
Barcodes:           /home/labs/shalev/NGS/indexes/marsseq_cell_barcodes.txt
zUMIs directory:        /apps/RH7U2/general/zUMIs/2017-12-24/
STAR executable     STAR
samtools executable     samtools
pigz executable     pigz
Rscript executable      Rscript
Additional STAR parameters: 
STRT-seq data:          no
InDrops data:           no
Library read for InDrops:   NA
Barcode read2(STRT-seq):    NA
Barcode read2 range(STRT-seq):  0-0
Bases(G) to trim(STRT-seq): 3
Subsampling reads:      0 

Raw reads: 4517619 
Filtered reads: 3017862 

Feb 07 14:46:32 ..... started STAR run
Feb 07 14:46:33 ..... loading genome
Feb 07 14:47:59 ..... processing annotations GTF
Feb 07 14:48:11 ..... inserting junctions into the genome indices
Feb 07 14:49:58 ..... started 1st pass mapping
Feb 07 14:50:22 ..... finished 1st pass mapping
Feb 07 14:50:22 ..... inserting junctions into the genome indices
Feb 07 14:51:40 ..... started mapping
Feb 07 14:52:11 ..... finished successfully
Loading required package: optparse
[1] "I am loading useful packages..."
[1] "2018-02-07 14:52:49 IST"
[1] "I am loading Rsubread 1.26.0..."
[1] "I am making annotations in SAF... This will take less than 3 minutes..."
[1] "2018-02-07 14:53:07 IST"
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
 The "phase" metadata column contains non-NA values for features of type
 stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
[1] "I am making count tables...This will take a while!!"
[1] "2018-02-07 14:54:06 IST"

       ==========     _____ _    _ ____  _____  ______          _____  
       =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
         =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
           ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
             ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
       ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      Rsubread 1.26.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/labs/shalev/NGS/Sequencing_Runs/1802 ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid39627 ...       ||
||    Features : 225068                                                       ||
||    Meta-features : 25257                                                   ||
||    Chromosomes/contigs : 38                                                ||
||                                                                            ||
|| Process BAM file /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168 ... ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 3017862                                                   ||
||    Successfully assigned reads : 306003 (10.1%)                            ||
||    Running time : 0.08 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

       ==========     _____ _    _ ____  _____  ______          _____  
       =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
         =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
           ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
             ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
       ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      Rsubread 1.26.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/labs/shalev/NGS/Sequencing_Runs/1802 ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid39627 ...       ||
||    Features : 710016                                                       ||
||    Meta-features : 42143                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file /home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168 ... ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 3017862                                                   ||
||    Successfully assigned reads : 478574 (15.9%)                            ||
||    Running time : 0.08 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Error in if (MAD_low < 0) { : missing value where TRUE/FALSE needed
Calls: makeGEprofile
Execution halted
[1] "I am loading useful packages for plotting..."
[1] "2018-02-07 14:54:31 IST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
 cannot open compressed file '/home/labs/shalev/NGS/Sequencing_Runs/180206_NB551168_0072_AHJCHVBGX5/mcSCRB_test/zUMI/zUMIs_output/expression/100pg_a.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

I'm running the protocol on a single bulk sample (single cell barcode provided), I thought that this might cause issues, hence I added some dummy cell barcodes but I got twice the same error. Thanks Andreas

cziegenhain commented 6 years ago

Oh, yes I have indeed not really checked up on what happens if you have this few samples in the input. What breaks here is actually the adaptive downsampling function. Can you try to set it to something arbitrary, eg 100,000 reads? You'll still get the full output as usual anyway. -d 100000

Alternatively, could you please check what happens if you leave out barcodes -b in general?

Just a tip: if you want to restart this instance from the Counting without rerunning from the beginning, just supply -w Counting.

Great to hear that you are also giving the mcSCRB-seq protocol a try, I'd be more than happy to get feedback on that too. You're very welcome to mail me with comments, questions or experiences.

Thanks, Christoph

skannan4 commented 6 years ago

Hey Christoph, just wanted to follow up on this now that the new zUMIs version has support for plate indexing. In my case I have a list of known cell barcodes (96) and plate barcodes (8) - so if I want to pass a barcode list into zUMIs, should I include the combination of 768 barcodes, or just the cell barcodes? I tried the latter but ended up getting a counts table with all cell barcodes coalesced and the plate barcodes effectively ignored.

cziegenhain commented 6 years ago

Hey,

yes that's a very good question we should address in the wiki for usage. You need to pass all possible barcode combinations. The expected order is plateBC followed by cellBC. E.g. if your plate barcode is CGTACTAG and your cell barcode is ACTATT, give the following BC string: CGTACTAGACTATT

Hope that helps.

Best, Christoph

skannan4 commented 6 years ago

Thanks. To clarify - is this barcode list only used in the counting step or is it used earlier (for example to generate the filtered.barcodes.sam file)? Meaning - suppose I have completed through the mapping step using -b barcodes.txt where this file only gave the cellBCs - would I need to restart the workflow from the top or can I jump straight to the counting step with the updated barcode list?

cziegenhain commented 6 years ago

The barcode list is only used in the counting step, so just resume with -w Counting!

skannan4 commented 6 years ago

So I start with Counting, identical to what I did providing the cell barcodes, but now providing all of the possible barcode combinations. The subread feature counts works fine but then when it gets to subsampling and making the sparse matrix, I get this error: assignment of an object of class “numeric” is not valid for @‘Dim’ in an object of class “dgTMatrix”; is(value, "integer") is not TRUE Any ideas on this one?

As another note, should the filtered.barcodes.sam file contain information from the plate barcode reads? Checking this file gives me sequences of length of 16, which is the length of the cell barcode + UMI.

EDIT: I'm assuming that's the issue; fqfilter.pl should add the plate barcode info to the filtered.barcodes.sam file but that didn't happen for whatever reason. Probably threw an error later because none of the barcodes in my list would have been in the sam file. I'll investigate later to figure out why the plate barcodes didn't append but for the time being I'll just interleave the plate and cell barcode reads like you suggested above.

EDIT2: Ok, I figured it out. If you resume from the "mapping" stage with custom Fastq, zUMIs-master.sh ends up calling fqcheck.pl, which does not have the plate barcode support (unlike fqfilter.pl). I'll just tweak my way around it!

cziegenhain commented 6 years ago

In that case, I'd recommend to git clone zUMIs with the newest version again into your system to make sure your fqfilter.pl and zUMIs-dge.R are in sync!

skannan4 commented 6 years ago

Sorry if my edits weren't visible above - ended up being that at some point, I resumed from mapping, and zUMIs-master.sh called fqcheck,pl, which ended up overwriting my filtered.barcodes.sam file but without the plate barcode. I just started again from the top, should hopefully be fine this time!