minoda-lab / universc

UniverSC: a flexible cross-platform single-cell data processing pipeline

https://genomec.gsc.riken.jp/gerg/UniverSC/UniverSC_app_release/

GNU General Public License v3.0

Error with splitseq #10

Closed ayyildizd closed 1 year ago

ayyildizd commented 1 year ago

Hi!

I am trying to run your tool with splitseq technology; however, cellranger does not run properly.

First, I did not have I1 and I2 so I followed the guideline on your main page and created dummy I1 and I2 files using the first two indexes in whitelists/split-seq_round1_barcode.txt.

Then I ran the pipeline with the technology set to 'splitseq'.

My log file looks like this:


script running in {toolsdir}/universc/launch_universc.sh...
... script called from {basedir}
Running launch_universc.sh in '{basedir}'
/usr/bin/which: no launch_universc.sh in ({toolsdir}/universc/launch_universc.sh:{toolsdir}/cellranger-3.0.2:/{homedir}/.local/bin:/{homedir}/bin:/gpfs/admin/hpc/sw/hpc/bin:/gpfs/admin/hpc/sw/hpc/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
UniverSC Copyright (C) 2019 Tom Kelly; Kai Battenberg
This program comes with ABSOLUTELY NO WARRANTY; for details type 'cat LICENSE'. This is free software, and you are welcome to redistribute it under certain conditions; type 'cat LICENSE' for details.
Cell Ranger is called as third-party dependency and is not maintained by this project. Please ensure you comply with the End User License Agreement for all software installed where applicable; for details type 'cat README.md'.
    {basedir}/SRR16248559_S1_L001_R1_001.fastq file found
    {basedir}/SRR16248559_S1_L001_R2_001.fastq file found
    {basedir}/SRR16248559_S1_L001_I1_001.fastq file found
***WARNING: technology is set to splitseq. barcodes on Read 2 will be used***
Using 10x version 2 chemistry to support UMIs
***WARNING: conversion was turned on because directory input4cellranger_test_SRR16248559 was not found***
 checking if UniverSC is running already
  creating .lock file

#####Input information#####
SETUP and exit: false
FORMAT: splitseq
BARCODES: {basedir}whitelists/splitseq_barcode.txt
INPUT(R1):
 {basedir}/SRR16248559_S1_L001_R2_001.fastq
INPUT(R2):
 {basedir}/SRR16248559_S1_L001_R1_001.fastq
SAMPLE: SRR16248559
LANE: 1
ID: test_SRR16248559
DESCRIPTION: test_SRR16248559
***WARNING: no description given, setting to ID value***
REFERENCE: {genomedir}/reference_cellranger_3.0.2/refdata-cellranger-hg19-3.0.0
NCELLS: (no cell number given)
CHEMISTRY: SC3Pv2
JOBMODE: local
***WARNING: --jobmode "sge" is recommended if running script with qsub***
CONVERSION: true
##########

whitelist setup begin
updating barcodes in {toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/barcodes for Cell Ranger version 3.0.2 installed in {toolsdir}/cellranger-3.0.2/cellranger ...
 restoring Cell Ranger
sed: can't read {toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/check.py: No such file or directory
 {toolsdir}/cellranger-3.0.2/cellranger set for splitseq
 converting whitelist
barcode adjust: 0
 whitelist converted
verbose 
setup complete
running in local mode (no cluster configuration needed)
creating a folder for all Cell Ranger input files ...
 directory input4cellranger_test_SRR16248559 created for converted files
moving file to new location
 handling {basedir}/SRR16248559_S1_L001_R2_001.fastq ...
 handling {basedir}/SRR16248559_S1_L001_R1_001.fastq ...
handling {basedir}/SRR16248559_S1_L001_I1_001.fastq ...
converting input files to confer cellranger format ...
 adjustment parameters:
  barcodes: 0 bp at its head
  UMIs: 0 bp at its tail
 making technology-specific modifications ...
  ... remove adapter and phase blocks for splitseq
sed: -e expression #1, char 381: unknown option to `s'
 adjusting barcodes of R1 files
 adjusting UMIs of R1 files
running Cell Ranger ...

#####Cell Ranger command#####
cellranger count --id=test_SRR16248559\
        --fastqs=input4cellranger_test_SRR16248559\
        --lanes=1\
        --r1-length=34\
        --chemistry=SC3Pv2\
        --transcriptome={genomedir}/reference_cellranger_3.0.2/refdata-cellranger-hg19-3.0.0\
        --sample=SRR16248559\
        --description=test_SRR16248559\
        \
        --jobmode=local\
        \

##########
{toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/bin
cellranger count (3.0.2)
Copyright (c) 2019 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - '3.0.2-v3.2.0'
Serving UI at http://tcn194.local.snellius.surf.nl:34321?auth=ubrUG8xesSr8li0v-IDtLp4OYL4xnWqKFIPVkbW4AaE

Running preflight checks (please wait)...
2023-02-22 14:52:51 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF
2023-02-22 14:52:51 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF.fork0.chnk0.main
2023-02-22 14:52:57 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF
Checking sample info...
Checking FASTQ folder...
Checking reference...
Checking reference_path ({genomedir}/reference_cellranger_3.0.2/refdata-cellranger-hg19-3.0.0) on tcn194.local.snellius.surf.nl...
Checking chemistry...
Checking read 1 length...
Checking optional arguments...
mrc: '3.0.2-v3.2.0'

mrp: '3.0.2-v3.2.0'

Anaconda: Python 2.7.14 :: Anaconda, Inc.

numpy: 1.14.2

scipy: 1.0.1

pysam: 0.14.1

h5py: 2.8.0

pandas: 0.22.0

STAR: STAR_2.5.1b

samtools: samtools 1.7
Using htslib 1.7
Copyright (C) 2018 Genome Research Ltd.

2023-02-22 14:52:58 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES
2023-02-22 14:52:58 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES.fork0.chnk0.main
2023-02-22 14:52:58 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION
2023-02-22 14:52:58 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION.fork0.chnk0.main
2023-02-22 14:52:58 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY
2023-02-22 14:52:58 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0.main
2023-02-22 14:52:58 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES
2023-02-22 14:52:58 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION
2023-02-22 14:52:59 [runtime] (failed)          ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY

[error] You selected chemistry 'SC3Pv2', which expects the cell barcode sequence in read R1.
In input data, an extremely low rate of correct barcodes was observed for this chemistry (0.00 %).
Please check your input data and chemistry selection. Note: manual chemistry detection is not required in most cases.
Input: {'lanes': [u'1'], 'sample_names': [u'SRR16248559'], 'sample_indices': None, 'fastq_mode': u'ILMN_BCL2FASTQ', 'read_path': u'{basedir}/input4cellranger_test_SRR16248559', 'interleaved': False}

2023-02-22 14:52:59 Shutting down.
Waiting 6 seconds for UI to do final refresh.
Saving pipestance info to test_SRR16248559/test_SRR16248559.mri.tgz
For assistance, upload this file to 10x Genomics by running:

cellranger upload <your_email> test_SRR16248559/test_SRR16248559.mri.tgz

cellranger run complete
***Notice: Cloupe file cannot be computed for splitseq
           Cloupe files generated by this pipeline are corrupt
           and cannot be read by the 10x Genomics Loupe Browser.
           We do not provide support for Cloupe files as this
           requires software from 10x Genomics subject to their
           End User License Agreement.
           Cloupe files are disabled in compliance with this.
updating .lock file
 no other jobs currently run by cellranger 3.0.2 in {toolsdir}/cellranger-3.0.2/cellranger
 no conflicts: whitelist can now be changed for other technologies
replacing modified barcodes with the original in the output gene barcode matrix
Can't open perl script "{basedir}/sub/RecoverBarcodes.pl": No such file or directory
barcodes recovered

#####Conversion tool log#####
cellranger 3.0.2

Original barcode format: splitseq (then converted to 10x)

cellranger runtime: 12s
##########

TomKellyGenetics commented 1 year ago

Thanks for reporting this issue. It is a national holiday in Japan so I may take a few days to get back to you but I will investigate this. User feedback really helps us to troubleshoot these issues to support a wide range of technologies.

In the meantime, can you please provide some more information to help us resolve the problem in the source code? Specifically, are you using the latest version v1.2.5.1 or an older one? A minimal example of your input files would help to test the update before releasing it. For example, the first 20 lines of each fastq file will be sufficient.

The logs really help to narrow it down already.

  ... remove adapter and phase blocks for splitseq
sed: -e expression #1, char 381: unknown option to `s'

Based on this error message, Kai was correct to ask me to handle this. I suspect it is a problem specific to this technology with this code that I am responsible for.

https://github.com/minoda-lab/universc/blob/c2d0c8848a49babfc5c465f27af388e9ccd2751a/launch_universc.sh#L3453-L3478

If you’ve not tried already, please pull the latest version from GitHub and try running this code. If there is still an error with it, I will need to update the source code. In this case I expect it is a minor syntax error introduced in recent updates so I should be able to fix it within a few days, I’ll update this thread if I manage to reproduce the error and test a solution in the development version.

TomKellyGenetics commented 1 year ago

Note that the latest release may solve this already, as I found a bug in version 1.2.4 (17 Sept 2022) affecting this section. https://github.com/minoda-lab/universc/commit/88e20dd547cf3f6fbd2f25982f8bb92b7332ec10

Sorry for the inconvenience caused by this, but it may be an issue I am already aware of. Please try updating to version 1.2.5.1 (18 Jan 2023) and close this issue if it works. If you installed UniverSC during the above period, you may be affected by a syntax error that’s now been resolved.

ayyildizd commented 1 year ago

Thanks for the quick response and suggestions.

I am already using universcversion="1.2.5.1"

The first 20 lines of fastq files:

@SRR16248559.1 1 length=66
CTACANAACTCTCCACCTGAAATCAACAGAATATACATTCTTCTCAGCACCACGTCGCATTTATTC
+SRR16248559.1 1 length=66
AAAAA#EEEAEA/AEE/E/6EEEAEEEE/EE/EEEEEEEEEEEEEA//AAA////A/A</E<EEEE
@SRR16248559.2 2 length=66
TCTTGNAACACGGACCAAGGAGTCTAACACGTGCGCGAGTCGGGGGCTCGCACGAAAGCCGCCGTG
+SRR16248559.2 2 length=66
AAA/A#AEEEEAEAEEEE/EEE/E/EEEEAEAE/EEEAEEEEEEEEEEEEEEEE/EEEEAAEEEEE
@SRR16248559.3 3 length=66
CATCTNTAATCTCACTTCGTCTTTATAACCACCTGGAAGACTAGGAGTTACTACTCCCATTTTATA
+SRR16248559.3 3 length=66
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAE6
@SRR16248559.4 4 length=66
GGCTGNGGGCTTTCAGACGGAACGCAAGTGGTCAGAGGATGAAAAATGAGTTTTCTGATTGTTCTT
+SRR16248559.4 4 length=66
AAAAA#AEEEEEEEEEEEEE/EEEEEEEE6EEEEEEEEEEE6EEEEEEEE6EEEEEA/EEAAEEEE
@SRR16248559.5 5 length=66
GAATGNCATGTTGTTCACAATTGTATGTTGAAATTGAACACAGTAATGATAAGCACTAAAAAAAAA
+SRR16248559.5 5 length=66
AAAAA#EAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE

and Read2

@SRR16248559.1 1 length=94
NNCGTCAGGACAAGGAGCTTGCACGATGAGTAGCATCACCCTACGACTATGCCTAAATCCACGTGCTTGAGAGGCCAGAGCATTCGAGTACAAG
+SRR16248559.1 1 length=94
##AAAAEAEAAEE6EEEEA/////E//EE/<///6E6//A/6<<AEA<A/<EAEEEA<<EAEE6EE6<EEE/EE6AA<<A6/6<6<<EEEEEE/
@SRR16248559.2 2 length=94
NNATTATGGTGTGTTCTAGAGTTAGATGAGGCGCATCGGCGTACGACTCGGATTGCATCCACGTGCTTGAGAGGCCAGAGCATTCGCCTCTATC
+SRR16248559.2 2 length=94
##A/A/AEEEEEEEEEEE/E/////////E///EAE<//EA//6<E6/6EEEEE6E<6<E<AEAEE66A/E/EE<6A<<<666<6<6AEEE<EA
@SRR16248559.3 3 length=94
NNAGGGATATGTCTGTCAGTGGGCGATGGGGCGCATCGGCGGACGACTACAGATTCATCCACGTGCTTGAGAGGCCAGAGCATTCGTGAAGAGA
+SRR16248559.3 3 length=94
##AAAEEEEEEEEEEEEE///E////EE//////E/EEAEA/6//E<66AEEEEEEAA/<6EEAE<6AAAEEE6</<6AA666<66E<E<AEA/
@SRR16248559.4 4 length=94
NNCCTCCGAGACCACTGTCTGTACGCTGTGTAGCATCGGCGTAAGACTACATTGGCATCCACGTGCTTGAGAGGCCAGAGCATTCGCCGTGAGA
+SRR16248559.4 4 length=94
##6AAEE/E//EAEEEEE////E/A////AA//A</</AAE/6/<A<6AEEEEEAEAEA/6EEEEA/AEEEEEE<6A<AA666A6<AEEEEAE/
@SRR16248559.5 5 length=94
NNGCATAATGCGCATACAGCGGCCGCTGGGTCGCAGCGGCGGACGACTTTCACGCAATTCACGTGCTTGAGAGGCCAGAGCATTCGAGTACAAG
+SRR16248559.5 5 length=94
##AAAEEEEEEEEEEEEA//<//////A//<///</<<AEE/</AE<6/EEAEEEEAA/E/EEEEA6AE/EAEEAA/<AA666<6<EEAEEEEE

I want to add that I tried the tool with dropseq samples and that worked with no errors. The error must be related only to the splitseq technology.

TomKellyGenetics commented 1 year ago

Thanks for sharing this information. I've managed to reproduce this issue on my system and confirmed that this codeblock specific to this technology is causing the problem. The files were correctly created in the input4cellranger directory and after executing this code the R2 FASTQ file is empty due to the SED command giving invalid output.

This means I can test a solution on my system and update the source code on GitHub when I have one. I'll push it to the "dev" branch for the development version and notify you when it is ready. Apologies for the inconvenience. It appears to be an oversight on my part when integrating this technology into the pipeline.

Reassuringly, no other technologies should be affected by this issue, as you've noted. We tested it extensively with Drop-Seq data (this was our original motivation to create UniverSC in the first place, actually) and I am pleased to hear that others such as yourself recognise the need for this and that it is working for them.

TomKellyGenetics commented 1 year ago

@kbattenb I confirmed that it is the 2nd sed call in this subroutine that is failing, the 1st works as expected. I will handle this and check the split-seq specifications to ensure it is correct.

I've had some issues with NCBI SRA IDs in the FASTQ headers before but I think I have addressed them while testing published data for the paper.

Note this study uses NextSeq, which sequences the reverse complement of the R2 sequence, so the barcodes are read in reverse order. The 2nd SED call is intended to address this possibility but it fails due to mismatches in adapter sequences. I'll update the regular expressions to account for this.

https://www.ncbi.nlm.nih.gov/sra/?term=SRR16248559
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA769637
https://www.nature.com/articles/s41586-022-04912-w

SRX12528354: GSM5618237: Specimen_7; Homo sapiens; RNA-Seq
2 ILLUMINA (NextSeq 550) runs: 222.3M spots, 35.6G bases, 15.3Gb downloads

Submitted by: NCBI (GEO)
Study: Dissecting the transcriptome landscape of the human hippocampus (PRJNA769637, SRP340513)
Sample: Specimen_7 (SAMN22155633, SRS10488618)
Organism: Homo sapiens
Library:
  Instrument: NextSeq 550
  Strategy: RNA-Seq
  Source: TRANSCRIPTOMIC
  Selection: cDNA
  Layout: PAIRED

Note this protocol uses the Split-Seq method (Rosenberg et al., 2018) with modifications. This may be the Split-Seq v2 referred to here (which has different adapter sequences). https://github.com/COMBINE-lab/salmon/issues/699#issuecomment-951080577

It should be possible to support this but I may need more time to get a demo working.

TomKellyGenetics commented 1 year ago

The development version supports Split-Seq v1 adapters. You can try it with:

git pull https://github.com/minoda-lab/universc.git dev
git checkout dev

I can run UniverSC on this public data and call Cell Ranger without errors. This corrects a syntax error in handling quality scores (introduced when correcting bugs in v1.2.3.4, discussed above) and ensures that adapter sequences are removed by correctly matching the sequences given in the above example. Thanks to your feedback we are able to support additional techniques like this.

Note this does not support Split-Seq v2 adapters (yet). The public data provided has the longer adapters expected for Split-Seq v1, as cited in Rosenberg et al. (2018). Some mismatched adapter sequences are permitted, but frameshifts will cause mismatched barcodes to be skipped, as barcodes are assumed to be a fixed distance apart (consistent with how salmon/alevin and zUMIs handle this). The "NN" bases at the beginning of R2 sequences are automatically removed.
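
To illustrate the last point, here is a minimal sketch of trimming leading bases from a FASTQ file while keeping the sequence and quality lines the same length. It is not the code used in launch_universc.sh; the function name trim_fastq_head and the file names are hypothetical, and it assumes uncompressed FASTQ with exactly 4 lines per record.

```shell
# Sketch only: trim_fastq_head is a hypothetical helper, not part of UniverSC.
# Drops the first N characters (default 2) from the sequence line (line 2)
# and the quality line (line 4) of every 4-line FASTQ record, so that both
# lines stay the same length. Header and "+" lines are printed unchanged.
trim_fastq_head() {
    awk -v n="${2:-2}" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next }
        { print }
    ' "$1"
}

# Example usage (hypothetical file names):
# trim_fastq_head sample_R2.fastq 2 > sample_R2_trimmed.fastq
```

Trimming by line position rather than by pattern avoids misreading quality lines that happen to start with "@" or "+".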

If the adapters do not match, it will be skipped and the reverse complement will be attempted instead. The UMI is automatically moved to the end after the barcode sequences, and the barcode order (B3-B1) is corrected (to B1-B3). However, each BC or UMI sequence will still be the reverse complement. It may be necessary to use a barcode whitelist for Split-Seq with barcode sequences in the reverse complement (try this if the number of cells detected is far lower than expected). No change is needed for UMIs as they will still be unique sequences.

Please try running the "dev" version (1.2.5.2-dev) and contact us if you have trouble.

ayyildizd commented 1 year ago

Hi Tom,

Thanks for the effort. I pulled the dev version and ran the same sample as above, and unfortunately it is still not working. I saw that R2 is still empty. Do you think it is a matter of the whitelist?

Please see the log below:

script running in {toolsdir}/universc_dev/launch_universc.sh...
... script called from {projectdir}
Running launch_universc.sh in '{projectdir}'
/usr/bin/which: no launch_universc.sh in ({toolsdir}/universc_dev/launch_universc.sh:{toolsdir}/cellranger-3.0.2:{homedir}/.local/bin:{homedir}/bin:/gpfs/admin/hpc/sw/hpc/bin:/gpfs/admin/hpc/sw/hpc/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
UniverSC Copyright (C) 2019 Tom Kelly; Kai Battenberg
This program comes with ABSOLUTELY NO WARRANTY; for details type 'cat LICENSE'. This is free software, and you are welcome to redistribute it under certain conditions; type 'cat LICENSE' for details.
Cell Ranger is called as third-party dependency and is not maintained by this project. Please ensure you comply with the End User License Agreement for all software installed where applicable; for details type 'cat README.md'.
***WARNING: technology is set to splitseq. barcodes on Read 2 will be used***
basename: missing operand
Try 'basename --help' for more information.
cut: invalid field range
Try 'cut --help' for more information.
***WARNING: filename  is not following the naming convention. (e.g. EXAMPLE_S1_L001_R1_001.fastq)***
basename: missing operand
Try 'basename --help' for more information.
cut: invalid field range
Try 'cut --help' for more information.
***WARNING: filename  is not following the naming convention. (e.g. EXAMPLE_S1_L001_R1_001.fastq)***
Error: option --reference is required
script running in {toolsdir}/universc/launch_universc.sh...
... script called from {projectdir}
Running launch_universc.sh in '{projectdir}'
/usr/bin/which: no launch_universc.sh in ({toolsdir}/universc_dev/launch_universc.sh:{toolsdir}/cellranger-3.0.2:{homedir}/.local/bin:{homedir}/bin:/gpfs/admin/hpc/sw/hpc/bin:/gpfs/admin/hpc/sw/hpc/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
UniverSC Copyright (C) 2019 Tom Kelly; Kai Battenberg
This program comes with ABSOLUTELY NO WARRANTY; for details type 'cat LICENSE'. This is free software, and you are welcome to redistribute it under certain conditions; type 'cat LICENSE' for details.
Cell Ranger is called as third-party dependency and is not maintained by this project. Please ensure you comply with the End User License Agreement for all software installed where applicable; for details type 'cat README.md'.
    {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R1_001.fastq file found
    {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R2_001.fastq file found
    {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_I1_001.fastq file found
***WARNING: technology is set to splitseq. barcodes on Read 2 will be used***
Using 10x version 2 chemistry to support UMIs
***WARNING: conversion was turned on because directory input4cellranger_test_SRR16248559 was not found***
 checking if UniverSC is running already
  creating .lock file

#####Input information#####
SETUP and exit: false
FORMAT: splitseq
BARCODES: {projectdir}/whitelists/splitseq_barcode.txt
INPUT(R1):
 {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R2_001.fastq
INPUT(R2):
 {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R1_001.fastq
SAMPLE: SRR16248559
LANE: 1
ID: test_SRR16248559
DESCRIPTION: test_SRR16248559
***WARNING: no description given, setting to ID value***
REFERENCE: {toolsdir}/genomes/reference_cellranger_3.0.2/GRCh38-3.0.0.premrna
NCELLS: (no cell number given)
CHEMISTRY: SC3Pv2
JOBMODE: local
***WARNING: --jobmode "sge" is recommended if running script with qsub***
CONVERSION: true
##########

whitelist setup begin
updating barcodes in {toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/barcodes for Cell Ranger version 3.0.2 installed in {toolsdir}/cellranger-3.0.2/cellranger ...
 restoring Cell Ranger
sed: can't read {toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/check.py: No such file or directory
 {toolsdir}/cellranger-3.0.2/cellranger set for splitseq
 converting whitelist
barcode adjust: 0
 whitelist converted
verbose 
setup complete
running in local mode (no cluster configuration needed)
creating a folder for all Cell Ranger input files ...
 directory input4cellranger_test_SRR16248559 created for converted files
moving file to new location
 handling {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R2_001.fastq ...
 handling {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_R1_001.fastq ...
handling {homedir}/{projectdir}/fastq_files/SRR16248559_S1_L001_I1_001.fastq ...
converting input files to confer cellranger format ...
 adjustment parameters:
  barcodes: 0 bp at its head
  UMIs: 0 bp at its tail
 making technology-specific modifications ...
  ... remove adapter and phase blocks for splitseq
sed: -e expression #1, char 381: unknown option to `s'
 adjusting barcodes of R1 files
 adjusting UMIs of R1 files
running Cell Ranger ...

#####Cell Ranger command#####
cellranger count --id=test_SRR16248559\
        --fastqs=input4cellranger_test_SRR16248559\
        --lanes=1\
        --r1-length=34\
        --chemistry=SC3Pv2\
        --transcriptome={toolsdir}/genomes/reference_cellranger_3.0.2/GRCh38-3.0.0.premrna\
        --sample=SRR16248559\
        --description=test_SRR16248559\
        \
        --jobmode=local\
        \

##########
{toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/bin
cellranger count (3.0.2)
Copyright (c) 2019 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - '3.0.2-v3.2.0'
Serving UI at http://tcn381.local.snellius.surf.nl:35541?auth=ON5vpEJOVmxOrq9JDci5-zvPxC8pBU1UeM6w1dlMEUE

Running preflight checks (please wait)...
2023-02-28 09:42:02 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF
2023-02-28 09:42:02 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF.fork0.chnk0.main
2023-02-28 09:42:10 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.EXPAND_SAMPLE_DEF
Checking sample info...
Checking FASTQ folder...
Checking reference...
Checking reference_path ({toolsdir}/genomes/reference_cellranger_3.0.2/GRCh38-3.0.0.premrna) on tcn381.local.snellius.surf.nl...
Checking chemistry...
Checking read 1 length...
Checking optional arguments...
mrc: '3.0.2-v3.2.0'

mrp: '3.0.2-v3.2.0'

Anaconda: Python 2.7.14 :: Anaconda, Inc.

numpy: 1.14.2

scipy: 1.0.1

pysam: 0.14.1

h5py: 2.8.0

pandas: 0.22.0

STAR: STAR_2.5.1b

samtools: samtools 1.7
Using htslib 1.7
Copyright (C) 2018 Genome Research Ltd.

2023-02-28 09:42:11 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION
2023-02-28 09:42:11 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION.fork0.chnk0.main
2023-02-28 09:42:11 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES
2023-02-28 09:42:11 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES.fork0.chnk0.main
2023-02-28 09:42:11 [runtime] (ready)           ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY
2023-02-28 09:42:11 [runtime] (run:local)       ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0.main
2023-02-28 09:42:11 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.SC_RNA_ANALYZER.CHOOSE_DIMENSION_REDUCTION
2023-02-28 09:42:11 [runtime] (chunks_complete) ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.DISABLE_FEATURE_STAGES
2023-02-28 09:42:12 [runtime] (failed)          ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY

[error] You selected chemistry 'SC3Pv2', which expects the cell barcode sequence in read R1.
In input data, an extremely low rate of correct barcodes was observed for this chemistry (0.00 %).
Please check your input data and chemistry selection. Note: manual chemistry detection is not required in most cases.
Input: {'lanes': [u'1'], 'sample_names': [u'SRR16248559'], 'sample_indices': None, 'fastq_mode': u'ILMN_BCL2FASTQ', 'read_path': u'{projectdir}/input4cellranger_test_SRR16248559', 'interleaved': False}

2023-02-28 09:42:12 Shutting down.
Waiting 6 seconds for UI to do final refresh.
Saving pipestance info to test_SRR16248559/test_SRR16248559.mri.tgz
For assistance, upload this file to 10x Genomics by running:

cellranger upload <your_email> test_SRR16248559/test_SRR16248559.mri.tgz

cellranger run complete
***Notice: Cloupe file cannot be computed for splitseq
           Cloupe files generated by this pipeline are corrupt
           and cannot be read by the 10x Genomics Loupe Browser.
           We do not provide support for Cloupe files as this
           requires software from 10x Genomics subject to their
           End User License Agreement.
           Cloupe files are disabled in compliance with this.
updating .lock file
 no other jobs currently run by cellranger 3.0.2 in {toolsdir}/cellranger-3.0.2/cellranger
 no conflicts: whitelist can now be changed for other technologies
replacing modified barcodes with the original in the output gene barcode matrix
Can't open perl script "{projectdir}/sub/RecoverBarcodes.pl": No such file or directory
barcodes recovered

#####Conversion tool log#####
cellranger 3.0.2

Original barcode format: splitseq (then converted to 10x)

cellranger runtime: 15s
##########

TomKellyGenetics commented 1 year ago

Sorry I am afraid it appears to be the same error. I think it may be a problem merging changes from the development branch with git, please try running git checkout dev (change to dev branch) or git merge dev (add changes to master branch).

You can verify the branch or version is correct as follows:

git branch -v --list
bash launch_universc.sh --version

This should display which branch you are using and the dev version number above. You may need to press “q” to quit the branch list, the current branch is listed with “*”.

Note you may also need to update the barcode whitelist. Ideally I’d like to update the default whitelist for this technology but you can use custom inputs to test a whitelist if that is faster than waiting for us developers to add support for it.

Provided each cell has a unique barcode and they match the whitelist, it should give valid QC and results. You may need to run rev file.txt | tr "ATCG" "TAGC" > new_file.txt to generate the reverse complement for each barcode (rev reverses each line and tr swaps each base for its complement). It is also possible to generate combinations of barcodes 1-3 (8bp each) with cut or substr and join. I’d advise against using all permutations for the whitelist for barcodes as long as 24bp, as the memory requirements will be far higher than filtering for known valid barcodes. Using all combinations of barcodes 1-3 is computationally feasible if the 8bp sequences are known.
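
As a rough sketch of the two whitelist operations described above, the functions below reverse-complement a list of barcodes and concatenate every combination of three barcode lists. The function names (revcomp, combine3) and all file names are hypothetical, and the input files are assumed to hold one barcode per line with a trailing newline. Whether the reverse complement is needed, and in which round order the barcodes should be joined, depends on your data.

```shell
# Sketch only: revcomp and combine3 are hypothetical helpers, not part of UniverSC.

# Reverse-complement every barcode (one per line) read from stdin.
# rev reverses each line; tr swaps each base for its complement.
revcomp() {
    rev | tr "ACGTacgt" "TGCAtgca"
}

# Print every combination of three barcode lists, concatenated in the order
# the files are given (e.g. 8 + 8 + 8 = 24 bp per output line).
combine3() {
    local a b c
    while read -r a; do
        while read -r b; do
            while read -r c; do
                printf '%s%s%s\n' "$a" "$b" "$c"
            done < "$3"
        done < "$2"
    done < "$1"
}

# Example usage (hypothetical file names):
# revcomp < round1_barcodes.txt > round1_barcodes_rc.txt
# combine3 round1.txt round2.txt round3.txt > combined_whitelist.txt
```

The nested loop is slow for large lists but keeps memory use flat, since only known barcodes are combined rather than every possible 24bp sequence.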

The commands below reset Cell Ranger to its original settings and then force configuration of the new whitelist (in case it conflicts with your existing installation). This may also be useful to switch to dropseq settings. You may also need to delete the .lock files listed in the logs to run a new technology if a previous run aborted without completing (only do this if you know no other UniverSC runs are in progress).

bash launch_universc.sh -t "10x" --setup
bash launch_universc.sh -t "splitseq" -b ./path_to/my_bc_file.txt --setup

I’ve added support for both versions of splitseq, and it is backwards compatible (splitseq and splitseq-v1 are aliases for the same setting). Use v1, not v2, for this data.

ayyildizd commented 1 year ago

Hi Tom,

I guess the problem is I am not able to pull dev branch

$ git pull https://github.com/minoda-lab/universc.git dev
From https://github.com/minoda-lab/universc
 * branch            dev        -> FETCH_HEAD
Already up to date.
$ git checkout dev
error: pathspec 'dev' did not match any file(s) known to git

TomKellyGenetics commented 1 year ago

Oh I see now, I’m relieved actually. Hopefully the updated script will work for you once you have it merged.

As for git settings, I think the problem is cloning the repo only copied the master branch. You’ll need to create the dev branch on your local repository and pull updates.

git checkout -b dev HEAD
git remote add upstream https://github.com/minoda-lab/universc.git
git fetch upstream dev
git merge upstream/dev

If the issue persists with the updated script, let me know and I will try to test it again. A common issue is that sequence and quality scores are different lengths in converted fastq files but it should be avoided in this case.
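
A quick way to check for that problem (and for the empty R2 file seen earlier in this thread) is to compare the length of the sequence and quality lines in each record. This is only a sketch: check_fastq_lengths is a hypothetical helper, not part of UniverSC, and it assumes uncompressed FASTQ with exactly 4 lines per record. The example file name follows the paths shown in the logs above.

```shell
# Sketch only: check_fastq_lengths is a hypothetical helper, not part of UniverSC.
# Prints a message for each problem found and returns non-zero if the file is
# empty, ends mid-record, or has any record whose sequence and quality lines
# differ in length.
check_fastq_lengths() {
    awk '
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 0 && length($0) != seqlen {
            printf "record %d: sequence %d bp, quality %d chars\n", NR / 4, seqlen, length($0)
            bad++
        }
        END {
            if (NR == 0)     { print "file is empty"; bad++ }
            if (NR % 4 != 0) { print "incomplete final record"; bad++ }
            exit (bad > 0)
        }
    ' "$1"
}

# Example usage:
# check_fastq_lengths input4cellranger_test_SRR16248559/SRR16248559_S1_L001_R2_001.fastq \
#     && echo "FASTQ looks consistent"
```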

ayyildizd commented 1 year ago

Thanks Tom!

I created a white list using all possible permutations of barcodes 8bp long.

Next I tried this code:

bash launch_universc.sh -t "splitseq" -b ./path_to/AllPossibilities_8_barcodes.txt --setup

Setup flag works with 10x but not with splitseq, it expects to see all the files:

***WARNING: technology is set to splitseq. barcodes on Read 2 will be used***
basename: missing operand
Try 'basename --help' for more information.
cut: invalid field range
Try 'cut --help' for more information.
***WARNING: filename  is not following the naming convention. (e.g. EXAMPLE_S1_L001_R1_001.fastq)***
basename: missing operand
Try 'basename --help' for more information.
cut: invalid field range
Try 'cut --help' for more information.
***WARNING: filename  is not following the naming convention. (e.g. EXAMPLE_S1_L001_R1_001.fastq)***
Error: option --reference is required

Then I ran your pipeline, passing this new whitelist with the -b option, but this time I get a different error after Cell Ranger starts:

[error] Pipestance failed. Error log at:
test_SRR16248559/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/CHUNK_READS/fork0/chnk0-u5f60ff69b4/_errors

Log message:
Traceback (most recent call last):
  File "{toolsdir}/cellranger-3.0.2/martian-cs/v3.2.0/adapters/python/martian_shell.py", line 590, in _main
    stage.main()
  File "{toolsdir}/cellranger-3.0.2/martian-cs/v3.2.0/adapters/python/martian_shell.py", line 555, in main
    self._run(lambda: self._module.main(args, outs))
  File "{toolsdir}/cellranger-3.0.2/martian-cs/v3.2.0/adapters/python/martian_shell.py", line 524, in _run
    cmd()
  File "{toolsdir}/cellranger-3.0.2/martian-cs/v3.2.0/adapters/python/martian_shell.py", line 555, in <lambda>
    self._run(lambda: self._module.main(args, outs))
  File "{toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/mro/stages/common/chunk_reads/__init__.py", line 53, in main
    tk_subproc.check_call(chunk_reads_args)
  File "{toolsdir}/cellranger-3.0.2/cellranger-cs/3.0.2/tenkit/lib/python/tenkit/log_subprocess.py", line 37, in check_call
    return subprocess.check_call(*args, **kwargs)
  File "{toolsdir}/cellranger-3.0.2/miniconda-cr-cs/4.3.21-miniconda-cr-cs-c10/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['chunk_reads', '--reads-per-fastq', '5000000', '{projectdir}/test_SRR16248559/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/CHUNK_READS/fork0/chnk0-u5f60ff69b4/files/', 'fastq_chunk', '--martian-args', 'chunk_args.json', '--compress', 'lz4']' returned non-zero exit status 1

Waiting 6 seconds for UI to do final refresh.
Pipestance failed. Use --noexit option to keep UI running after failure.

2023-03-01 16:06:29 Shutting down.
Saving pipestance info to test_SRR16248559/test_SRR16248559.mri.tgz
For assistance, upload this file to 10x Genomics by running:

cellranger upload <your_email> test_SRR16248559/test_SRR16248559.mri.tgz

cellranger run complete
***Notice: Cloupe file cannot be computed for splitseq
           Cloupe files generated by this pipeline are corrupt
           and cannot be read by the 10x Genomics Loupe Browser.
           We do not provide support for Cloupe files as this
           requires software from 10x Genomics subject to their
           End User License Agreement.
           Cloupe files are disabled in compliance with this.
updating .lock file
 no other jobs currently run by cellranger 3.0.2 in {toolsdir}/cellranger-3.0.2/cellranger
 no conflicts: whitelist can now be changed for other technologies
replacing modified barcodes with the original in the output gene barcode matrix
Can't open perl script "{projectdir}/sub/RecoverBarcodes.pl": No such file or directory
barcodes recovered

#####Conversion tool log#####
cellranger 3.0.2

Original barcode format: splitseq (then converted to 10x)

cellranger runtime: 95s
##########

TomKellyGenetics commented 1 year ago

Hi Dilara,

Thanks for reporting the issue with --setup. It seems to expect FASTQ inputs when a barcode file is given, which is unexpected behaviour. As a workaround, you can give it all input parameters and --setup will still force it to exit without running. The flag is a convenience to avoid conflicts between concurrent jobs; if you run 1 sample at a time, the script will automatically detect which technology ran before and update Cell Ranger settings as needed.

CalledProcessError: Command '['chunk_reads', '--reads-per-fastq', '5000000', '{projectdir}/test_SRR16248559/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/CHUNK_READS/fork0/chnk0-u5f60ff69b4/files/', 'fastq_chunk', '--martian-args', 'chunk_args.json', '--compress', 'lz4']' returned non-zero exit status 1

This is an error we've encountered before. It is due to FASTQ files not being available or being too small. Please check the files in the input4cellranger- directory. Ensure R1 and R2 have the same number of lines and the sequence and quality scores are the same length (it could be a bug with the patch I tested last week).
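The checks described above (equal record counts in R1 and R2, and sequence and quality strings of equal length) can be sketched in a few lines of Python. This is only an illustration and not part of UniverSC; the name `check_fastq` is made up here, and the sketch assumes uncompressed FASTQ with exactly four lines per record.

```python
import io
from itertools import zip_longest


def check_fastq(handle):
    """Return (record_count, problems) for a 4-line-per-record FASTQ stream."""
    lines = (line.rstrip("\r\n") for line in handle)
    problems = []
    count = 0
    # Passing the same iterator four times walks the file one record at a time.
    for header, seq, plus, qual in zip_longest(lines, lines, lines, lines):
        if qual is None:
            problems.append(f"truncated record after {count} complete records")
            break
        count += 1
        if not header.startswith("@") or not plus.startswith("+"):
            problems.append(f"record {count}: malformed header or separator line")
        if len(seq) != len(qual):
            problems.append(
                f"record {count}: {len(seq)} bases but {len(qual)} quality scores"
            )
    return count, problems


# Hypothetical in-memory stand-ins for the converted R1 and R2 files.
r1 = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n")
r2 = io.StringIO("@read1\nACGTAC\n+\nIIII\n")
n1, problems1 = check_fastq(r1)
n2, problems2 = check_fastq(r2)
print(n1 == n2)    # True: both files hold one record
print(problems2)   # ['record 1: 6 bases but 4 quality scores']
```

For real data, open each file in the input4cellranger directory with `open(path)` and pass the handle to the function; the two record counts should match and both problem lists should be empty.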

Note that the whitelist for splitseq needs to be 24 bp in length as there are 3 barcodes joined together. With all permutations of 8 bp BCs, you can then generate all combinations of [BC1]-[BC2]-[BC3]. Here is an example of the code doing that. https://github.com/minoda-lab/universc/blob/7dc550ea9ffd4e5ddb54a5cf8d10bc679b1d68cb/launch_universc.sh#L1916 https://github.com/minoda-lab/universc/blob/7dc550ea9ffd4e5ddb54a5cf8d10bc679b1d68cb/launch_universc.sh#L2017-L2018

This will allow all possible barcodes to run but may not guarantee they are correct. If the number of cells is very low, it is possible the adapter sequences were removed incorrectly. I've used similar specifications to other tools supporting this technology (zUMIs, salmon/alevin, dropEst) but the reverse complement is needed for NextSeq and NovaSeq and as others have noted, it is computationally challenging with variability in the adapters.

I hope this helps to narrow down the problem. Please check the results carefully, as this is one of the more difficult technologies for us to handle.

ayyildizd commented 1 year ago

Hi Tom, thanks for the suggestions; I will try --setup that way. But what you suggest for the whitelist is not feasible, right? My 8 bp barcode list contains all permutations of A, T, G and C, which gives 4^8 = 65,536 possible barcodes. If I use this list to create all possible 24 bp barcode combinations with the 'join' function, it would create a file with 4^24 lines! As a test, I tried your code with 'join' on only 2 files: the output file contained 16 bp barcodes and had 4^16 lines (it was >60 GB). Besides, as you said, this will allow all barcodes to pass, and I am not sure how many false positives that would bring. Do you think I cannot proceed with split-seq without providing a whitelist in this way? Ideally, the program would take a defined list for each of barcodes 1-3 and pass only the reads (maybe allowing 1-2 mismatches) whose barcodes are in these lists in the correct order, right?

TomKellyGenetics commented 1 year ago

Hi Dilara,

Sorry to hear you still have trouble. Let's see if we can help.

We've run barcodes as long as 16 bp with all combinations (4^16), but it is slow for downstream analyses if invalid barcodes are filtered out (even on a server with plenty of memory). All combinations of 3 x 8 bp segments would be (4^8)^3 = 4^24, so the whitelist would be far too large. It is better to use the known barcodes for the technology to avoid this.
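To put numbers on this, here is a rough back-of-the-envelope sketch. It assumes a plain-text whitelist with one barcode per line and one byte per character plus a newline; the helper name `whitelist_size` is made up for illustration.

```python
def whitelist_size(n_barcodes, barcode_length):
    """Return (lines, bytes) for a text file with one barcode per line."""
    return n_barcodes, n_barcodes * (barcode_length + 1)


cases = {
    "all 8 bp sequences": (4**8, 8),
    "all 16 bp sequences": (4**16, 16),
    "all 24 bp sequences": (4**24, 24),
}
for label, (count, length) in cases.items():
    lines, size = whitelist_size(count, length)
    print(f"{label}: {lines:,} lines, about {size / 1e9:,.3f} GB")
```

Under these assumptions the 16 bp file comes to roughly 73 GB, in line with the ">60 GB" reported above, while the 24 bp file would be on the order of 7 million GB.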

For example, the barcodes are given in kharchenkolab/dropEst/data/barcodes/split_seq, where each row is BC1, BC2, or BC3. The three rows appear to be identical, with 96 barcodes each. I recall checking that this was consistent with the supplementary data for Rosenberg et al. (2018).

Therefore all ordered triples drawn from this whitelist should give the correct 96^3 (884,736) barcodes. This is around the same number as 10x v2, so it should be supported by Cell Ranger.

ACAGTGGT
ACTTGATG
ATCACGTT
CAGATCTG
CGATGTTT
CTTGTACT
GAATCTGT
GACCTTAG
GACGGATT
GAGCCAAT
GAGGATGG
GAGGTGCT
GATAGAGG
GATCAGCG
GATCTCTT
GATTCATC
GCAACATT
GCAATCCG
GCACTGTC
GCATGGCT
GCCAATGT
GCCTGTTC
GCTAACTC
GCTCCTTG
GGAATGAT
GGATTAGG
GGCTACAG
GGTCGTGT
GGTGAGTT
GTAAGGTG
GTACATCT
GTCGCTAT
GTCTTGGC
GTGTCCTT
GTGTGTCG
GTTAGCCT
GTTGTCGG
TAACGCTG
TAAGCGTT
TAAGTTCG
TACAGGAT
TACCACCA
TACCGAGC
TACTAGTC
TACTTCGG
TAGAACAC
TAGACGGA
TAGCTTGT
TAGTCTTG
TAGTGACT
TATGCCAG
TATGTGGC
TCAGATTC
TCAGGAGG
TCATCCTA
TCATTGAG
TCCAGTCG
TCCGTCTT
TCCTCAAT
TCGAAGTG
TCGAGCGT
TCGTTAGC
TCTACGAC
TCTCACGG
TCTCGGTT
TCTCTTCA
TCTGCTGT
TGAACTGG
TGAAGCCA
TGACAGAC
TGACCACT
TGATACGT
TGCATAGT
TGCGATCT
TGCGTGAA
TGCTGATA
TGGCTCAG
TGGTTGTT
TGTACCTT
TGTATGCG
TGTCTATC
TGTGAAGA
TGTGGTTG
TGTTCTCC
TTACTCGC
TTAGGCAT
TTCAGCTC
TTCCATTG
TTCCTGCT
TTCGCACC
TTCTGTGT
TTGACTCT
TTGCGTAC
TTGGAGGT
TTGGTATG
TTGTTCCA
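As a minimal sketch of how the combined 24 bp whitelist could be built from the per-round list above: this is an illustration in Python rather than the code in launch_universc.sh, and the BC1+BC2+BC3 concatenation order and the name `combine_rounds` are assumptions made for the example.

```python
from itertools import product


def combine_rounds(round1, round2, round3):
    """Yield every concatenation BC1+BC2+BC3 of the per-round barcodes."""
    for bc1, bc2, bc3 in product(round1, round2, round3):
        yield bc1 + bc2 + bc3


# Three barcodes stand in for the full 96-entry list to keep the demo small.
demo = ["ACAGTGGT", "ACTTGATG", "ATCACGTT"]
combined = list(combine_rounds(demo, demo, demo))
print(len(combined))   # 27, i.e. 3^3
print(combined[0])     # ACAGTGGTACAGTGGTACAGTGGT
```

With the full list of 96 barcodes in each round, the same function yields 96^3 = 884,736 entries of 24 bp each, which fits in a text file of about 22 MB.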

The potential issues remaining to match this with your sequences are whether a reverse complement is needed (I can check this on your example data above) and if the adapters are aligned correctly. Notably 94 cycles matches the specifications in zUMIs and dropEst example configurations but there are leading NN basecalls in the R2 FASTQ file with poor quality scores. I've adjusted the adapter trimming to remove these but that may cause mismatches to the whitelist.

It is possible to account for mismatches, but this is not currently supported by our scripts. Generally, I am not sure it is beneficial: accurate UMI sequences are necessary to get accurate count data, and with sufficiently deep sequencing, errors in barcodes will be filtered out as low-coverage cells while the same molecules will be re-sequenced with the same UMI. There are diminishing returns from implementing this. It is also a difficult problem for us because, unlike technology-specific pipelines, we cannot assume that the technology a user runs was designed with barcodes that differ enough (by Hamming or Levenshtein distance) to prevent a mismatch from swapping a barcode with that of another cell. However, our pipeline is compatible with pre-processed reads, provided they are all the same length and contain the full barcodes without truncation.
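Whether mismatch correction would be safe for a given whitelist can be checked before attempting it. The sketch below is illustrative only (the helper names are made up and it is not a UniverSC feature): it computes the minimum pairwise Hamming distance of a barcode list. A single substitution can be corrected unambiguously only when that minimum is at least 3.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(barcodes):
    """Smallest Hamming distance between any two barcodes in the list."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Toy whitelist: AAAA and AATT differ at only 2 positions, so a read with
# one error (e.g. AAAT) would be equally close to both.
toy = ["AAAA", "AATT", "TTTT"]
print(min_pairwise_hamming(toy))   # 2
```

The pairwise comparison is quadratic in the number of barcodes, so it is practical for a per-round list of 96 entries but not for a combined whitelist of several hundred thousand.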

It is possible to generate consensus reads using the UMI parameters in "fastp", for example. I'd also recommend trimming poor-quality R2 reads and filtering out trimmed reads that are shorter than 94 bp. Make sure to use a tool that supports paired-end reads, or re-match the R1 reads after processing (https://github.com/linsalrob/fastq-pair). In my experience, mapping trimmed reads also gives a higher count per cell.
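The pairing requirement is the part that is easy to get wrong, so here is a small hedged sketch of length filtering that keeps R1 and R2 in sync. It is not a replacement for fastp or fastq-pair; the function name and the (header, sequence, quality) tuple layout are assumptions for the example, and both inputs are assumed to list the same reads in the same order.

```python
def filter_pairs(r1_records, r2_records, min_length=94):
    """Yield (r1, r2) pairs whose R2 sequence has at least min_length bases.

    Each record is a (header, sequence, quality) tuple. A pair is dropped as
    a whole, so the two outputs always contain the same reads.
    """
    for r1, r2 in zip(r1_records, r2_records):
        if len(r2[1]) >= min_length:
            yield r1, r2


# Synthetic records: the second R2 read was trimmed to 60 bases.
r1_in = [("@a", "ACGT", "IIII"), ("@b", "TTTT", "IIII")]
r2_in = [("@a", "C" * 94, "I" * 94), ("@b", "G" * 60, "I" * 60)]
kept = list(filter_pairs(r1_in, r2_in))
print([r1[0] for r1, _ in kept])   # ['@a']
```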

I'll compare the expected barcodes to the public data you've shared and update the script if necessary to match them correctly.

TomKellyGenetics commented 1 year ago

I confirmed the reverse-complement of the above whitelist matches the barcodes in R here: Note that this indicates that the 10 bp UMI sequences (bases 10-1) includes the leading NN from the beginning of R2. I'll need to adjust the expected barcode locations for B3 (to bases 18-11 in R2). If the "NN" UMI sequence is incompatible with Cell Ranger, we may need to remove it by hard trimming and set UMI length to 8 bp for SPLiT-Seq. I can automate this as it requires changes to the code to support it.

I also checked with diff and md5sum that these match the barcodes already bundled in the UniverSC whitelists.

5924cc3e7693e782ea2efb4543431822 whitelist.revcomp.txt
5924cc3e7693e782ea2efb4543431822 whitelists/split-seq_round3_barcodes.txt

However, whitelists/splitseq_barcode.txt generated automatically from these is empty so I will check this script for bugs. https://github.com/minoda-lab/universc/blob/7dc550ea9ffd4e5ddb54a5cf8d10bc679b1d68cb/launch_universc.sh#L2075-L2080

Kai: I'll handle this issue.

Dilara: please wait until the development version is updated, as the current version filters adapters incorrectly.

TomKellyGenetics commented 1 year ago

It should also be possible to support both the original barcodes and their reverse complements. All but 1 of the 96 barcodes are non-palindromic (they differ from their own reverse complement), so there are 96*2 - 1 = 191 unique sequences per round and 191^3 (6,967,871) permutations. 10x v3 uses over 3 million barcodes, so this would require around twice as much memory as a 10x run.
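The counting argument can be reproduced with a short sketch. It is only an illustration (the helper names are made up and this is not the UniverSC implementation): take the union of a barcode list and its reverse complements, so that a sequence equal to its own reverse complement is counted once.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(barcodes):
    """Sorted union of the barcodes and their reverse complements."""
    return sorted(set(barcodes) | {reverse_complement(b) for b in barcodes})


print(reverse_complement("ACAGTGGT"))   # ACCACTGT

# Toy list: ACGT is its own reverse complement, AAAA is not.
toy = ["ACGT", "AAAA"]
print(with_reverse_complements(toy))    # ['AAAA', 'ACGT', 'TTTT']
```

Applied to a 96-entry round list in which exactly one barcode is its own reverse complement, the union has 96*2 - 1 = 191 entries, which is where 191^3 comes from.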

TomKellyGenetics commented 1 year ago

The development branch is now updated: https://github.com/minoda-lab/universc/compare/4edff5bf6093a429779179666081f40536529e55...9ba7bbabeaee8d598cb9b1c730e430adaf5724c0

Please try pulling it (from branch dev) and running Split-Seq v1 parameters. I've set up the default whitelist to be compatible with reverse-complement sequences and confirmed the adapters are removed correctly. Barcodes match the input4cellranger files as follows:

Therefore a correct [24 bp barcode][8 bp UMI] has been generated. Note that the UMI is automatically padded with a trailing "AA" to fit the longer UMI expected by Cell Ranger.

Published SPLiT-Seq data is now fully-supported based on my tests. The version is also pre-released as a docker image v.1.2.5.2-dev.

ayyildizd commented 1 year ago

Many thanks Tom really!

I updated the tool and ran it, but I got the same error. I confirmed that R1 and R2 have the same length, but I found that R2 contains reads of unequal length (some reads are still unchanged); see below:

head -n 100 input4cellranger_test_SRR16248559/SRR16248559_S1_L001_R2_001.fastq
@SRR16248559.1 1 length=94
NNCGTCAGGACAAGGAGCTTGCACGATGAGTAAAGCATCACCCTACGACTATGCCTAAATCCACGTGCTTGAGAGGCCAGAGCATTCGAGTACAAG
+SRR16248559.1 1 length=94
##AAAAEAEAAEE6EEEEA/////E//EE/</II//6E6//A/6<<AEA<A/<EAEEEA<<EAEE6EE6<EEE/EE6AA<<A6/6<6<<EEEEEE/
@SRR16248559.2 2 length=94
CCTCTATCCGGATTGCGTGTTCTAATTATGGTAA
+SRR16248559.2 2 length=94
6AEEE<EA6EEEEE6EEEEEEEEEA/A/AEEEII
@SRR16248559.3 3 length=94
TGAAGAGAACAGATTCGTCTGTCAAGGGATATAA
+SRR16248559.3 3 length=94
E<E<AEA/6AEEEEEEEEEEEEEEAAAEEEEEII
@SRR16248559.4 4 length=94
CCGTGAGAACATTGGCACCACTGTCCTCCGAGAA
+SRR16248559.4 4 length=94
AEEEEAE/AEEEEEAE/EAEEEEE6AAEE/E/II
@SRR16248559.5 5 length=94
AGTACAAGTTCACGCACGCATACAGCATAATGAA
+SRR16248559.5 5 length=94
EEAEEEEE/EEAEEEEEEEEEEEAAAAEEEEEII
@SRR16248559.6 6 length=94
GAGTTAGCCGACTGGAACGTATCAAGTGAAGTAA
+SRR16248559.6 6 length=94
EEEEEEEE<EEEEEEEEEEEEEEEAAAEEEEEII
@SRR16248559.7 7 length=94
ACGCTCGACTGAGCCACCGAAGTACCACGAGTAA
+SRR16248559.7 7 length=94
AEEEEEEEAEEAEEEEEEEEEEEEAAAEEEEEII
@SRR16248559.8 8 length=94
CCATCCTCATCCTGTAAGGCTAACTTCTTTATAA
+SRR16248559.8 8 length=94
EAEEEEEE<EEEEEE<EEEEEEAEAAAEEEEEII
@SRR16248559.9 9 length=94
NNTCGTGTAAAGCAGGAAGCGTTAGATCGGTCAAGCATCTTTGTACGACTCCTAATCCATCCACGTGCTTGAGAGGCCAGAGCATTCGACGTATCA
+SRR16248559.9 9 length=94
##AAAEEEEEEEEEEEEE////////A/////II//E/E/////</A/<<6EEEAEEEAE</AAEAE<6<EEEEEEAA<6AA666A6<AEEEEEEE
@SRR16248559.10 10 length=94
AAGGTACACCTCCTGAAGATGTACTTTAGTAAAA
+SRR16248559.10 10 length=94
6AEEAEEA6<EAA/E/EEEAE/EEAAAAEEEEII
@SRR16248559.11 11 length=94
NNAGTAGTCCAAACATCGACGGACGCTGCGGCAAGCAGCCGCCGTCGACTGTGTTCTAATCCACGTGCTTGAGAGGCCAGAGCATTCGCCTCCTGA
+SRR16248559.11 11 length=94
##AAAEEEEEEEAEA6EE////////6/////II//A//E//E///6//6/EAEEEEE/A///6E/EA/</////E<////666/<6/EEEE/AEE
@SRR16248559.12 12 length=94
NNAATATCACCACCTTACTAGGGCGCTCGGGCAAGGATCGGCGGACGACTATCATTCCATCCACGTGCTTGAGAGGCCAGAGCATTCGAAGACGGA
+SRR16248559.12 12 length=94
##AAAEEEAEEA/EE6EE////////<////<II//E/6/EEE/6/<E<<A/EAEEEEEAAEEAE/E66/<EEAEE<<<6AA666A6<//EEE/<A
@SRR16248559.13 13 length=94
NNTGTGAGGGCCAGTTCAGAGGCCAATGTGGCAAGCATCGGCGTACGACTCTGTAGCCATCCACGTGCTTGAGAGGCCAGAGCATTCGGACTAGTA
+SRR16248559.13 13 length=94
##AAAEEAEEE/EEEE/E///<<///6EE//EII/E<E//E/E/66<EA<AAAEAEEE/EA//AEAEA6AEAEEEE<AA<AA6/6A6<AAEEEAA/
@SRR16248559.14 14 length=94
TCTTCACAGACTAGTATCCGTCTATGTTTTCAAA
+SRR16248559.14 14 length=94
AEEEE/6/6EEEEEEEEAEEEAA//AAEEEEAII
@SRR16248559.15 15 length=94
CAAGACTAGAGCTGAAAACAACCATGGTACGCAA
+SRR16248559.15 15 length=94
EEEEEEEA6EEEEEEEEEEEEEEEAAAEEEEEII
@SRR16248559.16 16 length=94
NNTGGAGGGGCCGAAGTAGTGGAGGCTGTGGCAAGCATCGGCGTACGACTCCTCCTGAATCGACGTGCTTGAGAGGCCAGAGCATTCGGATGAATC
+SRR16248559.16 16 length=94
##AAAEEEEEEEEEEEEE6////////E/A</IIE/<AEEEEE/</AE6<<EEEEEEEEEE/<EEAEA6AEEEEEEAA<6AA666A6<EEEEEEEE
@SRR16248559.17 17 length=94
NNTACTGTCAAACTCACCTTGGGGGATGTGTCAAGGATCGTCGTACGACTCATACCAAATCGACGTGCTTGAGAGGCCAGAGCATTCGTGGAACAA
+SRR16248559.17 17 length=94
##66AE6E//EEEEE/E/A//<////6<//</IIE///E//EE/6/AE/66AEEE/EE/AA/6EEAE66AE/EAEE<A</<A666<6<EEEEEEEE
@SRR16248559.18 18 length=94
NNACATTTCAAGTGGTCAGAGTTGGGTGGGTCAAGGAGCGGCGTACGACTAACGTGATATCCACGTGCTTGAGAGGCCAGAGCATTCGAACAACCA
+SRR16248559.18 18 length=94
##AAAEEEEEEEEEEEEE/////////E//6/II6/E/EAAEE/</AEAAAEEEEEEE6EA//EEAEE/AEEEAEEA6<6AA666A6<AAEEAEEE
@SRR16248559.19 19 length=94
NNGGGGGTTGAACTCACCGTGGCGGCTCCGTCAAGGATCTGCGGACGACTCAAGGAGCATCGACGTGCTTGAGAGGCCAGAGCATTCGAACTCACC
+SRR16248559.19 19 length=94
##AAAEEEEEEEEEEAEE///A////E/E/</II/AA/E//EE/</AE/<<AEEEAEEAEEAAEE/EA/AEEEEEE<A<6AA666<6<AEEEEEEE
@SRR16248559.20 20 length=94
NNCTCTTTTTATGCCTAAGAGTAAGATCATGAAAGAATCATCCTACGACTACAGATTCATCCACGTGCTTGAGAGGCCAGAGCATTCGATCCTGTA
+SRR16248559.20 20 length=94
##AAAEEEEE6EEEEEAEA/A/EE/EAEE/AAIIE/6AE//E/A66A/666E<EAEAA/</E/6/EEE6/E</AE/<6A<<A66666<6EEEEEE/
@SRR16248559.21 21 length=94
NNTAGCCCTCCCTCTATCTAGTTTGCTGGGTAAAGGATCGTCGTACGACTACAGATTCATCGACGTGCTTGAGAGGCCAGAGCATTCGAACAACCA
+SRR16248559.21 21 length=94
##AAAEE/EEEEEEEAEE6//A////E///A/II//EAE/AAA/</AE/6A/EAEAEEEAA/AEEEE66AEEAEA6AA<6AA666A6<A<E/EEEE
@SRR16248559.22 22 length=94
AGTGGTCAAAGGTACAATTGAGGAGTAAACGGAA
+SRR16248559.22 22 length=94
AEEEEE<E<EE6EEEEEEEEAEEEAAAEEEEEII
@SRR16248559.23 23 length=94
NNGGTGGGTTCCGTGAGAGTGCCTGCTGGGGCAAGCATCGCCGTACGACTAATCCGTCATCCACGTGCTTGAGAGGCCCTGTCTCTTATACACATC
+SRR16248559.23 23 length=94
##AAAEEAEAEEEEEA6E//<A////AE/A//II//E/E//EE//6<E6/</EEEEEE<666AAE6E<//E<EEEA<6AE/E<EAE<AEEEEEAAA
@SRR16248559.24 24 length=94
NNGCGGCCCCCTCAATGAGGGGTGGGTGGGGGAAGGATCGGCGGACGACTCAACCACAATCGACGTGCTTGAGAGGCCAGAGCATTCGCAAGACTA
+SRR16248559.24 24 length=94
##AAAEEEEEAEEEEEEE//////A/A/////II/<E/E///E/6/AA6<<EEEEEEEE<AE/EE<E//<AEEEEEAA<6A<<<6A6<EEAEEAA/
@SRR16248559.25 25 length=94
NNATAGGCCCCTGTAGCCTGGCTGATTTAGGGAACTACCGCCTGAGGAGTGATGCTGAATCCACGTGGTTGAGAGGCCAGAGCATTCGTGGTGGTA
+SRR16248559.25 25 length=94
##AAAEEEEE/AEE/6E//////E/</6////II////////////6<///6/////A//6//6////////AEE/A/////6666//A</EE/EE

TomKellyGenetics commented 1 year ago

Hi Dilara,

Sorry to hear you still have trouble. Which error are you referring to? The memory issues discussed above should be resolved with the new barcode whitelist (I changed the default name to force it to update existing installs). To clarify, there is no need for you to set --barcodefile anymore; the updated source code should handle it.

Is it the issue with Cell Ranger calling 'chunk_reads', '--reads-per-fastq' ...? You may need to delete the input4cellranger directory and run it again with fresh settings (or change the run ID/output directory name).

Some untruncated lines are expected. Those beginning with 'NN' were not converted because there were mismatches to the adapter sequence (due to sequencing errors and variable adapters). These should be filtered out as invalid barcodes, as they will not match the barcode whitelist. It's not ideal, as some data will not be used, but I think these reads have poorer sequence quality. As the sequence and quality scores are the same length for each read, I would expect Cell Ranger to be able to parse them in the 'chunk reads' step. Is the other FASTQ file valid, and does it have the same number of reads? (Comparing the number of lines with wc -l is a quick sanity check.)
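To see how many reads were converted, it may help to tally sequence lengths in the converted file. In the listing above, converted reads are 34 bases long (24 bp barcode plus the padded UMI), while unconverted reads keep their longer length. The sketch below is a hedged illustration with a made-up function name and synthetic data, assuming uncompressed FASTQ with four lines per record.

```python
import io
from collections import Counter
from itertools import islice


def sequence_length_counts(handle):
    """Count sequence lengths; bases are on line 2 of each 4-line record."""
    # islice picks line indices 1, 5, 9, ... without loading the whole file.
    return Counter(len(line.rstrip("\r\n")) for line in islice(handle, 1, None, 4))


# Synthetic stand-in for a converted R2 file: two short reads, one long read.
records = [("a", "A" * 34), ("b", "N" * 94), ("c", "C" * 34)]
text = "".join(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in records)
counts = sequence_length_counts(io.StringIO(text))
print(counts[34], counts[94])   # 2 1
```

A large share of long reads would suggest that the adapter pattern is failing to match most of the data.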

On the other hand, the adapters seem to be a fixed length in a fixed position (as described for zUMIs, dropEst, and Salmon). The known barcodes match these parameters. It turns out the original problem was caused by syntax errors in sed and by the reverse-complement sequence in R2 from NextSeq on v1.5 chemistry. I think there is no longer a need to match exact adapter sequences and we can safely assume barcodes are in a fixed position (i.e., separated by 30 bp adapters): NN[ 8 bp UMI][ 8 bp BC3]---30 bp----[8 bp BC2]---30 bp---[8 bp BC1].
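For illustration, the fixed-position layout can be expressed as a single regular expression. This Python sketch mirrors the layout written above and is not the sed code used by launch_universc.sh; the function name, the BC1+BC2+BC3+UMI output order, and the synthetic read are assumptions made for the example.

```python
import re

# 2 leading bases, 8 bp UMI, 8 bp BC3, 30 bp linker, 8 bp BC2,
# 30 bp linker, 8 bp BC1 (94 bases in total).
LAYOUT = re.compile(r"^..(.{8})(.{8}).{30}(.{8}).{30}(.{8})")


def extract_barcode_umi(read):
    """Return BC1+BC2+BC3+UMI, or None if the read is shorter than 94 bases."""
    match = LAYOUT.match(read)
    if match is None:
        return None
    umi, bc3, bc2, bc1 = match.groups()
    return bc1 + bc2 + bc3 + umi


# Synthetic read: placeholder linkers of 30 N bases, not real adapter sequence.
umi, bc3, bc2, bc1 = "CGTCAGGA", "ACAGTGGT", "ACTTGATG", "ATCACGTT"
read = "NN" + umi + bc3 + "N" * 30 + bc2 + "N" * 30 + bc1
print(extract_barcode_umi(read))        # ATCACGTTACTTGATGACAGTGGTCGTCAGGA
print(extract_barcode_umi(read[:60]))   # None
```

Because no adapter bases are checked, every read of sufficient length is converted; reads with shifted or damaged adapters then rely on the whitelist to be rejected.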

It is a very minor change in the source to support this but it would break automated detection of the reverse complement. Ideally, I'd prefer it wasn't necessary to configure the run differently for HiSeq or NovaSeq data.

To force it to run for your data I'll provide a patch to update the code.

diff --git a/launch_universc.sh b/launch_universc.sh
index f624f7d..5c14022 100755
--- a/launch_universc.sh
+++ b/launch_universc.sh
@@ -3518,8 +3518,8 @@ else
             mv ${crIN}/.temp $convFile
             #remove phase blocks and linkers (reverse complement if R2 matched)
             sed -E '
-                /.*?(.{8})(.{8})..G[GT]..G[AC]TG.G[GT]..........[GT]A[AC]GACT(.{8})AT[CT]CACGTGCTTGAG........GCATTCG(.{8}).*/ {
-                s/.*?(.{8})(.{8})..G[GT]..G[AC]TG.G[GT]..........[GT]A[AC]GACT(.{8})AT[CT]CACGTGCTTGAG........GCATTCG(.{8}).*/\4\3\2\1/g
+                /[ATCGN][ATGCGN](.{8})(.{8}).{30}(.{8}).{30}(.{8}).*/ {
+                s/..(.{8})(.{8}).{30}(.{8}).{30}(.{8}).*/\4\3\2\1/g
                 n
                 n
                 s/..(.{8})(.{8}).{30}(.{8}).{30}(.{8}).*/\4\3\2\1/g

Save the above in a file called 'patch.txt' and run

git apply patch.txt
git add launch_universc.sh
git commit -m "fixed length split-seq adapters for reverse complement"

ayyildizd commented 1 year ago

Hi Tom,

Unfortunately, the pipeline still fails at this step after all these edits:

[runtime] (failed)          ID.test_SRR16248559.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY

[error] You selected chemistry 'SC3Pv2', which expects the cell barcode sequence in read R1.
In input data, an extremely low rate of correct barcodes was observed for this chemistry (0.00 %).
Please check your input data and chemistry selection. Note: manual chemistry detection is not required in most cases.
Input: {'lanes': [u'1'], 'sample_names': [u'SRR16248559'], 'sample_indices': None, 'fastq_mode': u'ILMN_BCL2FASTQ', 'read_path': u'{basedir}/input4cellranger_test_SRR16248559', 'interleaved': False}

Shutting down.

In the end I tried these samples with STARsolo and it worked. Since I am short on time, I decided to drop this pipeline and move on with STARsolo. Thank you very much for all your help so far! I am closing the issue.