Issue with droptag/dropest on SPLiT-seq data, no CBs recognized

I'm trying to use dropEst to process SPLiT-seq data, specifically using SRR6750042 from GSE110823 as a test run. Note that I am able to successfully run dropEst on the example data (inDrop) to completion.

If I run dropTag specifying the FASTQ files in the order stated in the documentation (gene reads, barcode reads), it recognizes 0 reads and prints an empty output file:

Run: 09:46:18.
 (0 reads)
Total 1 short reads (inf%)
    Trim stat:
[ (no trim)  (RC)  (PolyA)  (-A) ]
[0 0 0 0 ]
[-nan -nan -nan -nan ] %

Reading completed: 09:46:18.
Writing the rest lines
All done: 09:46:18.

If I reverse the order (barcode reads, gene reads), it does produce an output:

Run: 09:45:53.
 (250000 reads)
Total 0 short reads (0.000000%)
    Trim stat:
[ (no trim)  (RC)  (PolyA)  (-A) ]
[0 0 0 0 ]
[-nan -nan -nan -nan ] %

Reading completed: 09:45:54.
Writing the rest lines
All done: 09:45:54.

however the output FASTQ doesn't appear to be properly tagged, as the headers are in the format @EQNT1, compared to what I saw for the inDrop example run in the demo, which is in the format @WETA1!GGGGGGGGGGGGCGGA#CGGGGG.

Downstream of this, dropEst runs with errors on every read, saying "ERROR: unable to parse out UMI in..." along with this at the end:

no valid CBs found

Start merge: 16:35:57.
Merge initialized: 16:35:57.
Total 0 cells merged
Total 0 cells excluded
Merge finished: 16:35:57.
Merge UMIs with N's: 16:35:57.
0 cells processed. Merged 0 UMIs from 0 cells.
UMI merge finished: 16:35:57.
0 cells are considered as real.

0 CBs with more than 20 genes, which have UMIs of the requested type.
no valid CBs found

Done: 16:35:57.
WARNING: filtered cells are empty. Probably, filtration threshold is too strict or you forgot to run 'merge_and_filter'

0 genes
top genes:

So it seems like it's not finding the cell barcode, likely because they're not getting tagged in the FASTQ properly to begin with.

Here's how I ran dropTag: droptag -c /path/to/dropEst/configs/split_seq.xml -S -s -l /path/test.tagged -n /path/test.tagged -p 8 /path/file_2.fq /path/file_1.fq

Here's how I ran dropEst, after aligning with STAR: dropest -w -M -G 20 -g /path/RefSeq_mm10_refFlat_021419.gtf -c /path/to/dropEst/configs/split_seq.xml -l /path/test -o /path/test_dropest /path/test_STARAligned.out.bam

Looking at the configuration file, the designated bases for locating the barcodes seem to be correct. I've attached here a FASTQC sequence content plot of my barcode reads, which seem to line up with (0-based) barcode starts of 10, 48, and 86 like in the configuration file.

download

This also matches what I can count from this page: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html

On a related note, the split_seq barcodes file might be inappropriately looking for the reverse complement of the actual barcodes. If I grep for the barcodes listed on that teichlab link above, I find way more matches in my FASTQs at the expected positions than I do the sequences listed in the split_seq barcodes file. However, changing these sequences and re-running dropTag does not seem to fix my issue.

Apologies for the long post, but there seem to be 3 inter-related issues at hand: 1) The documentation needs to be corrected to list "barcode reads, gene reads" 2) The split_seq barcodes file might need to be switched to search for the reverse complement of those supplied 3) Some unknown issue is preventing proper tagging of SPLiT-seq FASTQ files

Please let me know if there's anything else I can give you to help diagnose the issue.

kharchenkolab / dropEst

Issue with droptag/dropest on SPLiT-seq data, no CBs recognized #74