Open jeremymsimon opened 5 years ago
I had a similar error.
According to issue #59, if we run droptag with the -s option, the cell barcode (CB) information is written to ***.fastq.gz.tagged.params.gz.
So we need to add the option --read-params ***.fastq.gz.tagged.params.gz to the dropest command.
In my case, dropest then finished analyzing the SPLiT-seq data.
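As a rough sketch of that fix, the helper below prints the dropest command from the original post with --read-params added. The function name dropest_cmd and the params path passed to it are placeholders (assumptions, not taken from the thread); the real file is whichever *.tagged.params.gz droptag -s wrote for your run. It only prints the command so the paths can be checked before anything is executed.

```shell
#!/bin/sh
# Sketch only: every path below is a placeholder copied from the original post.

# Print the dropest command line with --read-params added.
# $1 is the params file written by `droptag -s` (hypothetical name here;
# use whatever *.tagged.params.gz file your droptag run produced).
dropest_cmd() {
    params_file=$1
    echo dropest -w -M -G 20 \
        -g /path/RefSeq_mm10_refFlat_021419.gtf \
        -c /path/to/dropEst/configs/split_seq.xml \
        --read-params "$params_file" \
        -l /path/test -o /path/test_dropest \
        /path/test_STARAligned.out.bam
}

# Dry run: print the command for inspection before running it by hand.
dropest_cmd /path/test.tagged.params.gz
```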
I'm trying to use dropEst to process SPLiT-seq data, specifically using SRR6750042 from GSE110823 as a test run. Note that I am able to successfully run dropEst on the example data (inDrop) to completion.
If I run dropTag specifying the FASTQ files in the order stated in the documentation (gene reads, barcode reads), it recognizes 0 reads and prints an empty output file:
If I reverse the order (barcode reads, gene reads), it does produce an output:
however, the output FASTQ doesn't appear to be properly tagged: the headers are in the format @EQNT1, whereas the inDrop example run in the demo produced headers in the format @WETA1!GGGGGGGGGGGGCGGA#CGGGGG.
Downstream of this, dropEst reports an error on every read, saying "ERROR: unable to parse out UMI in...", along with this at the end:
So it seems like it's not finding the cell barcode, likely because they're not getting tagged in the FASTQ properly to begin with.
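One quick way to confirm whether a droptag output is tagged is to count the headers that carry the "!barcode#UMI" suffix seen in the inDrop demo. The sketch below assumes standard 4-line FASTQ records and that the suffix contains only A, C, G, T, or N; the function name count_tagged_headers is made up for this example, and the sequence and quality lines in the toy records are invented.

```shell
#!/bin/sh
# Count FASTQ headers that carry a "!<barcode>#<UMI>" suffix, the layout seen
# in the inDrop demo output (e.g. @WETA1!GGGGGGGGGGGGCGGA#CGGGGG).
# Reads FASTQ text on stdin; assumes 4 lines per record.
count_tagged_headers() {
    awk 'NR % 4 == 1 {
             total++
             if ($0 ~ /^@[^!]+![ACGTN]+#[ACGTN]+$/) tagged++
         }
         END { printf "%d/%d headers tagged\n", tagged, total }'
}

# Toy records built from the two header formats quoted above:
# one tagged, one untagged.
printf '%s\n' \
    '@WETA1!GGGGGGGGGGGGCGGA#CGGGGG' 'ACGT' '+' 'IIII' \
    '@EQNT1' 'ACGT' '+' 'IIII' | count_tagged_headers
```

On a real run, the same function could be fed with something like `gzip -dc <droptag output>.fastq.gz | count_tagged_headers`, substituting the actual output file name.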
Here's how I ran dropTag:
droptag -c /path/to/dropEst/configs/split_seq.xml -S -s -l /path/test.tagged -n /path/test.tagged -p 8 /path/file_2.fq /path/file_1.fq
Here's how I ran dropEst, after aligning with STAR:
dropest -w -M -G 20 -g /path/RefSeq_mm10_refFlat_021419.gtf -c /path/to/dropEst/configs/split_seq.xml -l /path/test -o /path/test_dropest /path/test_STARAligned.out.bam
Looking at the configuration file, the designated bases for locating the barcodes seem to be correct. I've attached here a FASTQC sequence content plot of my barcode reads, which seem to line up with (0-based) barcode starts of 10, 48, and 86 like in the configuration file.
This also matches what I can count from this page: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
On a related note, the split_seq barcodes file might be inappropriately looking for the reverse complement of the actual barcodes. If I grep for the barcodes listed on the teichlab page linked above, I find far more matches in my FASTQs at the expected positions than I do for the sequences listed in the split_seq barcodes file. However, changing these sequences and re-running dropTag does not seem to fix my issue.
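The orientation check described above can be made position-specific rather than relying on a plain grep. The sketch below is one way to do it, not the method used in the post: revcomp reverse-complements a sequence, and count_at_offset counts reads that have a given barcode at a given 0-based offset (for example 10, 48, or 86). Both function names, the barcode AACGTGAT, and the toy read are illustrative assumptions; substitute entries from the barcodes file and your own barcode-read FASTQ.

```shell
#!/bin/sh
# Reverse-complement a DNA sequence (uppercase A/C/G/T; N is left as N).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""
               for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
               print out }' |
        tr 'ACGT' 'TGCA'
}

# Count reads on stdin (FASTQ, 4 lines per record) whose sequence line has
# barcode $1 starting at 0-based offset $2.
count_at_offset() {
    awk -v bc="$1" -v off="$2" \
        'NR % 4 == 2 && substr($0, off + 1, length(bc)) == bc { n++ }
         END { print n + 0 }'
}

bc=AACGTGAT                       # example barcode, for illustration only
read_seq=NNNNNNNNNNATCACGTTNNNN   # toy read with revcomp(bc) at offset 10

# Compare how often each orientation appears at the expected offset.
for candidate in "$bc" "$(revcomp "$bc")"; do
    n=$(printf '%s\n' '@r1' "$read_seq" '+' 'IIIIIIIIIIIIIIIIIIIIII' |
        count_at_offset "$candidate" 10)
    echo "$candidate $n"
done
```

If one orientation consistently dominates at all three offsets across many barcodes, that would suggest which orientation the barcodes file should list.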
Apologies for the long post, but there seem to be 3 inter-related issues at hand:
1) The documentation needs to be corrected to list "barcode reads, gene reads"
2) The split_seq barcodes file might need to be switched to search for the reverse complement of those supplied
3) Some unknown issue is preventing proper tagging of SPLiT-seq FASTQ files
Please let me know if there's anything else I can give you to help diagnose the issue.