pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

Support for SMARTSEQ3 single end data. #216

Closed hakobyansiras closed 8 months ago

hakobyansiras commented 10 months ago

Describe the issue I have SMARTSEQ3 single-end data and want to use kb count. However, the single-end mode is not supported for SMARTSEQ3 technology. Is there any solution or workaround for this?

What is the exact command that was run?

kb count -i kallisto_files/cDNA_introns.idx -g kallisto_files/tr2g.tsv -x SMARTSEQ3 -o kallisto_counts -t 24 -m 64G -w /data/bam_to_loom/BCwhitelist.txt --parity single --fragment-l 53 --verbose \
-c1 kallisto_files/cDNA_tx_to_capture.txt -c2 kallisto_files/introns_tx_to_capture.txt --workflow lamanno \
fastq_files/index1_5pbody_filtered.fastq.gz \
fastq_files/index2_5pbody_filtered.fastq.gz \
fastq_files/read1_5pbody_filtered.fastq.gz

Command output (with --verbose flag)

[2023-08-22 18:56:35,749]   DEBUG [main] Printing verbose output
[2023-08-22 18:56:38,014]   DEBUG [main] kallisto binary located at /home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto
[2023-08-22 18:56:38,014]   DEBUG [main] bustools binary located at /home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/bustools/bustools
[2023-08-22 18:56:38,016]   DEBUG [main] Creating `kallisto_counts/tmp` directory
[2023-08-22 18:56:38,018]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='kallisto_files/cDNA_introns.idx', g='kallisto_files/tr2g.tsv', x='SMARTSEQ3', o='kallisto_counts', w='/data/bam_to_loom/BCwhitelist.txt', t=24, m='64G', strand=None, workflow='lamanno', em=False, umi_gene=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1='kallisto_files/cDNA_tx_to_capture.txt', c2='kallisto_files/introns_tx_to_capture.txt', overwrite=False, dry_run=False, loom=False, h5ad=False, cellranger=False, gene_names=False, report=False, no_inspect=False, kallisto='/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, parity='single', fragment_l=53, fragment_s=None, fastqs=['fastq_files/index1_5pbody_filtered.fastq.gz', 'fastq_files/index2_5pbody_filtered.fastq.gz', 'fastq_files/read1_5pbody_filtered.fastq.gz'])
usage: kb [-h] [--list] <CMD> ...
kb: error: Argument `parity` is not supported for technology `SMARTSEQ3`.
[2023-08-22 18:56:38,021]   DEBUG [main] Removing `kallisto_counts/tmp` directory
Yenaled commented 10 months ago

Hmm, all smartseq3 data that I've used has been paired-end.

For single-end, you might want to consider just supplying the read1 file twice (to "pretend" that it's paired-end) since kallisto mapping is based off k-mers so supplying a sequence multiple times across files won't affect mapping.

Alternately, you can extract all the UMI reads into one FASTQ file, process it like you would process 10X-type, and then separately process the internal reads.

hakobyansiras commented 10 months ago

Providing a read1 file twice worked. The job finished successfully. Will check the output in the next few days. Thanks!

hakobyansiras commented 9 months ago

I assume something is going wrong when I provide the same fastq file twice. My initial fastq file contains almost 250M reads. After running kb count I am getting two types of unfiltered counts: umi and internal

spliced_umi: 32722 unspliced_umi: 244190 total: 276912

spliced_internal: 59795952 unspliced_internal: 20209736 total: 80005688

Then I tried to only keep UMI reads and see how it works. After filtering the fastq files I got 23M reads. Strangely enough, I got more internal counts than umi counts.

spliced_umi: 6760 unspliced_umi: 41831 total: 48591

spliced_internal: 115228 unspliced_internal: 36516 total: 151744

Here is the code that I used. The same code was used for filtered fastq files. kb count -i kallisto_files/cDNA_introns.idx -g kallisto_files/tr2g.tsv -x SMARTSEQ3 -o kallisto_all_counts -t 24 -m 64G -w /data/bam_to_loom/BCwhitelist.txt --verbose \ -c1 kallisto_files/cDNA_tx_to_capture.txt -c2 kallisto_files/introns_tx_to_capture.txt --workflow lamanno \ /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index1.fastq.gz \ /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index2.fastq.gz \ /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz \ /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz

To wrap up, I have two general questions:

  1. How are UMI and internal reads determined?
  2. Why are a lot of reads lost?
Yenaled commented 9 months ago

The UMI vs. internal reads are determined by a tag sequence present in the first 11-bp of the read (the tag is ATTGCGCAATG). If present, the read is determined to be "UMI". Otherwise, the read is determined to be "internal".

Maybe in your case, a different "tag" is present? You can probably run FASTQC to see what's overrepresented in the first 11 or so bp's of a read.

Also, can you copy+paste your run_info.json file here?

hakobyansiras commented 9 months ago

Here is the run_info. Sorry for the confusion initial fastq file was 294M reads.

{ "n_targets": 0, "n_bootstraps": 0, "n_processed": 294457441, "n_pseudoaligned": 118330099, "n_unique": 16529513, "p_pseudoaligned": 40.2, "p_unique": 5.6, "kallisto_version": "0.48.0", "index_version": 0, "start_time": "Fri Aug 25 14:46:46 2023", "call": "/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i kallisto_files/cDNA_introns.idx -o kallisto_all_counts -x SMARTSEQ3 -t 24 --paired /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index1.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index2.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz" }

Yenaled commented 9 months ago

Hmm, ok, I have two more suggestions that should resolve the problem:

  1. Check whether your tag is actually ATTGCGCAATG. If you can run FASTQC and show the "overrepresented sequences" result, that would be helpful.

  2. Finally, kb-python is just a wrapper that calls kallisto; you can also just try running kallisto itself out of the box:

/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i kallisto_files/cDNA_introns.idx -o kallisto_all_counts -t 24 -x 0,0,0,1,0,0:2,0,19:2,22,0 --tag=ATTGCGCAATG /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index1.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index2.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz

And see if that makes a difference in your output (check the run_info.json in your kallisto_all_counts directory for that).

hakobyansiras commented 9 months ago

Here is the run info from the command you provided.

{ "n_targets": 0, "n_bootstraps": 0, "n_processed": 294457441, "n_pseudoaligned": 229000103, "n_unique": 33460781, "p_pseudoaligned": 77.8, "p_unique": 11.4, "kallisto_version": "0.48.0", "index_version": 0, "start_time": "Mon Sep 18 19:32:06 2023", "call": "/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i kallisto_files/cDNA_introns.idx -o kallisto_all_counts_bustools -t 24 -x 0,0,0,1,0,0:2,0,19:2,22,0 --tag=ATTGCGCAATG /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index1.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index2.fastq.gz /data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz" }

Here are the numbers of matching reads to ATTGCGCAATG tag in the first 11-bp. 18634289 out of 22994185 for UMI filtered reads 119078244 out of 294457441 for non-filtered reads

Numbers were counted with this script. zcat Smartseq3.HEK.fwdprimer.read1.fastq.gz | awk '{if(substr($0, 1, 11) == "ATTGCGCAATG") print $0}' | wc -l

Yenaled commented 9 months ago

Ah hah, so now we get a much higher mapping rate (77.8%) which is what we'd expect! And it seems that the tag is correct since over 40% of your total reads match that ATTGCGCAATG tag!

Cool, so now that you've generated the new output into kallisto_all_counts_bustools -- can you try running the kb count command specifying that same output file (i.e. the "exact" same kb count command you ran in your yesterday, supplying the fastq file twice)? I believe that kb count will detect that the output file containing all the mappings (i.e. kallisto_all_counts_bustools/output.bus) will already exist and proceed as normal through the rest of the smart-seq3 protocol. And then you should be able to get the results that you need.

Basically, I realized that the the "paired-end" hack doesn't work (since the tag+UMIs will still exist in R2), so we need to run kallisto manually like in the previous post. And then, now, we can run our manual kallisto run through kb count.

hakobyansiras commented 9 months ago

Running kb count with the kallisto_all_counts_bustools output folder didn't work. All the alignment files are updated and run_info.json is the same as I reported previously. Probably will try to go step by step with Bustools and see if it will help.

Yenaled commented 9 months ago

Hmm ok, interesting. Usually kb count doesn't re-run the alignment if output.bus exists in the output folder.

And yeah, you can check the bustools commands that are run by using kb count --dry-run.

I'm going to try a few things on my end and see if there's an easier way to do things.

Yenaled commented 9 months ago

OK, I think the four steps should work:

  1. Run kb count (supplying the R1 fastq file twice) to get the kallisto_all_counts_bustools output
  2. Run the "kallisto bus" command to a separate output folder (let's call it kallisto_all_counts_bustools2)
  3. Transfer only the output.bus file from kallisto_all_counts_bustools2 to kallisto_all_counts_bustools
  4. Run kb count again (exactly as done in step 1); make sure kb count is NOT supplied with the --overwrite option

I think that should work (it works on my end).

Sorry it's a bit hack-y -- when I developed this workflow, I didn't expect there to ever be single-end smart-seq3 since the original paper only described paired-end reads for that technology :/ Nonetheless, everything is in the kallisto/bustools/kb-python software to get it to work (it just requires some workarounds here and there).

hakobyansiras commented 9 months ago

I am using this tutorial for the analysis. https://bustools.github.io/BUS_notebooks_R/velocity.html

Instead of running kb count I have used kallisto bus command you provided and the next steps from the tutorial.

Here is the code:

/home/ubuntu/anaconda3/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i kallisto_files/cDNA_introns.idx -o kallisto_all_counts_bustools \
-t 24 -x 0,0,0,1,0,0:2,0,19:2,22,0 --tag=ATTGCGCAATG \
/data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index1.fastq.gz \
/data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.index2.fastq.gz \
/data/bam_to_loom/fastq_files/Smartseq3.HEK.fwdprimer.read1.fastq.gz 

/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools correct -w /data/bam_to_loom/BCwhitelist.txt -p output.bus | \
/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools sort -o output.correct.sort.bus -t4 -

/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools capture -s -x -o spliced.bus -c /data/kallisto_spliced_unspliced/kallisto_files/introns_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools capture -s -x -o unspliced.bus -c /data/kallisto_spliced_unspliced/kallisto_files/cDNA_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools count -o unspliced -g /data/kallisto_spliced_unspliced/kallisto_files/tr2g.tsv -e matrix.ec -t transcripts.txt --genecounts unspliced.bus
/data/kallisto_spliced_unspliced/soft/bustools/build/src/bustools count -o spliced -g /data/kallisto_spliced_unspliced/kallisto_files/tr2g.tsv -e matrix.ec -t transcripts.txt --genecounts spliced.bus

Getting these counts after analysis. While with the same input data velocyto pipeline generates a loom file with 14039100 counts.

Spliced: 2028405 Unspliced: 1387404 Total: 3415809

Yenaled commented 9 months ago

That exact workflow won't work because it's smart-seq3 data which has both UMIs and internal reads in the BUS file that need to be separated. See the kb count commands (call kb count --dry-run if you want to see the commands that kb count uses for smart-seq3 data) like above and use those.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days