pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
154 stars 23 forks source link

Issue with specifiying a new technology: no reads pseudoaligned and empty bus file #59

Closed mariusmessemaker closed 4 years ago

mariusmessemaker commented 4 years ago

Thank you so much for your great tool, I really like the possibility of specifying new technologies, RNA-velocity, and the upcoming feature barcoding option!

Describe the issue I tried to specify a new technology with a library index BC in R3 from 1-8 bp, a cell BC in R2 from 1-8 bp, a cell BC in R4 from 1-8 bp, a UMI in R3 from 9-14 bp, and the biological read in R1. However, I get 0 pseudo alignments (I can map the reads successfully with another mapper), and an empty bus file (see below for file sizes in kb count output directory).

What is the exact command that was run?

refdir=/n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020

kb count --verbose --h5ad -i ${refdir}/index.idx -g ${refdir}/t2g.txt -x 2,0,8,1,0,8,3,0,8:3,8,14:0,0,0 -o ./pool1_7 \
-c1 ${refdir}/cdna_t2c.txt -c2 ${refdir}/intron_t2c.txt --lamanno --filter bustools -t 12 -m 100000 \
./SUB07818pool1_S1_R1_001.fastq.gz \
./SUB07818pool1_S1_R2_001.fastq.gz \
./SUB07818pool1_S1_R3_001.fastq.gz \
./SUB07818pool1_S1_R4_001.fastq.gz

Command output of kb count (with --verbose flag)

[2020-02-17 20:58:51,502]   DEBUG Printing verbose output
[2020-02-17 20:58:51,502]   DEBUG Creating tmp directory
[2020-02-17 20:58:51,505]   DEBUG Namespace(c1='/n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020/cdna_t2c.txt', c2='/n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020/intron_t2c.txt', command='count', fastqs=['./SUB07818pool1_S1_R1_001.fastq.gz', './SUB07818pool1_S1_R2_001.fastq.gz', './SUB07818pool1_S1_R3_001.fastq.gz', './SUB07818pool1_S1_R4_001.fastq.gz'], filter='bustools', g='/n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020/t2g.txt', h5ad=True, i='/n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020/index.idx', keep_tmp=False, lamanno=True, list=False, loom=False, m='100000', nucleus=False, o='./pool1_7', overwrite=False, t=12, verbose=True, w=None, x='2,0,8,1,0,8,3,0,8:3,8,14:0,0,0')
[2020-02-17 20:58:51,505]    INFO Generating BUS file from
[2020-02-17 20:58:51,505]    INFO         ./SUB07818pool1_S1_R1_001.fastq.gz
[2020-02-17 20:58:51,505]    INFO         ./SUB07818pool1_S1_R2_001.fastq.gz
[2020-02-17 20:58:51,505]    INFO         ./SUB07818pool1_S1_R3_001.fastq.gz
[2020-02-17 20:58:51,505]    INFO         ./SUB07818pool1_S1_R4_001.fastq.gz
[2020-02-17 20:58:51,509]   DEBUG /home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i /n/data1/mgh/csb/pittet/references/mouse_mm10_99_kallisto_updateFeb2020/index.idx -o ./pool1_7 -x 2,0,8,1,0,8,3,0,8:3,8,14:0,0,0 -t 12 ./SUB07818pool1_S1_R1_001.fastq.gz ./SUB07818pool1_S1_R2_001.fastq.gz ./SUB07818pool1_S1_R3_001.fastq.gz ./SUB07818pool1_S1_R4_001.fastq.gz
[2020-02-17 21:28:20,686]   DEBUG
[2020-02-17 21:28:20,686]   DEBUG [index] k-mer length: 31
[2020-02-17 21:28:20,686]   DEBUG [index] number of targets: 791,598
[2020-02-17 21:28:20,686]   DEBUG [index] number of k-mers: 1,112,499,218
[2020-02-17 21:28:20,686]   DEBUG [index] number of equivalence classes: 5,490,023
[2020-02-17 21:28:20,686]   DEBUG [quant] will process sample 1: ./SUB07818pool1_S1_R1_001.fastq.gz
[2020-02-17 21:28:20,687]   DEBUG ./SUB07818pool1_S1_R2_001.fastq.gz
[2020-02-17 21:28:20,687]   DEBUG ./SUB07818pool1_S1_R3_001.fastq.gz
[2020-02-17 21:28:20,687]   DEBUG ./SUB07818pool1_S1_R4_001.fastq.gz
[2020-02-17 21:28:20,687]   DEBUG [quant] finding pseudoalignments for the reads ... done
[2020-02-17 21:28:20,687]   DEBUG [quant] processed 867,926,292 reads, 0 reads pseudoaligned
[2020-02-17 21:28:20,687]   DEBUG [~warn] no reads pseudoaligned.
[2020-02-17 21:28:20,687]    INFO Sorting BUS file ./pool1_7/output.bus to tmp/output.s.bus
[2020-02-17 21:28:20,693]   DEBUG /home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/bins/linux/bustools/bustools sort -o tmp/output.s.bus -T tmp -t 12 -m 100000 ./pool1_7/output.bus
[2020-02-17 21:28:20,753]   DEBUG Warning: low number supplied for maximum memory, defaulting to 64Mb
[2020-02-17 21:28:20,753]   DEBUG Read in 0 BUS records
[2020-02-17 21:28:20,753]   ERROR An exception occurred
Traceback (most recent call last):
  File "/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/main.py", line 483, in main
    COMMAND_TO_FUNCTION[args.command](args)
  File "/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/main.py", line 135, in parse_count
    nucleus=args.nucleus,
  File "/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/count.py", line 689, in count_velocity
    memory=memory
  File "/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/count.py", line 97, in bustools_sort
    run_executable(command)
  File "/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/utils.py", line 147, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/home/mm884/py37-kb/lib/python3.7/site-packages/kb_python/bins/linux/bustools/bustools sort -o tmp/output.s.bus -T tmp -t 12 -m 100000 ./pool1_7/output.bus' died with <Signals.SIGSEGV: 11>.
[2020-02-17 21:28:20,781]   DEBUG Removing tmp directory

The command output of kb ref:

[2020-02-17 01:48:55,434]    INFO Decompressing /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa to tmp
[2020-02-17 01:48:55,435]    INFO Sorting /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa
[2020-02-17 01:59:04,404]    INFO Decompressing /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/Mus_musculus.GRCm38.99.gtf to tmp
[2020-02-17 01:59:04,404]    INFO Sorting /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/Mus_musculus.GRCm38.99.gtf
[2020-02-17 02:00:11,059]    INFO Splitting genome into cDNA at /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/cdna.fa
[2020-02-17 02:01:53,486]    INFO Creating cDNA transcripts-to-capture at /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/cdna_t2c.txt
[2020-02-17 02:01:55,118]    INFO Splitting genome into introns at /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/intron.fa
[2020-02-17 02:07:16,038]    INFO Creating intron transcripts-to-capture at /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/cdna_t2c.txt
[2020-02-17 02:07:27,094]    INFO Concatenating cDNA and intron FASTAs
[2020-02-17 02:08:02,802]    INFO Creating transcript-to-gene mapping at /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/t2g.txt
[2020-02-17 02:08:16,075]    INFO Indexing to /n/data1/mgh/csb/pittet/references/mouse_mm10_99_star_updateFeb2020/index.idx

The file sizes in the kb count output directory:

-rw-rw-r-- 1 mm884 mm884 1.5G Feb 17 21:28 matrix.ec
-rw-rw-r-- 1 mm884 mm884 1.5K Feb 17 21:28 run_info.json
-rw-rw-r-- 1 mm884 mm884  19M Feb 17 21:28 transcripts.txt
-rw-rw-r-- 1 mm884 mm884   49 Feb 17 21:27 output.bus

And the file sizes in the kb ref output directory:

-rw-rw-r-- 1 mm884 pittet  26G Feb 17 03:54 index.idx
-rw-rw-r-- 1 mm884 pittet  41M Feb 17 02:08 t2g.txt
-rw-rw-r-- 1 mm884 pittet  16M Feb 17 02:07 intron_t2c.txt
-rw-rw-r-- 1 mm884 pittet 3.7G Feb 17 02:07 intron.fa
-rw-rw-r-- 1 mm884 pittet 2.9M Feb 17 02:01 cdna_t2c.txt
-rw-rw-r-- 1 mm884 pittet 247M Feb 17 02:01 cdna.fa

The versions that I used: python 3.7.4, kb-python 0.24.4, kallisto 0.46.1, and bustools 0.39.3

I run the command on a hpc cluster and allocated 12 cores and 100G memory.

Let me know if you need more information such as the fasta files. Thank you so much in advance for your help!

sbooeshaghi commented 4 years ago

A few things:

  1. I would do -m 100G for the memory, this is how the memory string is read
  2. This looks like inDropsv3 data format, is it? If so, the newest release of kallisto supports this, and we will be releasing kb-python soon that will support this structure.

For clarity this is the structure I am referring to: R1: Biological Read R2: Cell BC1 (1-8bp) R3: Index (1-8bp) UMI (9-14bp) R4: Cell BC2 (1-8bp)

mariusmessemaker commented 4 years ago

Thank you for your quick response! I will try to rerun with -m 100G; I will let you know whether that solves the issue. Does the newest version of kallisto have a inDropsv3 structure that also accepts a R3 library index file? If so, I will try to run kallisto. But it should be possible right to specify a new technology string with the library index of inDrops in R3 as a BC because in essence a library index is just a BC (that together with BC1 and BC2 specifies unique cells)?

In addition, should the UMI in your inDrops structure not be in R4? I use the following inDrops version 3 structure, which is the same as used in: https://github.com/indrops/indrops (I checked the reads manually): R1: Biological read R2: Cell BC1 (1-8 bp) R3: Index (1-8 bp) R4: Cell BC2 (1-8 bp) and UMI (9-14 bp).

Again, thank you so much for creating kallisto bus, and kb-python.

sbooeshaghi commented 4 years ago

The index file is only necessary if you wish to demultiplex samples that were pooled on the same lane, using the samplesheet.csv file that you create (if you used Illumina short read sequencing), see Illumina documentation. kallisto does not use the sample index.

Small side note, I made an error in the comment above, the UMI is in R4 and the structure is:

R1: Biological read
R2: Cell BC1 (1-8 bp)
R3: Index (1-8 bp)
R4: Cell BC2 (1-8 bp) and UMI (9-14 bp).

To process your reads lets look at main.cpp in the kallisto repo. We see the following lines:

      } else if (opt.technology == "INDROPSV3") {
        busopt.nfiles = 3;
        busopt.seq.push_back(BUSOptionSubstr(2,0,0));
        busopt.umi = BUSOptionSubstr(1,8,14);
        busopt.bc.push_back(BUSOptionSubstr(0,0,8));
        busopt.bc.push_back(BUSOptionSubstr(1,0,8));

In plain english: kallisto expects 3 files. Given how you have defined what R1,R2,R3,R4 mean, we note the that first half of the cell barcode comes from R2, the second half of the cell barcode comes from R4, the UMI comes from R4 and the biological read is in R1. So the command would be:

kallisto bus -i index.idx -o ./output -x inDropsv3 R2.fastq.gz R4.fastq.gz R1.fastq.gz

Where R2 is the 0th file, R4 is the 1st file and R1 is the 2nd file (0-indexed).

This works with the current release of kallisto and will be added to kb-python soon.

mariusmessemaker commented 4 years ago

Thank you for your reaction. Yes, I understood that I could not use technology = "INDROPSV3" because this kb technology specification expects 3 files that together contain BC1, BC2, UMI, and Biological read. Therefore, I used the opportunity of kb to specifiy a new technology myself that has 3 BC of which one BC is the library index because in essence a library index is just a BC that together with BC1 and BC2 specifies an unique cell. The issue that I raised here is that the specification of a new technology with 3 BCs, a UMI, and a Biological read does not work, while it should be possible to do this according to the kallisto bus documentation? I also tried to run with -m 100G and I get a different error <Signals.SIGKILL: 9> instead of <Signals.SIGSEGV: 11> but still 0 reads pseudo aligned and an empty bus file (which I think causes the error in the bustools sort command).

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days