pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

kallisto 0.50.0 can't index #411

Open dariolez opened 1 year ago

dariolez commented 1 year ago

I'm trying to perform RNA velocity with kallisto, bustools and their wrapper kb-python following the instructions in this R Notebook. But I'm unable to generate an index with kallisto 0.50.0.

Summary of what I tried

1) kallisto 0.50.0 from bioconda: it fails with Illegal instruction (core dumped)
2) kallisto 0.50.0 binary from GitHub: it overflows 125GB of RAM + 15GB of swap and is killed by the system (died with <Signals.SIGKILL: 9>)
3) kallisto 0.50.0 compiled from source: it again overflows the RAM and ends with died with <Signals.SIGKILL: 9>

More information on the hardware and commands is in the next section, in case you need it.

According to the release notes for kallisto 0.50.0 "The improved kallisto index reduces memory consumption for large FASTA files", but with this version I can't generate an index because it exhausts the RAM, whereas with version 0.48.0 I can.

Is it normal for it to use up so much RAM? Am I missing something?

Supporting information

I ran all commands on a computer with an Intel i7-6950X, 125GB of RAM, 120GB of free storage space, and Ubuntu 22.04.3 installed.

1) Using kallisto 0.50.0 from bioconda.

I tested the version from bioconda using the test folder from kallisto's GitHub page. I also tested this on a different computer with AMD Ryzen 7-5800H and 16GB of RAM and I got the same error.

I ran:

# Create an environment for kallisto and bustools
mamba create -n kallistobustools kallisto bustools  # this installs v0.50.0 and v0.43.0
conda activate kallistobustools

# Test kallisto
cd kallisto/test/

kallisto index -i transcripts.idx transcripts.fasta.gz

Output:

Illegal instruction (core dumped)
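"Illegal instruction" generally means the binary executed a CPU instruction that the processor does not implement. The thread does not say which instruction set the bioconda build targets, so the following is only a minimal sketch for Linux x86 systems: it lists a few SIMD extensions the CPU advertises, and the set of flags checked (and the function name `cpu_simd_flags`) is an illustrative assumption.

```shell
# Report whether selected SIMD extensions appear in a cpuinfo-style file.
# Defaults to /proc/cpuinfo (Linux only); the flag list is an assumption.
cpu_simd_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  for flag in sse4_2 avx avx2 avx512f; do
    # Look at the first "flags" line only; -w avoids matching avx inside avx2.
    if grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw "$flag"; then
      echo "$flag: yes"
    else
      echo "$flag: no"
    fi
  done
}

cpu_simd_flags
```

If an extension the binary relies on is reported as missing, compiling from source on the target machine is the usual way around it.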

2 & 3) Using kallisto 0.50.0 binary from GitHub and compiled from source

Both of these versions can process the files in the test folder without errors. But when I try to index the RNA velocity transcriptome it overflows the 125GB of RAM.

I ran:

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno \
--kallisto ~/bin/kallisto_v0.50.0_source/build/src/kallisto \
--bustools ~/bin/bustools_v0.43.0/bustools \
~/TFM/reference_genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
~/TFM/reference_genomes/Homo_sapiens.GRCh38.110.gtf.gz 2>&1 | tee kallisto_0.50.0_source_index.log

This is the output I get:

[2023-10-31 15:05:31,113]    INFO [ref_lamanno] Skipping cDNA and intron FASTA generation because files already exist. Use --overwrite flag to overwrite
[2023-10-31 15:05:31,114]    INFO [ref_lamanno] Concatenating cDNA and intron FASTAs to /home/dario/TFM/reference_genomes/kallisto_index_v0.50.0_source/tmp/tmpz8s2xva8
[2023-10-31 15:08:14,362]    INFO [ref_lamanno] Creating transcript-to-gene mapping at t2g.txt
[2023-10-31 15:08:33,928]    INFO [ref_lamanno] Indexing /home/dario/TFM/reference_genomes/kallisto_index_v0.50.0_source/tmp/tmpz8s2xva8 to index.idx
[2023-10-31 17:36:56,586]   ERROR [ref_lamanno]
[build] loading fasta file /home/dario/TFM/reference_genomes/kallisto_index_v0.50.0_source/tmp/tmpz8s2xva8
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
from 2068 target sequences
[build] warning: replaced 2853594 non-ACGUT characters in the input sequence
with pseudorandom nucleotides
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Finished
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 1586249689
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 375923143
CompactedDBG::filter(): Processed 10464561000 k-mers in 1546082 reads
CompactedDBG::filter(): Found 1585647845 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 10843505
CompactedDBG::construct(): Extract approximate unitigs (1/2)
CompactedDBG::construct(): Extract approximate unitigs (2/2)
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 24047623 unitigs
CompactedDBG::construct(): After split (1/1): 24047623 unitigs
CompactedDBG::construct(): Unitigs split: 902
CompactedDBG::construct(): Unitigs deleted: 0

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 22349633 unitigs
CompactedDBG::construct(): Joined 1698149 unitigs
[build] building MPHF
[build] creating equivalence classes ...
[2023-10-31 17:36:56,697]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/main.py", line 1305, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/main.py", line 249, in parse_ref
    ref_lamanno(
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/ref.py", line 746, in ref_lamanno
    index_result = kallisto_index(combined_path, index_path, k=k or 31)   
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/ref.py", line 239, in kallisto_index
    run_executable(command)
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/home/dario/miniforge3/envs/d_kb-python/lib/python3.9/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/home/dario/bin/kallisto_v0.50.0_source/build/src/kallisto index -i index.idx -k 31 /home/dario/TFM/reference_genomes/kallisto_index_v0.50.0_source/tmp/tmpz8s2xva8' died with <Signals.SIGKILL: 9>.
Yenaled commented 1 year ago

Thanks for bringing up this issue; it seems to be specific to the lamanno workflow (which we're deprecating in favor of something better).

The way you're running the programs is correct, so it's an issue on our end.

I'm looking into this.

Yenaled commented 1 year ago

I've identified the issue: it's because the lamanno workflow is poorly implemented. Every single intron is considered a unique "transcript". You have k-mers in the index that can potentially map to over 200K different transcripts.

That's one reason we're deprecating lamanno (if you look at the devel version of kb-python; we have a new --workflow=nac that will entirely supersede lamanno).

Basically, building the lamanno index now takes 152 GB of memory (though loading it takes only 25 GB, which is a substantial improvement). I admittedly did not consider lamanno when writing the index construction step of kallisto, since we were deprecating it anyway.

We just released the newest version of kallisto (0.50.1) which still contains this issue. However, I'll narrow down the issue in the code and see if there's a trivial way to fix it (if so, a new release will be put out very shortly; if not, well, consider upgrading ;) ).

I hope that makes sense! Let me know if you have any questions!
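As a rough pre-flight check against the 152 GB build figure quoted above, the machine's total RAM can be compared with the requirement before starting a long index build. This is a minimal sketch for Linux: the function name `has_enough_ram` is hypothetical, swap is ignored, and 1 GB is treated as 1024 * 1024 kB.

```shell
# Succeed if MemTotal in a meminfo-style file is at least the required number of GB.
# $1: required GB; $2: file to read (defaults to /proc/meminfo, Linux only).
has_enough_ram() {
  required_gb="$1"
  meminfo="${2:-/proc/meminfo}"
  awk -v need="$required_gb" '
    /^MemTotal:/ { ok = ($2 >= need * 1024 * 1024) }   # MemTotal is reported in kB
    END { exit (ok ? 0 : 1) }
  ' "$meminfo"
}

if has_enough_ram 152; then
  echo "RAM looks sufficient for the lamanno index build"
else
  echo "RAM is below 152 GB; the index build may be killed"
fi
```

On the 125GB machine described in the first post, this check would take the second branch, which matches the SIGKILL seen during indexing.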

dariolez commented 1 year ago

Thank you for answering so quickly.

I do have some more questions. But first, I have to give you some context for why I opened this issue with kallisto 0.50.0 in the first place:

I'm trying to perform RNA velocity on some 10x v2 scRNA-seq files that have 160-210M reads each (12-15GB per file). At first I tried using version 0.50.0, but because it didn't work, I resorted to version 0.48.0. This version can create an index within my available 125GB of RAM (the generated index file takes 45GB of storage) and can also pseudoalign. The problem is that the pseudoaligning step with kallisto 0.48.0 takes days for some of my files and I have to go one at a time with the current resources that I have. So, I tried to see if I could make kallisto 0.50.0 work because it had the improved indexing method. But because it didn't work by any means, I opened this issue and currently continue to use kallisto 0.48.0.

My questions:

1) If I understand correctly, what you are saying is that the new way of indexing in kallisto 0.50.0 takes 152GB of RAM during execution, but when it finishes, the index occupies 25GB of storage. Right?
2) Could the long execution times when I pseudoalign with kallisto 0.48.0 be caused by what you say about k-mers mapping to thousands of transcripts? The R package BUSpaRse also treats each intron separately when generating files for RNA velocity, and I don't know of any other tools to obtain cDNA-intron FASTA files that do something different.
3) If I managed to generate an index with kallisto 0.50.0 (or 0.50.1), would the pseudoalignment be faster than with version 0.48.0?

I'm sorry if I strayed slightly from the original issue. And thank you again for answering so fast.

Yenaled commented 1 year ago
  1. Correct
  2. Yes
  3. Likely (however, I haven't tested it out with the lamanno workflow). If you use our upgraded workflow (--workflow=nac), you'll get much lower memory usage (for both index generation and pseudoalignment) and much, much lower runtimes (and far better accuracy too). It's currently on the devel branch of kb-python.

I've finished writing a detailed manual -- will release it sometime this month. Happy to answer any questions in the meantime or walk you through things.

williamtbarker commented 1 year ago

I had the same issue on an M2 Mac. I had to conda install an older version of kallisto (0.46.2) to get it to work.

NikTuzov commented 11 months ago

Hello All:

I have a question related to the discussion above. I am trying to reproduce this tutorial:

https://www.kallistobus.tools/tutorials/kb_velocity/python/kb_velocity/

using the new version of kb (--workflow nac).

As of Dec 14, I can't run it with kb-python 0.28.0 because kb count runs out of memory (with over 100 GB available).

1) Is there a working version of kb-python that uses the "nac" workflow?

2) If not, what are the latest versions of kallisto and bustools that I can install myself and that work with the "nac" workflow?

Regards, Nik

Yenaled commented 11 months ago

Can you show the commands you’re using and how you’re building the index (what FASTA/GTF files are being used) as well as the commands you’re using for kb count?

kb count (0.28.0 nac index) will not consume even a third of that amount of memory, so something is wrong on your end.

NikTuzov commented 11 months ago

The commands are:

> pip show kb-python
Name: kb-python
Version: 0.28.0

> kb ref --workflow=nac -d human -i index.idx -g t2g.txt -c1 cdna.txt -c2 nascent.txt

> # SRR6470906
echo "Running SRR6470906 ..."
/usr/bin/time kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 \
-c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 \
SRR6470906_S1_L001_R1_001.fastq.gz \
SRR6470906_S1_L001_R2_001.fastq.gz \
SRR6470906_S1_L002_R1_001.fastq.gz \
SRR6470906_S1_L002_R2_001.fastq.gz 

The error message from kb count is:

Running SRR6470906 ...
[2023-12-18 12:51:10,532]    INFO [count_nac] Using index index.idx to generate BUS file to SRR6470906 from
[2023-12-18 12:51:10,532]    INFO [count_nac]         SRR6470906_S1_L001_R1_001.fastq.gz
[2023-12-18 12:51:10,532]    INFO [count_nac]         SRR6470906_S1_L001_R2_001.fastq.gz
[2023-12-18 12:51:10,533]    INFO [count_nac]         SRR6470906_S1_L002_R1_001.fastq.gz
[2023-12-18 12:51:10,533]    INFO [count_nac]         SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 12:51:12,640]   ERROR [count_nac] 
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 12:51:12,640]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 1789, in count_nac
    bus_result = kallisto_bus(
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 203, in kallisto_bus
    run_executable(command)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz' died with <Signals.SIGILL: 4>.
6.11user 3.17system 0:12.96elapsed 71%CPU (0avgtext+0avgdata 547996maxresident)k
254984inputs+0outputs (497major+166449minor)pagefaults 0swaps

Further, I ran dmesg, and its output implies that memory consumption was over 100 GB.

Yenaled commented 11 months ago

Can you run /usr/bin/time -v (include the -v)

And then use --verbose when using kb count?

It works on my end.

Edit: It takes 18 GB on my end for the human index (10 GB for the mouse index).

NikTuzov commented 11 months ago

I tried the following:

/usr/bin/time -v kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 \
-c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 --verbose \
SRR6470906_S1_L001_R1_001.fastq.gz \
SRR6470906_S1_L001_R2_001.fastq.gz \
SRR6470906_S1_L002_R1_001.fastq.gz \
SRR6470906_S1_L002_R2_001.fastq.gz 

The output is:

[2023-12-18 14:58:14,234]   DEBUG [main] Printing verbose output
[2023-12-18 14:58:16,440]   DEBUG [main] kallisto binary located at /usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto
[2023-12-18 14:58:16,441]   DEBUG [main] bustools binary located at /usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools
[2023-12-18 14:58:16,441]   DEBUG [main] Creating `SRR6470906/tmp` directory
[2023-12-18 14:58:16,450]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='index.idx', g='t2g.txt', x='10xv2', o='SRR6470906', num=False, w=None, r=None, t=8, m='4G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='nac', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1='cdna.txt', c2='nascent.txt', overwrite=False, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['SRR6470906_S1_L001_R1_001.fastq.gz', 'SRR6470906_S1_L001_R2_001.fastq.gz', 'SRR6470906_S1_L002_R1_001.fastq.gz', 'SRR6470906_S1_L002_R2_001.fastq.gz'])
[2023-12-18 14:58:19,275]    INFO [count_nac] Using index index.idx to generate BUS file to SRR6470906 from
[2023-12-18 14:58:19,275]    INFO [count_nac]         SRR6470906_S1_L001_R1_001.fastq.gz
[2023-12-18 14:58:19,275]    INFO [count_nac]         SRR6470906_S1_L001_R2_001.fastq.gz
[2023-12-18 14:58:19,275]    INFO [count_nac]         SRR6470906_S1_L002_R1_001.fastq.gz
[2023-12-18 14:58:19,275]    INFO [count_nac]         SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 14:58:19,275]   DEBUG [count_nac] kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 14:58:19,377]   DEBUG [count_nac] 
[2023-12-18 14:58:19,377]   DEBUG [count_nac] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 14:58:21,181]   ERROR [count_nac] 
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 14:58:21,181]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 1789, in count_nac
    bus_result = kallisto_bus(
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 203, in kallisto_bus
    run_executable(command)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz' died with <Signals.SIGILL: 4>.
[2023-12-18 14:58:21,183]   DEBUG [main] Removing `SRR6470906/tmp` directory
        Command being timed: "kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 -c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 --verbose SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz"
        User time (seconds): 5.57
        System time (seconds): 2.75
        Percent of CPU this job got: 75%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 548888
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 142372
        Voluntary context switches: 26784
        Involuntary context switches: 493145
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
Yenaled commented 11 months ago

It says "Maximum resident set size (kbytes): 548888"

That is only 0.5 gigabytes.
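The conversion behind that figure can be reproduced directly from the `/usr/bin/time -v` output, which reports the peak in kbytes. A minimal sketch, where the helper name `max_rss_gib` is hypothetical:

```shell
# Read `/usr/bin/time -v` output on stdin and print the peak resident set size in GiB.
max_rss_gib() {
  awk -F': ' '/Maximum resident set size/ { printf "%.2f GiB\n", $2 / 1024 / 1024 }'
}

# The line from the log above:
printf '        Maximum resident set size (kbytes): 548888\n' | max_rss_gib
# prints "0.52 GiB"
```

Because the process died about two seconds after kallisto bus started, this peak reflects a crash on startup rather than a run that used up the available memory.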

The "Signals.SIGILL: 4" means illegal instruction. That likely means that the prepackaged binaries do NOT work on your system and that you need to compile kallisto (and possibly bustools) from source.

See the instructions on the first page here: https://www.biorxiv.org/content/biorxiv/early/2023/11/22/2023.11.21.568164/DC1/embed/media-1.pdf for information on how to compile from source on your system and how to use your source-compiled kallisto+bustools within kb-python.
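The two failures in this thread end with different signals (SIGKILL: 9 during indexing, SIGILL: 4 during kb count), and telling them apart is the quickest way to separate a memory problem from a binary that cannot run on the CPU. A small sketch using the shell's `kill -l` to translate a signal number, or a shell exit status above 128, into a name; the wrapper name `signal_name` is hypothetical:

```shell
# Print the signal name for a signal number, or for an exit status of 128 + signal.
signal_name() {
  n="$1"
  if [ "$n" -gt 128 ]; then
    n=$((n - 128))   # shells report "killed by signal N" as exit status 128 + N
  fi
  kill -l "$n"
}

signal_name 4     # ILL: illegal instruction, as in the kb count failure
signal_name 9     # KILL: forced termination, as in the index build that ran out of RAM
signal_name 137   # KILL again, given as a shell exit status
```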

NikTuzov commented 11 months ago

Hello Delaney:

I followed the PDF instructions and it works. Thanks a lot!
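For readers who have not opened the PDF: the first post in this thread already shows kb accepting custom binaries through `--kallisto` and `--bustools`. The sketch below only assembles and prints such a command so it can be reviewed before running. The binary paths and the helper name `kb_with_local_bins` are hypothetical, and the PDF remains the reference for the actual build steps.

```shell
# Hypothetical locations of source-compiled binaries; replace with your own build paths.
KALLISTO_BIN="${KALLISTO_BIN:-$HOME/bin/kallisto/build/src/kallisto}"
BUSTOOLS_BIN="${BUSTOOLS_BIN:-$HOME/bin/bustools/build/src/bustools}"

# Print a kb command line with the custom binaries passed explicitly.
# Nothing is executed here; copy the printed line to run it.
kb_with_local_bins() {
  subcommand="$1"
  shift
  printf '%s\n' "kb $subcommand --kallisto $KALLISTO_BIN --bustools $BUSTOOLS_BIN $*"
}

kb_with_local_bins count -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 \
  -c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 \
  SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz
```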

Regards, Nik Tuzov

jtl429 commented 6 months ago

How do you install the specific older version 0.48.0 with conda?

dariolez commented 6 months ago

To install a specific version using conda or mamba you can do:

mamba install -c bioconda 'kallisto=0.48.0'

This is explained in the conda documentation.

To see the available versions of a package you can search here.

The option -c bioconda can be omitted if you have bioconda configured as a channel.
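After pinning a version, it is worth confirming which kallisto is actually first on PATH, since several installs (conda, a GitHub binary, a source build) can coexist, as they did earlier in this thread. A minimal sketch: it assumes `kallisto version` prints a line of the form "kallisto, version 0.48.0", and the helper name `kallisto_version_of` is hypothetical.

```shell
# Available versions can be listed from the command line with:
#   conda search -c bioconda kallisto

# Extract the version number from `kallisto version` output read on stdin.
# Assumes the output looks like "kallisto, version 0.48.0".
kallisto_version_of() {
  sed -n 's/^kallisto, version \([0-9][0-9.]*\).*$/\1/p'
}

# Report the version found on PATH; skipped if kallisto is not installed.
if command -v kallisto >/dev/null 2>&1; then
  kallisto version | kallisto_version_of
fi
```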