Open dariolez opened 1 year ago
Thanks for bringing up this issue; it seems to be specific to the lamanno workflow (which we're deprecating in favor of something better).
The way you're running the programs is correct, so it's an issue on our end.
I'm looking into this.
I've identified the issue: it's because the lamanno workflow is poorly implemented. Every single intron is considered a unique "transcript". You have k-mers in the index that can potentially map to over 200K different transcripts.
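To illustrate the point about k-mers mapping to many "transcripts", here is a minimal toy sketch. It is not kallisto's actual index structure: the sequences, the names `intron_1` … `intron_5`, and the helper `build_kmer_index` are all hypothetical. It only shows that when every intron is its own target, a k-mer from a shared repeat ends up associated with every intron that contains it.

```python
from collections import defaultdict


def build_kmer_index(targets, k):
    """Map every k-mer to the set of target names that contain it (toy model)."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


# Hypothetical data: one repeat element embedded in five separate "introns".
repeat = "ACGTACGTAC"
targets = {f"intron_{i}": "G" * i + repeat + "T" * i for i in range(1, 6)}

index = build_kmer_index(targets, k=5)

# Size of the largest set of targets sharing a single k-mer.
largest = max(len(names) for names in index.values())
print(largest)
```

With a real genome, repeats shared by hundreds of thousands of introns would make these per-k-mer sets correspondingly large, which is the situation described above.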
That's one reason we're deprecating lamanno (if you look at the devel version of kb-python; we have a new --workflow=nac that will entirely supersede lamanno).
Basically, building lamanno now takes 152 GB of memory (though it only takes 25 GB to actually load the index, which is a substantial improvement). I admittedly did not consider lamanno when writing the index construction step of kallisto, since we were deprecating it anyway.
We just released the newest version of kallisto (0.50.1) which still contains this issue. However, I'll narrow down the issue in the code and see if there's a trivial way to fix it (if so, a new release will be put out very shortly; if not, well, consider upgrading ;) ).
I hope that makes sense! Let me know if you have any questions!
Thank you for answering so quickly.
I do have some more questions. But first, I have to give you some context for why I opened this issue with kallisto 0.50.0 in the first place:
I'm trying to perform RNA velocity on some 10x v2 scRNA-seq files that have 160-210M reads each (12-15GB per file). At first I tried using version 0.50.0, but because it didn't work, I resorted to version 0.48.0. This version can create an index within my available 125GB of RAM (the generated index file takes 45GB of storage) and can also pseudoalign. The problem is that the pseudoaligning step with kallisto 0.48.0 takes days for some of my files and I have to go one at a time with the current resources that I have. So, I tried to see if I could make kallisto 0.50.0 work because it had the improved indexing method. But because it didn't work by any means, I opened this issue and currently continue to use kallisto 0.48.0.
My questions:
1) If I understand correctly, what you are saying is that the new way of indexing in kallisto 0.50.0 takes 152GB of RAM during execution, but when it finishes, the index occupies 25GB of storage. Right?
2) Could the long execution times when I pseudoalign with kallisto 0.48.0 be caused by what you say about k-mers mapping to thousands of transcripts? The R package BUSpaRse also treats each intron separately when generating files for RNA velocity, and I don't know of any other tools to obtain cDNA-intron FASTA files that do something different.
3) If I managed to generate an index with kallisto 0.50.0 (or 0.50.1), would the pseudoaligning be faster than with version 0.48.0?
I'm sorry if I strayed slightly from the original issue. And thank you again for answering so fast.
I've finished writing a detailed manual -- will release it sometime this month. Happy to answer any questions in the meantime or walk you through things.
I had the same issue on an M2 mac. I had to conda install an older version of kallisto to get it to work (0.46.2)
Hello All:
I have a question related to the discussion above. I am trying to reproduce this tutorial:
https://www.kallistobus.tools/tutorials/kb_velocity/python/kb_velocity/
using the new version of kb (--workflow nac).
As of Dec 14, I can't run it with kb-python 0.28.0 because kb count runs out of memory (with over 100 GB available).
1) Is there a working version of kb-python that uses the "nac" workflow?
2) If not, what are the latest versions of kallisto and bustools that I can install myself and that work with the "nac" workflow?
Regards, Nik
Can you show the commands you’re using and how you’re building the index (what FASTA/GTF files are being used) as well as the commands you’re using for kb count?
kb count (0.28.0 nac index) will not consume even a third of that amount of memory, so something is wrong on your end.
The commands are:
> pip show kb-python
Name: kb-python
Version: 0.28.0
> kb ref --workflow=nac -d human -i index.idx -g t2g.txt -c1 cdna.txt -c2 nascent.txt
> # SRR6470906
echo "Running SRR6470906 ..."
/usr/bin/time kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 \
-c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 \
SRR6470906_S1_L001_R1_001.fastq.gz \
SRR6470906_S1_L001_R2_001.fastq.gz \
SRR6470906_S1_L002_R1_001.fastq.gz \
SRR6470906_S1_L002_R2_001.fastq.gz
The error message from kb count is:
Running SRR6470906 ...
[2023-12-18 12:51:10,532] INFO [count_nac] Using index index.idx to generate BUS file to SRR6470906 from
[2023-12-18 12:51:10,532] INFO [count_nac] SRR6470906_S1_L001_R1_001.fastq.gz
[2023-12-18 12:51:10,532] INFO [count_nac] SRR6470906_S1_L001_R2_001.fastq.gz
[2023-12-18 12:51:10,533] INFO [count_nac] SRR6470906_S1_L002_R1_001.fastq.gz
[2023-12-18 12:51:10,533] INFO [count_nac] SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 12:51:12,640] ERROR [count_nac]
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 12:51:12,640] ERROR [main] An exception occurred
Traceback (most recent call last):
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 592, in parse_count
count_nac(
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
return func(*args, **kwargs)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 1789, in count_nac
bus_result = kallisto_bus(
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 203, in kallisto_bus
run_executable(command)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
return func(*args, **kwargs)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz' died with <Signals.SIGILL: 4>.
6.11user 3.17system 0:12.96elapsed 71%CPU (0avgtext+0avgdata 547996maxresident)k
254984inputs+0outputs (497major+166449minor)pagefaults 0swaps
Further, I used the dmesg command, and its output implies that memory consumption was over 100 GB.
Can you run /usr/bin/time -v (include the -v)
And then use --verbose when using kb count?
It works on my end.
Edit: It takes 18 GB on my end for the human index (10 GB for the mouse index).
I tried the following:
/usr/bin/time -v kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 \
-c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 --verbose \
SRR6470906_S1_L001_R1_001.fastq.gz \
SRR6470906_S1_L001_R2_001.fastq.gz \
SRR6470906_S1_L002_R1_001.fastq.gz \
SRR6470906_S1_L002_R2_001.fastq.gz
The output is:
[2023-12-18 14:58:14,234] DEBUG [main] Printing verbose output
[2023-12-18 14:58:16,440] DEBUG [main] kallisto binary located at /usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto
[2023-12-18 14:58:16,441] DEBUG [main] bustools binary located at /usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools
[2023-12-18 14:58:16,441] DEBUG [main] Creating `SRR6470906/tmp` directory
[2023-12-18 14:58:16,450] DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='index.idx', g='t2g.txt', x='10xv2', o='SRR6470906', num=False, w=None, r=None, t=8, m='4G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='nac', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1='cdna.txt', c2='nascent.txt', overwrite=False, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['SRR6470906_S1_L001_R1_001.fastq.gz', 'SRR6470906_S1_L001_R2_001.fastq.gz', 'SRR6470906_S1_L002_R1_001.fastq.gz', 'SRR6470906_S1_L002_R2_001.fastq.gz'])
[2023-12-18 14:58:19,275] INFO [count_nac] Using index index.idx to generate BUS file to SRR6470906 from
[2023-12-18 14:58:19,275] INFO [count_nac] SRR6470906_S1_L001_R1_001.fastq.gz
[2023-12-18 14:58:19,275] INFO [count_nac] SRR6470906_S1_L001_R2_001.fastq.gz
[2023-12-18 14:58:19,275] INFO [count_nac] SRR6470906_S1_L002_R1_001.fastq.gz
[2023-12-18 14:58:19,275] INFO [count_nac] SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 14:58:19,275] DEBUG [count_nac] kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz
[2023-12-18 14:58:19,377] DEBUG [count_nac]
[2023-12-18 14:58:19,377] DEBUG [count_nac] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 14:58:21,181] ERROR [count_nac]
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2023-12-18 14:58:21,181] ERROR [main] An exception occurred
Traceback (most recent call last):
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/main.py", line 592, in parse_count
count_nac(
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
return func(*args, **kwargs)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 1789, in count_nac
bus_result = kallisto_bus(
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/count.py", line 203, in kallisto_bus
run_executable(command)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
return func(*args, **kwargs)
File "/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/usr/bin/python_env_3_10/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index.idx -o SRR6470906 -x 10xv2 -t 8 SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz' died with <Signals.SIGILL: 4>.
[2023-12-18 14:58:21,183] DEBUG [main] Removing `SRR6470906/tmp` directory
Command being timed: "kb count --h5ad -i index.idx -g t2g.txt -x 10xv2 -o SRR6470906 -c1 cdna.txt -c2 nascent.txt --workflow nac -t 8 --verbose SRR6470906_S1_L001_R1_001.fastq.gz SRR6470906_S1_L001_R2_001.fastq.gz SRR6470906_S1_L002_R1_001.fastq.gz SRR6470906_S1_L002_R2_001.fastq.gz"
User time (seconds): 5.57
System time (seconds): 2.75
Percent of CPU this job got: 75%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.99
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 548888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 142372
Voluntary context switches: 26784
Involuntary context switches: 493145
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
It says "Maximum resident set size (kbytes): 548888"
That is only 0.5 gigabytes.
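As a quick check of that conversion, here is a small sketch that pulls the "Maximum resident set size" line out of `/usr/bin/time -v` output and converts it to GiB. The helper name `max_rss_gib` is hypothetical, and the sample string is just the line quoted above.

```python
import re


def max_rss_gib(time_output: str) -> float:
    """Return the peak resident set size, in GiB, from `/usr/bin/time -v` output."""
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    # The value is reported in kbytes; 1024 * 1024 kbytes make one GiB.
    return int(match.group(1)) / (1024 ** 2)


sample = "Maximum resident set size (kbytes): 548888"
print(f"{max_rss_gib(sample):.2f} GiB")  # prints "0.52 GiB"
```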
The "Signals.SIGILL: 4" means illegal instruction. That likely means that the prepackaged binaries do NOT work on your system and that you need to compile kallisto (and possibly bustools) from source.
See the instructions on the first page here: https://www.biorxiv.org/content/biorxiv/early/2023/11/22/2023.11.21.568164/DC1/embed/media-1.pdf for information on how to compile from source on your system and how to use your source-compiled kallisto+bustools within kb-python.
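For anyone reading the tracebacks above: Python's `subprocess` reports a child killed by signal N as return code -N, which is where "died with <Signals.SIGILL: 4>" comes from. The sketch below decodes such a return code with the standard `signal` module. The helper name `describe_returncode` is hypothetical, and the signal numbers assume a POSIX system such as Linux.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code; negative values mean 'killed by signal'."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


print(describe_returncode(-4))  # SIGILL: illegal instruction (binary does not match the CPU)
print(describe_returncode(-9))  # SIGKILL: e.g. sent by the kernel's out-of-memory killer
```

So SIGILL points at the binary itself, while SIGKILL after the RAM fills up is consistent with running out of memory, as in the original report.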
Hello Delaney:
I used the pdf instruction and it works. Thanks a lot!
Regards, Nik Tuzov
How do you download the specific older version of 0.48.0 in conda?
To install a specific version using conda or mamba you can do:
mamba install -c bioconda 'kallisto=0.48.0'
This is explained in the conda documentation.
To see the available versions of a package you can search here.
The option -c bioconda can be omitted if you have bioconda configured as a channel.
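After pinning a version, it can be worth confirming which kallisto is actually on your PATH. The sketch below is one way to do that from Python. It assumes that `kallisto version` prints a line like "kallisto, version 0.48.0"; the helper names `parse_kallisto_version` and `installed_kallisto_version` are hypothetical.

```python
import re
import shutil
import subprocess


def parse_kallisto_version(text):
    """Extract (major, minor, patch) from a version string, or None if absent."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_kallisto_version():
    """Return the version of the kallisto found on PATH, or None if unavailable."""
    exe = shutil.which("kallisto")
    if exe is None:
        return None
    # Assumption: `kallisto version` prints the version line described above.
    result = subprocess.run([exe, "version"], capture_output=True, text=True, check=False)
    return parse_kallisto_version(result.stdout + result.stderr)


print(parse_kallisto_version("kallisto, version 0.48.0"))  # prints "(0, 48, 0)"
```

Calling `installed_kallisto_version()` returns `None` when kallisto is not installed or when the binary produces no version string (for example, if it crashes on startup).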
I'm trying to perform RNA velocity with kallisto, bustools and their wrapper kb-python following the instructions in this R Notebook. But I'm unable to generate an index with kallisto 0.50.0.
Summary of what I tried
1) kallisto 0.50.0 from bioconda: it fails with
Illegal instruction (core dumped)
2) kallisto 0.50.0 binary from GitHub: it overflows 125GB of RAM + 15GB of swap and is killed by the system (died with <Signals.SIGKILL: 9>)
3) kallisto 0.50.0 compiled from source: it again overflows the RAM and ends with died with <Signals.SIGKILL: 9>
I include more information about the hardware and commands in the next section if you need it.
According to the release notes for kallisto 0.50.0 "The improved kallisto index reduces memory consumption for large FASTA files", but with this version I can't generate an index because it exhausts the RAM, whereas with version 0.48.0 I can.
Is it normal for it to use up so much RAM? Am I missing something?
Supporting information
I have run all commands on a computer with an Intel i7-6950X, 125GB of RAM, 120GB of free storage space, and Ubuntu 22.04.3 installed.
1) Using kallisto 0.50.0 from bioconda.
I tested the version from bioconda using the test folder from kallisto's GitHub page. I also tested this on a different computer with AMD Ryzen 7-5800H and 16GB of RAM and I got the same error.
I ran:
Output:
2 & 3) Using kallisto 0.50.0 binary from GitHub and compiled from source
Both of these versions can process the files in the test folder without errors. But when I try to index the RNA velocity transcriptome it overflows the 125GB of RAM.
I ran:
This is the output I get: