pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
147 stars 23 forks source link

Error during kb ref #117

Closed MikeAskenase closed 3 years ago

MikeAskenase commented 3 years ago

Describe the issue I am attempting to build a reference based on the GRCh38 human reference genome for use in scvelo downstream, but I am receiving an error I don't understand. I downloaded the reference genomes from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/

Then extracted the FNA and GTF files from the TAR and subsequent GZIP files.

I am using kb_python 0.26.0.

What is the exact command that was run?

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno -n 4 \ ~/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.fna \ ~/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.gtf

Command output

[2021-04-13 13:13:36,026] INFO Preparing /[redacted]/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.fna, /[redacted]/scratch60/RNA_velocity/FASTA_GT F_files/GCF_000001405.39_GRCh38.p13_genomic.gtf [2021-04-13 13:13:36,026] INFO Sorting /[redacted]/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.fna to /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmpw5itd_hl [2021-04-13 13:19:29,585] INFO Sorting /[redacted]/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.gtf to /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmpwljvet_g [2021-04-13 13:21:11,759] INFO Splitting genome /[redacted]/scratch60/RNA_velocity/FASTA_GTF_files/GCF_000001405.39_GRCh38.p13_genomic.fna into cDNA at /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmp2gj2zp0n [2021-04-13 13:21:11,760] WARNING The following chromosomes were found in the FASTA but does not have any "transcript" features in the GTF: NW_003315967.2, NT_187469.1, NW_019805487.1, NT_187438.1, NT_187490.1, NT_187394.1, NT_187384.1, NT_187364.1, NW_018654707.1, NW_021160002.1, NT_187436.1, NT_187466.1, NT_187530.1, NT_187541.1, NW_003315938.1, NT_187452.1, NT_187409.1, NT_187457.1, NT_187531.1, NT_187487.1, NW_003315954.1, NT_187415.1, NW_003315931.1, NT_187483.1, NT_187459.1, NT_187370.1, NW_012132921.1, NT_187544.1, NT_187410.1, NT_187403.1, NW_021160024.1, NW_021160009.1, NT_187475.1, NT_187554.1, NW_021160029.1, NT_187455.1, NT_187426.1, NT_187480.1, NT_187615.1, NT_113948.1, NW_013171804.1, NW_003315928.1, NT_187492.1, NT_187416.1, NT_187448.1, NT_187476.1, NT_187504.1, NT_187431.1, NT_187493.1, NT_187412.1, NT_187428.1, NT_167220.1, NT_187471.1, NT_187462.1, NT_187391.1, NW_021159992.1, NT_187429.1, NT_187450.1, NT_187591.1, NT_187460.1, NT_187442.1, NW_021159997.1, NT_187396.1, NT_187464.1, NT_187454.1, NT_187427.1, NT_187511.1, NT_187406.1, NT_187401.1, NW_003315959.1, NT_187479.1, NT_187419.1, NT_187461.1, NT_187407.1, NW_019805501.1, NT_187521.1, NT_187430.1, NT_187482.1, NT_187445.1, NW_012132917.1, NW_021160026.1, NT_187435.1, NT_187489.1, NW_014040925.1, NW_003315957.1, NW_012132918.1, NT_187477.1, NT_187473.1, NW_021160030.1, NW_021160025.1, NT_187365.1, NT_187486.1, NW_021160018.1, NT_187405.1, NW_021159999.1, NT_187372.1, NT_187441.1, NT_187390.1, NT_187496.1, NT_187395.1, NT_167211.2, NT_187444.1, NT_187421.1, NT_187494.1, NT_187474.1, NT_187458.1, NW_018654725.1, NT_187488.1, NT_167219.1, NW_017363820.1, NT_187449.1, NW_021159994.1, NT_187512.1, NT_187502.1, NT_187456.1, NT_187363.1, NT_187485.1, NT_187597.1, NT_187417.1, NT_187451.1, NT_187377.1, NT_187392.1, NW_003315936.1, NT_187433.1, NW_021159988.1, NT_187387.1, NW_021159995.1, NT_187616.1, NW_009646205.1, NT_187369.1, NT_187414.1, NW_013171810.1, NT_187393.1, NT_187467.1, NT_187413.1, NT_187434.1, NW_009646200.1, NT_187595.1, NT_187418.1, NT_187398.1, NT_187446.1, NT_187439.1, NT_187465.1, NT_187440.1, NT_187371.1, NT_187484.1, NT_187618.1, NT_187432.1, NT_187422.1, NT_187478.1, NT_187495.1, NW_013171803.1, NW_019805493.1, NT_187362.1, NT_187472.1, NW_021160031.1, NT_187443.1, NT_187425.1, NT_187379.1, NT_187385.1, NT_187470.1, NT_187481.1, NT_187399.1, NW_018654709.1, NT_187423.1, NW_009646204.1, NW_013171805.1, NW_011332692.1, NW_019805488.1, NT_187400.1, NT_187408.1, NW_011332694.1 , NT_187437.1, NT_187411.1, NT_187402.1, NW_021160010.1, NT_187453.1, NT_187424.1, NT_187378.1, NT_187468.1, NT_187397.1, NT_187563.1, NT_187536.1, NT_187447.1, NT_187404.1, NT_18 7491.1, NT_187463.1. No sequences will be generated for these chromosomes. [2021-04-13 13:22:34,177] INFO Wrote 178214 cDNA transcripts [2021-04-13 13:22:34,178] INFO Creating cDNA transcripts-to-capture at /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmp1zhvm9rh [2021-04-13 13:22:36,914] INFO Splitting genome into introns at /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmpys1pkh5u [2021-04-13 13:32:50,902] INFO Wrote 1691521 intron sequences [2021-04-13 13:32:50,905] INFO Creating intron transcripts-to-capture at /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmpfj4o8sn4 [2021-04-13 13:33:24,154] INFO Concatenating 1 cDNA FASTAs to cdna.fa [2021-04-13 13:33:25,552] INFO Concatenating 1 cDNA transcripts-to-captures to cdna_t2c.txt [2021-04-13 13:33:25,638] INFO Concatenating 1 intron FASTAs to intron.fa [2021-04-13 13:33:47,620] INFO Concatenating 1 intron transcripts-to-captures to intron_t2c.txt [2021-04-13 13:33:48,114] INFO Concatenating cDNA and intron FASTAs to /[redacted]/RNA_velocity/FASTA_GTF_files/tmp/tmpit2le3pg [2021-04-13 13:34:11,801] INFO Creating transcript-to-gene mapping at t2g.txt [2021-04-13 13:34:52,889] INFO Indexing cdna.fa to index.idx_cdna [2021-04-13 13:35:45,921] ERROR [build] loading fasta file cdna.fa [build] k-mer length: 31 [build] warning: clipped off poly-A tail (longer than 10) from 1487 target sequences [build] warning: replaced 14 non-ACGUT characters in the input sequence with pseudorandom nucleotides [build] counting k-mers ... [2021-04-13 13:35:45,922] ERROR An exception occurred Traceback (most recent call last): File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/main.py", line 866, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/main.py", line 126, in parse_ref ref_lamanno( File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/ref.py", line 673, in ref_lamanno cdna_index_result = kallisto_index( File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/ref.py", line 212, in kallisto_index run_executable(command) File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/dry/init.py", line 24, in inner return func(*args, **kwargs) File "/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/utils.py", line 237, in run_executable raise sp.CalledProcessError(p.returncode, ' '.join(command)) subprocess.CalledProcessError: Command '/[redacted]/conda_envs/R_4_env/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto index -i index.idx_c dna -k 31 cdna.fa' died with <Signals.SIGBUS: 7>.

My knowledge of python is fairly limited, but I am hoping we can resolve this issue. Thanks very much!

Lioscro commented 3 years ago

Hi, @MikeAskenase, I've never seen a SIGBUS error before. I'll try this on my side and see what I get.

Ruth-hals commented 3 years ago

Hi - i'm also having a similar issue. I tried this first on my own gtf and then on the standard mouse gtf and the error is the same:

No sequences will be generated for these chromosomes. [2021-04-30 17:14:50,562] ERROR An exception occurred Traceback (most recent call last): File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/main.py", line 876, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/main.py", line 126, in parse_ref ref_lamanno( File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/ref.py", line 591, in ref_lamanno cdna_fasta_path = generate_cdna_fasta( File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/fasta.py", line 335, in generate_cdna_fasta transcript_id = gtf_entry['group']['transcript_id']

Lioscro commented 3 years ago

Hi - i'm also having a similar issue. I tried this first on my own gtf and then on the standard mouse gtf and the error is the same:

No sequences will be generated for these chromosomes. [2021-04-30 17:14:50,562] ERROR An exception occurred Traceback (most recent call last): File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/main.py", line 876, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/main.py", line 126, in parse_ref ref_lamanno( File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/ref.py", line 591, in ref_lamanno cdna_fasta_path = generate_cdna_fasta( File "/data/leuven/329/vsc32990/miniconda3/envs/velocity/lib/python3.9/site-packages/kb_python/fasta.py", line 335, in generate_cdna_fasta transcript_id = gtf_entry['group']['transcript_id']

Hi, @Ruth-hals, it seems your error is coming from parsing the GTF (a bit hard to tell exactly because the error you posted is truncated). This is probably not related to the original issue, as that one is crashing while running the kallisto index stage, after successfully parsing the GTF. Could you open a new issue for this with the full command and output? We can discuss more there.

Lioscro commented 3 years ago

Hi, again @MikeAskenase, I've had some time to run this on my side, and everything seems to be working on my end. Could you tell me how much memory is available on the system?

MikeAskenase commented 3 years ago

Hi @Lioscro I have been thinking about this and wondered if the CPU allocation was the issue. I am running this on a high performance cluster and requested 1 CPU and 119 GB of RAM. Do I need to request multiple CPUs? If so, do you have recommended numbers of CPUs and an amount of RAM to request per CPU?

Lioscro commented 3 years ago

I doubt that is the issue because ref is not parallelized at all. I looked up the meaning if SIGBUS, and it seems it is trying to access memory that it can not access. https://en.wikipedia.org/wiki/Bus_error

But 119GB of RAM should be more than enough to process this data.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

amcdavid commented 2 years ago

FWIW, I also ran into this issue, and it was because my job was getting killed on the cluster due to exceeding memory. The velocity indices without the tricks that the Linnarson lab used (keeping only a window upstream from the poly-A) are real memory pigs to generate and use.

Lioscro commented 2 years ago

@amcdavid, Yes, indeed generating the full velocity index, for mouse, takes around 64GB memory. We've found that this can be cut down by about a half by using the following optional arguments to kb ref.

kb ref ... \
--include-attribute gene_biotype:protein_coding \
--include-attribute gene_biotype:lincRNA \
--include-attribute gene_biotype:antisense \
--include-attribute gene_biotype:IG_LV_gene \
--include-attribute gene_biotype:IG_V_gene \
--include-attribute gene_biotype:IG_V_pseudogene \
--include-attribute gene_biotype:IG_D_gene \
--include-attribute gene_biotype:IG_J_gene \
--include-attribute gene_biotype:IG_J_pseudogene \
--include-attribute gene_biotype:IG_C_gene \
--include-attribute gene_biotype:IG_C_pseudogene \
--include-attribute gene_biotype:TR_V_gene \
--include-attribute gene_biotype:TR_V_pseudogene \
--include-attribute gene_biotype:TR_D_gene \
--include-attribute gene_biotype:TR_J_gene \
--include-attribute gene_biotype:TR_J_pseudogene \
--include-attribute gene_biotype:TR_C_gene

These are the same arguments provided to generate the STAR genome index within Cellranger (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references).