pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
154 stars 23 forks source link

Potential problem with kb-python; #275

Open fernando-pierucci-alves opened 4 weeks ago

fernando-pierucci-alves commented 4 weeks ago

Describe the issue I am executing this command as per 'box 2' in Nat. Biot. 2024 article: ~$ kb ref -i fpa_index.idx -g fpa_t2g.txt -f1 fpa_cdna.fasta cds_from_genomic.fasta genomic.gtf

What is the exact command that was run?

kb count ...

Command output (with --verbose flag)

Please copy and paste *VERBOSE* output here.
Yenaled commented 4 weeks ago

?

fernando-pierucci-alves commented 4 weeks ago

Hi Yenaled,

I made a mistake in sending before completing. Again:

I am executing this command as per 'box 2' in Nat. Biot. 2024 article: ~$ kb ref -i fpa_index.idx -g fpa_t2g.txt -f1 fpa_cdna.fasta cds_from_genomic.fasta genomic.gtf

The fasta and gtf are the latest human NCBI files.

At the end of the indexing process, I am getting this output in terminal:

4. [2024-10-30 13:18:19,235] WARNING [main] GeneKIR3DP1_34has no transcripts. The entire gene will be marked as a transcript and an exon with IDKIR3DP1_34. [2024-10-30 13:18:19,235] WARNING [main] GeneKIR2DP1_36has no transcripts. The entire gene will be marked as a transcript and an exon with IDKIR2DP1_36`. [2024-10-30 13:18:23,996] INFO [ref] Splitting genome cds_from_genomic.fasta into cDNA at /home/fpa6/tmp/tmpxjhvjtxe [2024-10-30 14:53:01,887] INFO [ref] Concatenating 1 cDNAs to fpa_cdna.fasta [2024-10-30 14:53:01,887] INFO [ref] Creating transcript-to-gene mapping at fpa_t2g.txt [2024-10-30 14:53:01,888] INFO [ref] Indexing fpa_cdna.fasta to fpa_index.idx [2024-10-30 14:53:02,989] ERROR [ref] Warning: you asked for 8, but only 4 cores on the machine [build] loading fasta file fpa_cdna.fasta [build] k-mer length: 31 KmerStream::KmerStream(): Input file tmp/kallisto.XEaFUK6CL0C03cU43RFyOti does not exist, is ill-formed or is not in FASTA/FASTQ/GFA format. [2024-10-30 14:53:02,989] ERROR [main] An exception occurred Traceback (most recent call last): File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/main.py", line 1926, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/main.py", line 359, in parse_ref ref( File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/ngs_tools/logging.py", line 62, in inner return func(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/ref.py", line 689, in ref ) if n > 1 else kallisto_index( ^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/ref.py", line 291, in kallisto_index run_executable(command) File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/dry/init.py", line 25, in inner return func(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/utils.py", line 203, in run_executable raise sp.CalledProcessError(p.returncode, ' '.join(command)) subprocess.CalledProcessError: Command '/home/fpa6/venv_fpa_kbpython/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto index -i fpa_index.idx -k 31 -t 8 -d cds_from_genomic.fasta fpa_cdna.fasta' returned non-zero exit status 1.

The three output files are produced but they are empty.

Can you please provide insight? Thank you in advance and congrats on this overall scientific work.

Fernando

Yenaled commented 4 weeks ago

cds_from_genomic.fasta is not the whole genome; it's just the CDS. You need to supply a whole genome FASTA.

fernando-pierucci-alves commented 4 weeks ago

The article states...

"The default is --workflow=standard, which creates an index suitable for bulk and single-cell RNA-seq quantification. It creates an index built from only the coding DNA sequences (the usage of coding DNA here follows that of Ensembl30, i.e., the sequences of the mature transcripts wherein introns are not included as they have been spliced out)."

Am I misinterpreting the above? Thanks.

Yenaled commented 4 weeks ago

It builds an index from coding sequences; that doesn’t mean you supply the coding sequences. kb ref extracts the coding sequences for you from the genomic FASTA. That’s why the paper says “For both the standard and nac index types, a user supplies a genome FASTA and GTF annotation, which kb-python uses to extract the relevant sequences”.

fernando-pierucci-alves commented 4 weeks ago

That should solve the problem. Thanks again,

FPA