mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
449 stars 75 forks source link

--transform nucleotide throws KeyError #221

Closed kubu4 closed 1 month ago

kubu4 commented 1 month ago

I'm using PyFaidx CLI to extract a region from a genome FastA file.

Then, I'm attempting to use the PyFaidx --transform nucleotide command to get some stats, but I get a KeyErro when it attempts to process the FastA.

I'm running the following:

${pyfaidx} "${data_dir}/${genome_fasta}" "${region}" | tee "${output_top}/${c1q_buffer_fasta}"

${pyfaidx} --transform nucleotide "${output_top}/${c1q_buffer_fasta}"

This results in the following output/error:

name    start   end A   T   C   G   N   others
Traceback (most recent call last):
  File "/home/shared/pyfaidx-0.8.1.1", line 8, in <module>
    sys.exit(main())
  File "/home/sam/.local/lib/python3.10/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/home/sam/.local/lib/python3.10/site-packages/pyfaidx/cli.py", line 51, in write_sequence
    outfile.write(transform_sequence(args, fasta, name, start, end))
  File "/home/sam/.local/lib/python3.10/site-packages/pyfaidx/cli.py", line 121, in transform_sequence
    line_len = fasta.faidx.index[name].lenc
KeyError: 'NC_052339.1'

Here is the head of the FastA produced by PyFaidx ("${output_top}/${c1q_buffer_fasta}"):

>NC_052339.1:19241056-19251513
TGGGGCTCGCCAGGACAAGGGGAGCCCTGAAAAGATGTGAAGTTACACCCCAGCTACCCAGTCACCGTCTTTCACTATCC
GCAACCCTCCACTGGGACAATCGTCAACAATACATTCAAGCCTTCATGTACTCCGGGGCCGCGAATAATTTCAGGGACCA
AGACTTCGCCAGCGAATTACAAATCCCCTGCGTGAAATGTCCCATCCCTCTCCATCTGCTCCGGCCAGTTGGGGTATCAG
ATCAAGCCCATTCTACTGCAGGTTGGGGTGAACCACTCAGACTCTTAGTTTTCTTTTGATCACTGCTCCCGAGAACCTGC
TTATCCTAGGGTACCCCTGGCTAGTGCTACATAACCCACAATTTTTCTGGTCCACGGGACACATGCTAGACTGGGGTTGT
TCAACATATGACATTAAAATGACAGATCGCTATAATAGTTGTACActcattaaaggcccagtgcagtcaacgttatgtga
tcagtgttatttcctgatagttgctggttgaaaatacaatctacactgGACCTTTTAATCAGCGGGTTTGCATGGGTGGG
AGTTTTGGCTTTCCATgatgacatcaccatgcggtaaattgattaatagaccaataacagagttccaaacttctttgcCA
ATAACAGCAATTTTTCAGTTTTTCCCTCCCCACtcaaaccactcccagacagtccttgcAAAATTCTTGTTTgtgaatgt

Here's the corresponding FastA index file:

NC_052339.1:19241056-19251513   10458   31  80  81

I have minimal experience with Python, so am not entirely sure where to begin troubleshooting this. Any help/insight would be greatly appreciated. Thanks.

EDITED: Removed unnecessary use of samtools.

mdshw5 commented 1 month ago

Thanks for the detailed report! I've been traveling and am just now sitting down to look at this. Based on your other issue (https://github.com/RobertsLab/resources/issues/1941) I looked for the assembly (found at https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/432/855/GCF_016432855.1_SaNama_1.0/) and downloaded the RefSeq FNA to try and reproduce this myself.

Using pyfaidx 0.8.1.1 under Ubuntu 23.04 I downloaded the assembly and ran these commands:

$ faidx GCF_016432855.1_SaNama_1.0_genomic.fna NC_052339.1 | tee buffer.fa
(output omitted)
$ faidx --transform nucleotide buffer.fa 
name    start   end     A       T       C       G       N       others
NC_052339.1     1       37339918        10586364        10549632        8085804 8112718 5400

I think this is basically what your script is doing - creating buffer files consisting of one sequence, in this case NC_052339.1, and then using the --transform flag to get the nucleotide stats. I don't see anything wrong with this, and looks like I can't reproduce the issue on my end. I'd point of that you can accomplish the same task without the buffer files though:

$ faidx GCF_016432855.1_SaNama_1.0_genomic.fna NC_052339.1 --transform nucleotide
name    start   end     A       T       C       G       N       others
NC_052339.1     1       37339918        10586364        10549632        8085804 8112718 5400

Maybe try this method in your script? For now I'm closing this as it's apparently not an issue with the pyfaidx code. I'm glad to help work on this more as well if you need.

kubu4 commented 1 month ago

Thanks so much for taking the time to look at, and respond, to this!

Also, thanks so much for this:

$ faidx GCF_016432855.1_SaNama_1.0_genomic.fna NC_052339.1 --transform nucleotide

I must've overlooked this capability!