ryanmelnyk / PyParanoid

Rapid and scalable homolog identification for bacterial genomes
MIT License

PGAP protein fastas error #13

Open adriangeerre opened 1 month ago

adriangeerre commented 1 month ago

Hi everyone,

First, thank you for such a nice tool. This is more of a comment than a bug report. I am using protein FASTA files produced by NCBI's PGAP annotation. In these files, the header of each protein follows the template "gnl|extdb|pgaptmp_XXXXXX" (where each X is a digit) plus a variable functional description. The issue arises when running BuildGroups.py, specifically in hash_fastas(). The error is the following:

Traceback (most recent call last):
  File "~/miniconda3/envs/PyParanoid-stable/bin/BuildGroups.py", line 584, in <module>
    main()
  File "~/miniconda3/envs/PyParanoid-stable/bin/BuildGroups.py", line 555, in main
    seqdata, desc, seq_number = hash_fastas()
  File "~/miniconda3/envs/PyParanoid-stable/bin/BuildGroups.py", line 213, in hash_fastas
    desc[str(seq.id)] = match.group(2)
AttributeError: 'NoneType' object has no attribute 'group'

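The execution fails because the regular expression assigned to the variable match finds nothing in the header, so re.search returns None (Line 237: match = re.search("(description:)(.*)",d)) and the subsequent call to match.group(2) raises the AttributeError.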
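To illustrate the failure mode, here is a minimal sketch in Python. The helper name extract_description, the example headers, and the fallback to the full header are my own assumptions for illustration; this is not the actual PyParanoid code, only the same regular expression applied to a PGAP-style header with a guard around the None case.

```python
import re


def extract_description(d):
    """Hypothetical helper: return the text after 'description:'.

    Falls back to the whole header string when the pattern is absent,
    instead of calling .group() on None.
    """
    match = re.search("(description:)(.*)", d)
    if match is None:
        # re.search returns None when nothing matches; this is the case
        # that raises AttributeError in the traceback above.
        return d
    return match.group(2)


# Example headers (made up for illustration).
pgap_header = "gnl|extdb|pgaptmp_000001 hypothetical protein"
tagged_header = "abc123 pep description:hypothetical protein"

print(extract_description(pgap_header))
print(extract_description(tagged_header))
```

Whether falling back to the full header is the right behaviour for BuildGroups.py is a question for the maintainer; the sketch only shows where the None comes from.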

I worked around the error by stripping the prefix from the protein FASTA headers with Linux commands:

cd Data/FAA/pep
ls | xargs -I {} sh -c "sed -i 's/gnl|extdb|//g' {}"
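For anyone who prefers to do the same cleanup from Python, here is a rough equivalent of the sed command above. The function names are hypothetical, and like sed -i it rewrites the files in place, so it may be worth trying it on a copy of the directory first.

```python
from pathlib import Path

# Literal prefix that PGAP puts in front of each protein identifier.
PREFIX = "gnl|extdb|"


def strip_prefix(text):
    """Remove every occurrence of the PGAP prefix from a block of text."""
    return text.replace(PREFIX, "")


def strip_prefix_in_dir(directory):
    """Rewrite every regular file in `directory` with the prefix removed."""
    for path in Path(directory).iterdir():
        if path.is_file():
            path.write_text(strip_prefix(path.read_text()))


# Usage (assumes the same directory as in the shell commands above):
# strip_prefix_in_dir("Data/FAA/pep")
```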

I hope it helps! :)