soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

reformat.pl missing #=GF DE *descriptor* during conversion from a3m to stockholm format #276

Closed kylemeador closed 2 years ago

kylemeador commented 2 years ago

Expected Behavior

The third line of the converted file should read something like:

#=GF DE (description of multiple sequence alignment name)

Current Behavior

The output is currently only:

#=GF DE

In addition, the following message is printed on execution:

Use of uninitialized value $1 in printf at /hh-suite/scripts/reformat.pl line 749.

This indicates that a regex match in the Perl script failed, so the capture variable $1, which should be printed right after #=GF DE (as in #=GF DE $1), was never set.

Steps to Reproduce (for bugs)

Run hhblits with the following type of command:

hhblits -d /hh-suite/databases/UniRef30_2020_02 -i /sequences/3qv0_1.fasta -ohhm /profiles/3qv0_1.hmm -oa3m /profiles/3qv0_1.a3m -hide_cons -hide_pred -hide_dssp -E 1E-06 -v 1 -cpu 1

Next, I take the .a3m output file and run:

/hh-suite/scripts/reformat.pl /profiles/3qv0_1.a3m /profiles/3qv0_1.sto

HH-suite Output (for bugs)

Here is the head of the .a3m file:

>3qv0_1
ETQRVGDILQSELKIEKESLDSFNDFLNKYKFSLVETPGKNEAEIVRRTESGETVHVFFDVAQIAFANVNVVISKSEPAVSFELLMNLQEGSFYVDSATPYPSVDAALNQSAEAEITRELVYHGPPFSNLDEELQESLEAYLESRGVNEELASFISAYSEFKENNEYISWLEKMKKFFH
>UniRef100_A0A061B9H2 CYFA0S12e01882g1_1 n=1 Tax=Cyberlindnera fabianii TaxID=36022 RepID=A0A061B9H2_CYBFA
-TSRLSETLKDELTHEKQNDtevpVELNSFIAQSGFEVVNTDGQALAKLQKNG-TDEVVHVFFDVNQVVnvrpaveeveveeeeefedpyenFINLNVVVEKkaDDSAVAFDVLVGPEDGSTYIENVIAYANKAEALTETADADQKRELAYNGPAFSNLDEKLQENFEQFLTSRGINEELYQFILNYGIHKENQEYIAWLEKLNKFFN
>UniRef100_A0A099P5X5 Uncharacterized protein n=1 Tax=Pichia kudriavzevii TaxID=4909 RepID=A0A099P5X5_PICKU
-KTQLHEVITNELKFEEEDSfgldETFKTYLENNKIEIVNTDGKVLAELVKKF-NNETIHIYFDVLRITqtsyqlkqmqdqveqseylddelaeiaNADINVVIVKDSVATGFDLSLSLVDQSFSVQAITNFNNVETALSDSPEASAERDLKYSGPEYSNLAEELQEAINQYLMSRGINNELAEFILAYSGVKENNEYLDWLENLKKFTA
>UniRef100_A0A0D6EJW8 SPOSA6832_01929-mRNA-1:cds n=1 Tax=Sporidiobolus salmonicolor TaxID=5005 RepID=A0A0D6EJW8_SPOSA
-PSALSTKLGEEIKFETENGdasaepDFLKDFKADGVWKLVDVPGSDEIVLTRTF-GNEkyvpsllppsrlsladqgdhSIRLIFSISDLDaehdvepyvdeeaadagsggvgdeSVSpseqafpveTSITITKpSGGALTIDAVAQgwsrpflalsswrspisrfvlltwltFLDGLFTINNISFYPDADVALGMTSEDDWKRQGLYMGPAFDNLDEGVQSEFEQYLEERGINSALALFIPDLAEWKEQKEYVSWLKGTKEFLE
>UniRef100_A0A0F8A2Q4 Uncharacterized protein n=1 Tax=Hirsutella minnesotensis 3608 TaxID=1043627 RepID=A0A0F8A2Q4_9HYPO
------MMIEEDLKAN---EqqpASIKDFKDNSPYEIHDTPGQEVVKLVRTY-NDEKITVSFSISDITnydpfnedpaleddempedamqnanqqrgvqstggarsaqtqeqmerdmeseegeeedMDEapapisLSIVVEKpGraKGALNVEATAQ--DGHIVVDNVYYYDAAVAAHGASPEGLEKRAGAYAGPPFGSLDEDLQVLLERFLEERGIDQSMAVFVPDYVDAKEQAEYTRWLSSVKGFVD

Here is the head of the reformatted .sto file (this output was made with -num as an option on reformat.pl, but it's the same without it):

# STOCKHOLM 1.0

#=GF DE

=GC RF ET-----QRVGDILQSE---LK-I------E----K-----------E------S------L--------------------------D-----------S-----------------------F-----N------D--------F----------L----------N--K--Y-K-F-S--LV---E----T----P---GK----NE------A---EIV-------RR---T--E---S------G--E--------------------T--VH----V---F--------------------------F-----------D-----------------------V------------------------A---------------------------------------------------------------------------------Q------------I-------------------------------------------------------------------------------A--------------------------------------------------------------------------------F--------------------------------------------------------------A------------------------------------------------------N-------------------------------V------------------------------------------N-------------------------V--------V---I--------S-----K---S------------------------------E-------P----A------------V-------------------S-----F------------------------------------E---------L----------------L-M---------------------N--------------------------L-------------------------------Q----------------------------------------------E-----------G--------S------------------------F---Y------------V---DS-----------A-T--P----------Y-P-------------S---V---D--A---------A----L--------------N--Q------S----A----------E-------------A----E---------------I----------------------------T---------R--------E----------L--V--------Y--H-----------G----P--------------------P--------F---------------------------------------------------------S------------------N-------------L-----D--E--EL--------Q-E--SL-EAY--------------L-----------------E-S-RG----------V-N---E-------ELASF------I-SA-----------YSE--F----K-----------E-----------N-----N-----EY------------IS--W-----------L---E--K---------MKK--FFH

3qv0_1 ET-----QRVGDILQSE---LK-I------E----K-----------E------S------L--------------------------D-----------S-----------------------F-----N------D--------F----------L----------N--K--Y-K-F-S--LV---E----T----P---GK----NE------A---EIV-------RR---T--E---S------G--E--------------------T--VH----V---F--------------------------F-----------D-----------------------V------------------------A---------------------------------------------------------------------------------Q------------I-------------------------------------------------------------------------------A--------------------------------------------------------------------------------F--------------------------------------------------------------A------------------------------------------------------N-------------------------------V------------------------------------------N-------------------------V--------V---I--------S-----K---S------------------------------E-------P----A------------V-------------------S-----F------------------------------------E---------L----------------L-M---------------------N--------------------------L-------------------------------Q----------------------------------------------E-----------G--------S------------------------F---Y------------V---DS-----------A-T--P----------Y-P-------------S---V---D--A---------A----L--------------N--Q------S----A----------E-------------A----E---------------I----------------------------T---------R--------E----------L--V--------Y--H-----------G----P--------------------P--------F---------------------------------------------------------S------------------N-------------L-----D--E--EL--------Q-E--SL-EAY--------------L-----------------E-S-RG----------V-N---E-------ELASF------I-SA-----------YSE--F----K-----------E-----------N-----N-----EY------------IS--W-----------L---E--K---------MKK--FFH UniRef100_A0A061B9H#2 
-T-----SRLSETLKDE---LT-H------E----K-----------Q------N------Dtevp----------------------V-----------E-----------------------L-----N------S--------F----------I----------A--Q--S-G-F-E--VV---N----T----D---GQ----AL------A---KLQ-------KN---G------T------D--E--------------------V--VH----V---F--------------------------F-----------D-----------------------V------------------------N---------------------------------------------------------------------------------Q------------V-------------------------------------------------------------------------------Vnvrpaveeveveeeeefedpyen---------------------------------------------------------F--------------------------------------------------------------I------------------------------------------------------N-------------------------------L------------------------------------------N-------------------------V--------V---V--------E-----Kka-D------------------------------D-------S----A------------V-------------------A-----F------------------------------------D---------V----------------L-V---------------------G--------------------------P-------------------------------E----------------------------------------------D-----------G--------S------------------------T---Y------------I---EN-----------V-I--A----------Y-A-------------N---K---A--E---------A----L--------------T--E------T----A----------D-------------A----D---------------Q----------------------------K---------R--------E----------L--A--------Y--N-----------G----P--------------------A--------F---------------------------------------------------------S------------------N-------------L-----D--E--KL--------Q-E--NF-EQF--------------L-----------------T-S-RG----------I-N---E-------ELYQF------I-LN-----------YGI--H----K-----------E-----------N-----Q-----EY------------IA--W-----------L---E--K---------LNK--FFN UniRef100_A0A099P5X#3 
-K-----TQLHEVITNE---LK-F------E----E-----------E------D------Sfgld----------------------E-----------T-----------------------F-----K------T--------Y----------L----------E--N--N-K-I-E--IV---N----T----D---GK----VL------A---ELV-------KK---F------N------N--E--------------------T--IH----I---Y--------------------------F-----------D-----------------------V------------------------L---------------------------------------------------------------------------------R------------I-------------------------------------------------------------------------------Tqtsyqlkqmqdqveqseylddelaeia-----------------------------------------------------N--------------------------------------------------------------A------------------------------------------------------D-------------------------------I------------------------------------------N-------------------------V--------V---I--------V-----K---D------------------------------S-------V----A------------T-------------------G-----F------------------------------------D---------L----------------S-L---------------------S--------------------------L-------------------------------V----------------------------------------------D-----------Q--------S------------------------F---S------------V---QA-----------I-T--N----------F-N-------------N---V---E--T---------A----L--------------S--D------S----P----------E-------------A----S---------------A----------------------------E---------R--------D----------L--K--------Y--S-----------G----P--------------------E--------Y---------------------------------------------------------S------------------N-------------L-----A--E--EL--------Q-E--AI-NQY--------------L-----------------M-S-RG----------I-N---N-------ELAEF------I-LA-----------YSG--V----K-----------E-----------N-----N-----EY------------LD--W-----------L---E--N---------LKK--FTA UniRef100_A0A0D6EJW#4 
-P-----SALSTKLGEE---IK-F------E----T-----------E------N------Gdasaep--------------------D-----------F-----------------------L-----K------D--------F----------K----------A--D--G-V-W-K--LV---D----V----P---GS----DE------I---VLT-------RT---F------G------N--EkyvpsllppsrlsladqgdhS--IR----L---I--------------------------F-----------S-----------------------I------------------------S---------------------------------------------------------------------------------D------------L-------------------------------------------------------------------------------Daehdvepyvdeeaadagsggvgde--------------------------------------------------------S--------------------------------------------------------------V------------------------------------------------------Spseqafpve----------------------T------------------------------------------S-------------------------I--------T---I--------T-----Kp--S------------------------------G-------G----A------------L-------------------T-----I------------------------------------D---------A----------------V-A---------------------QgwsrpflalsswrspisrfvlltwltF-------------------------------L----------------------------------------------D-----------G--------L------------------------F---T------------I---NN-----------I-S--F----------Y-P-------------D---A---D--V---------A----L--------------G--M------T----S----------E-------------D----D---------------W----------------------------K---------R--------Q----------G--L--------Y--M-----------G----P--------------------A--------F---------------------------------------------------------D------------------N-------------L-----D--E--GV--------Q-S--EF-EQY--------------L-----------------E-E-RG----------I-N---S-------ALALF------I-PD-----------LAE--W----K-----------E-----------Q-----K-----EY------------VS--W-----------L---K--G---------TKE--FLE UniRef100_A0A0F8A2Q#5 
-----------MMIEED---LK-A------N------------------------------Eqqp-----------------------A-----------S-----------------------I-----K------D--------F----------K----------D--N--S-P-Y-E--IH---D----T----P---GQ----EV------V---KLV-------RT---Y------N------D--E--------------------K--IT----V---S--------------------------F-----------S-----------------------I------------------------S---------------------------------------------------------------------------------D------------I-------------------------------------------------------------------------------Tnydpfnedpaleddempedamqnanqqrgvqstggarsaqtqeqmerdmeseegeeed----------------------M--------------------------------------------------------------D------------------------------------------------------Eapapis-------------------------L------------------------------------------S-------------------------I--------V---V--------E-----Kp--Gra----------------------------K-------G----A------------L-------------------N-----V------------------------------------E---------A----------------T-A---------------------Q---------------------------------------------------------------------------------------------------------D-----------G--------H------------------------I---V------------V---DN-----------V-Y--Y----------Y-D-------------A---A---V--A---------A----H--------------G--A------S----P----------E-------------G----L---------------E----------------------------K---------R--------A----------G--A--------Y--A-----------G----P--------------------P--------F---------------------------------------------------------G------------------S-------------L-----D--E--DL--------Q-V--LL-ERF--------------L-----------------E-E-RG----------I-D---Q-------SMAVF------V-PD-----------YVD--A----K-----------E-----------Q-----A-----EY------------TR--W-----------L---S--S---------VKG--FVD UniRef100_A0A0H5BZ4#6 
-T-----SRVASTLKAE---LE-H------E----R-----------D------N------Apeaf----------------------N-----------E--------------------------------------------------------T----------S--F--A-G-F-S--VV---N----T----N---GQ----AL------G---KLE-------KD---S------S------D--E--------------------L--VH----V---F--------------------------F-----------D-----------------------V------------------------N---------------------------------------------------------------------------------Q------------I-------------------------------------------------------------------------------Vnlrsneaeeiegeeegfedpydsn--------------------------------------------------------F--------------------------------------------------------------I------------------------------------------------------N-------------------------------V------------------------------------------N-------------------------V--------V---V--------E-----Kks-D------------------------------G-------S----A------------V-------------------A-----F------------------------------------D---------V----------------L-V---------------------G--------------------------P-------------------------------E----------------------------------------------D-----------G--------S------------------------S---Y------------I---EN-----------V-T--A----------Y-A-------------D---K---T--E---------A----L--------------E--E------S----A----------E-------------A----E---------------Q----------------------------K---------R--------D----------L--R--------Y--N-----------G----P--------------------A--------F---------------------------------------------------------T------------------N-------------L-----D--E--KL--------Q-E--DF-ENY--------------L-----------------V-S-RG----------I-N---T-------DLFRF------I-VD-----------YGV--A----K-----------E-----------N-----N-----EY------------IS--W-----------L---N--K---------LNK--FFN

Context

The lack of a description breaks Stockholm file processing in other tools (Biopython's AlignIO). It appears that the reformat.pl script attempts to place a value here but is unsuccessful. I suspect that either the format of the .a3m file produced by hhblits or the reformat.pl script needs to be modified so that a proper description is included.
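Until the script itself is changed, one possible workaround is to strip the empty description line before handing the file to a downstream parser. The following is a minimal sketch under that assumption; the function name `drop_empty_gf_de` and the tiny example alignment are hypothetical and not part of HH-suite or Biopython.

```python
def drop_empty_gf_de(stockholm_text: str) -> str:
    """Remove '#=GF DE' lines that carry no description text."""
    kept = []
    for line in stockholm_text.splitlines():
        fields = line.split()
        # An empty description line splits into exactly ["#=GF", "DE"].
        if fields == ["#=GF", "DE"]:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical, shortened Stockholm content for illustration only.
example = "# STOCKHOLM 1.0\n#=GF DE\nseq1 ACDE\n//\n"
print(drop_empty_gf_de(example))
```

The cleaned text could then be wrapped in `io.StringIO` and passed to Biopython's `AlignIO.read(handle, "stockholm")`. Lines that do carry a description are left untouched.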

milot-mirdita commented 2 years ago

reformat.pl takes the description from the first sequence in the MSA (in this case 3qv0_1). More accurately, it takes everything after the first word of that header and uses it as the description. For example, with the header below the description would be "test":

>3qv0_1 test

Would skipping the #=GF DE output completely help if the description field is empty?
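The behaviour described above, together with the proposed fix, can be sketched in Python. This is only an illustration: the actual regex in reformat.pl is not shown in this thread, so the pattern `HEADER_RE` and the function `stockholm_header_lines` are hypothetical stand-ins.

```python
import re

# Hypothetical pattern: the name is the first word of the header,
# the description is everything after it (possibly nothing).
HEADER_RE = re.compile(r"^>(\S+)\s*(.*)$")


def stockholm_header_lines(first_header: str) -> list[str]:
    """Build the leading Stockholm lines from the first a3m header."""
    match = HEADER_RE.match(first_header.strip())
    if match is None:
        raise ValueError(f"not a FASTA/a3m header: {first_header!r}")
    description = match.group(2).strip()
    lines = ["# STOCKHOLM 1.0"]
    # Proposed fix: only emit '#=GF DE' when there is a description.
    if description:
        lines.append(f"#=GF DE {description}")
    return lines


print(stockholm_header_lines(">3qv0_1 test"))
print(stockholm_header_lines(">3qv0_1"))
```

With a header such as `>3qv0_1`, which has no text after the name, the sketch simply omits the `#=GF DE` line instead of printing an empty one.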

kylemeador commented 2 years ago

Thanks for the response, that makes sense. I think that is a fine solution, especially if hhblits is writing out the query sequence without a description.

milot-mirdita commented 2 years ago

I think this should solve the issue.