Open apcamargo opened 3 years ago
Thanks for digging through that, HH-suite does a lot of useless stack allocations like that sadly. And since it's currently completely unfunded there is none to go through them and fix them up.
I would have recommended to just use mmseqs createdb
and symlink the data to .ffdata
and the .index
to .ffindex
.
Or alternatively create the .ffindex
immediatly in the right order with the following command:
LC_ALL=C sort db.index > db.ffindex
Most tools should sort the .ffindex
anyway on first use (iirc), this should be more efficient though.
Thanks for the suggestion, @milot-mirdita!
@milot-mirdita , Hi milot-mirdita, I followed your advice to use mmseqs createdb, but I encounter new error, do you have any idea about this: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:
09:22:49.112 ERROR: unrecognized input file format in '1'
09:22:49.112 ERROR: line = YAPVNWGHGEINDSTTVEPILDGPYQPTTFTPPTDYWILINSNTNGVVYESTNNSDFWTAVIAVEPHVDPVDRQYNVFGENKQFNVRNDSDKWKFLEMFRGSSQSDFYNRRTLTSD
09:22:49.114 ERROR: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:
09:22:49.114 ERROR: unrecognized input file format in '0'
09:22:49.114 ERROR: line = MNVFTSSVGSVEFDHPLLLENDLTSLSINCDDVHCSSRALCYIYDIHSSRHPSIDEHQFLRLLHGPDDAVTLGSFLKTLIWILSHDKNLPEEYRL
@milot-mirdita , Hi milot-mirdita, I followed your advice to use mmseqs createdb, but I encounter new error, do you have any idea about this: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:
- 09:22:49.112 ERROR: unrecognized input file format in '1'
- 09:22:49.112 ERROR: line = YAPVNWGHGEINDSTTVEPILDGPYQPTTFTPPTDYWILINSNTNGVVYESTNNSDFWTAVIAVEPHVDPVDRQYNVFGENKQFNVRNDSDKWKFLEMFRGSSQSDFYNRRTLTSD
- 09:22:49.114 ERROR: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:
- 09:22:49.114 ERROR: unrecognized input file format in '0'
- 09:22:49.114 ERROR: line = MNVFTSSVGSVEFDHPLLLENDLTSLSINCDDVHCSSRALCYIYDIHSSRHPSIDEHQFLRLLHGPDDAVTLGSFLKTLIWILSHDKNLPEEYRL
@martin-steinegger Hi, could you give some advice for this?
09:22:49.114 ERROR: unrecognized input file format in '0'
Hh-suite tries to figure out the input file format (A3M, HHM, FAS). Somehow it can not recognize your input. Could you check if the input is a valid formatted?
@martin-steinegger my input is fasta file, which include sub sequences from nr database, I used ffindex_from_fasta, but get the following error: ffindex_from_fasta -s segment_fas.ff{data,index} segment.faa data file: segment_fas.ffdata index file: segment_fas.ffindex fasta file: combine.segment.faa Segmentation fault (core dumped) Then, I followed the advice of milot-mirdita: https://github.com/soedinglab/hh-suite/issues/262#issuecomment-831439909 but get above new errors
mmseqs createdb
is not compatible with HH-suite. Have you tried to apply the fix suggested by @apcamargo to ffindex_from_fasta
. I guess it crashes because there there are too long sequences in the NR.
@martin-steinegger The longest sequence is 4403 in my fasta file, so maybe it's not the problem. I'm not sure, if some sequence name is not OK for HHsuite, for example several "^A" signal within the second sequence name
QLL91252.1 VP7 [Human rotavirus A] MYGIEYTTILTFLISIILLNYILKSITNIMDFIIYRFLLIFVVILPFIKAQNYGINLPIT GSMDTAYANSTQQENFMTSTLCLYYPSSVTTEITDPDWTSTLSQLFLTKGWPTNSVYFKS YADISSFSVDPQLYCDYNIVLIQYQNSLALDVSELADLILNEWLCNPMDVTLYYYQQTDE TNKWISMGESCTVKVCPLNTQTLGIGCTTTDVTTFEEVANAEKLVITDVVDGVNHKINIT VNTCTIRNCKKLGPRENVAIIQVGSSDVIDITADPTTIPQTERMMRINWKKWWQVFYTVV DYINQIVQVMSKRSRSLNSAAFYYRI ACU87638.1 VP3 [Human rotavirus A]^AAKD32050.1 VP3 [Rotavirus A]^AAKD32052.1 VP3 [Rotavirus A]^AAKD32079.1 VP3 [Rotavirus A] MKVLALRHSVAQVYADTQTYLHDDSKDEYENAFLISNLTTHNILYLNYSLKTLKILNKSG IAAVEVQSPDELFTLIRCNFTYDYENSIIYLHDYSYYTNNEIRTDQHWITKTDITDYLLP GWKLTYVGYNGKNTRGHYNFSFLCQNAATDDDIIIEYIYSNELDFQNFLLRKIKERMTTS LPIARLSNRVFRDKLFPSIVNMYEKVINVGPRNESMFTFLNFPTIKQFSNGAYIVKHTIK
I am not sure either. I would keep the header and header ids short < 31 and try rerunning it.
@martin-steinegger Maybe it's not the problem of sequence name, because I could get partial segment_fas.ffindex and segment_fas.ffdata file ,it can formally recognize the name, but I have no idea why it interrupt, I could support my fasta file for you if needed.
This code diverges far from mainline and violates the license that asks to add ones name if one distributes modified versions.
Mainline is: https://github.com/ahcm/ffindex
Unlike mainline they allocate this on the stack for whatever reason: char entry[MAX_ENTRY_LENGTH];
If you make this larger than your stacksize the executable will not start.
This is also introduces a severe security bug.
I'm trying to index a 6-frame translation of a genome to use it in a hhblits search, but I get a segmentation fault in ffindex_from_fasta. This might be due to the huge length of the sequences, as I haven't had this problem before.
This is surely not a typical use case, but it could be very useful to detect domains in genomes where gene-calling is not trivial.
UPDATE: This ended up being pretty simple to fix by modifying this line. Is there any reason for this limit?