soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
547 stars 135 forks source link

segmentation fault in 'ffindex_from_fasta' when indexing very long proteins #262

Open apcamargo opened 3 years ago

apcamargo commented 3 years ago

I'm trying to index a 6-frame translation of a genome to use it in a hhblits search, but I get a segmentation fault in ffindex_from_fasta. This might be due to the huge length of the sequences, as I haven't had this problem before.

This is surely not a typical use case, but it could be very useful to detect domains in genomes where gene-calling is not trivial.

ffindex_from_fasta -s GCF_000008865.2_ASM886v2_6frame.{data,index} GCF_000008865.2_ASM886v2_6frame.faa

data file: GCF_000008865.2_ASM886v2_6frame.data
index file: GCF_000008865.2_ASM886v2_6frame.index
fasta file: GCF_000008865.2_ASM886v2_6frame.faa
[1]    42358 segmentation fault  ffindex_from_fasta -s GCF_000008865.2_ASM886v2_6frame.{data,index}

UPDATE: This ended up being pretty simple to fix by modifying this line. Is there any reason for this limit?

milot-mirdita commented 3 years ago

Thanks for digging through that, HH-suite does a lot of useless stack allocations like that sadly. And since it's currently completely unfunded there is none to go through them and fix them up.

I would have recommended to just use mmseqs createdb and symlink the data to .ffdata and the .index to .ffindex. Or alternatively create the .ffindex immediatly in the right order with the following command:

LC_ALL=C sort db.index > db.ffindex

Most tools should sort the .ffindex anyway on first use (iirc), this should be more efficient though.

apcamargo commented 3 years ago

Thanks for the suggestion, @milot-mirdita!

zhaoxvwahaha commented 3 years ago

@milot-mirdita , Hi milot-mirdita, I followed your advice to use mmseqs createdb, but I encounter new error, do you have any idea about this: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:

zhaoxvwahaha commented 3 years ago

@milot-mirdita , Hi milot-mirdita, I followed your advice to use mmseqs createdb, but I encounter new error, do you have any idea about this: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:

  • 09:22:49.112 ERROR: unrecognized input file format in '1'
  • 09:22:49.112 ERROR: line = YAPVNWGHGEINDSTTVEPILDGPYQPTTFTPPTDYWILINSNTNGVVYESTNNSDFWTAVIAVEPHVDPVDRQYNVFGENKQFNVRNDSDKWKFLEMFRGSSQSDFYNRRTLTSD
  • 09:22:49.114 ERROR: Error in hhsuite/hh-suite-master/src/hhfunc.cpp:83: ReadQueryFile:
  • 09:22:49.114 ERROR: unrecognized input file format in '0'
  • 09:22:49.114 ERROR: line = MNVFTSSVGSVEFDHPLLLENDLTSLSINCDDVHCSSRALCYIYDIHSSRHPSIDEHQFLRLLHGPDDAVTLGSFLKTLIWILSHDKNLPEEYRL

@martin-steinegger Hi, could you give some advice for this?

martin-steinegger commented 3 years ago
09:22:49.114 ERROR: unrecognized input file format in '0'

Hh-suite tries to figure out the input file format (A3M, HHM, FAS). Somehow it can not recognize your input. Could you check if the input is a valid formatted?

zhaoxvwahaha commented 3 years ago

@martin-steinegger my input is fasta file, which include sub sequences from nr database, I used ffindex_from_fasta, but get the following error: ffindex_from_fasta -s segment_fas.ff{data,index} segment.faa data file: segment_fas.ffdata index file: segment_fas.ffindex fasta file: combine.segment.faa Segmentation fault (core dumped) Then, I followed the advice of milot-mirdita: https://github.com/soedinglab/hh-suite/issues/262#issuecomment-831439909 but get above new errors

martin-steinegger commented 3 years ago

mmseqs createdb is not compatible with HH-suite. Have you tried to apply the fix suggested by @apcamargo to ffindex_from_fasta. I guess it crashes because there there are too long sequences in the NR.

zhaoxvwahaha commented 3 years ago

@martin-steinegger The longest sequence is 4403 in my fasta file, so maybe it's not the problem. I'm not sure, if some sequence name is not OK for HHsuite, for example several "^A" signal within the second sequence name

QLL91252.1 VP7 [Human rotavirus A] MYGIEYTTILTFLISIILLNYILKSITNIMDFIIYRFLLIFVVILPFIKAQNYGINLPIT GSMDTAYANSTQQENFMTSTLCLYYPSSVTTEITDPDWTSTLSQLFLTKGWPTNSVYFKS YADISSFSVDPQLYCDYNIVLIQYQNSLALDVSELADLILNEWLCNPMDVTLYYYQQTDE TNKWISMGESCTVKVCPLNTQTLGIGCTTTDVTTFEEVANAEKLVITDVVDGVNHKINIT VNTCTIRNCKKLGPRENVAIIQVGSSDVIDITADPTTIPQTERMMRINWKKWWQVFYTVV DYINQIVQVMSKRSRSLNSAAFYYRI ACU87638.1 VP3 [Human rotavirus A]^AAKD32050.1 VP3 [Rotavirus A]^AAKD32052.1 VP3 [Rotavirus A]^AAKD32079.1 VP3 [Rotavirus A] MKVLALRHSVAQVYADTQTYLHDDSKDEYENAFLISNLTTHNILYLNYSLKTLKILNKSG IAAVEVQSPDELFTLIRCNFTYDYENSIIYLHDYSYYTNNEIRTDQHWITKTDITDYLLP GWKLTYVGYNGKNTRGHYNFSFLCQNAATDDDIIIEYIYSNELDFQNFLLRKIKERMTTS LPIARLSNRVFRDKLFPSIVNMYEKVINVGPRNESMFTFLNFPTIKQFSNGAYIVKHTIK

martin-steinegger commented 3 years ago

I am not sure either. I would keep the header and header ids short < 31 and try rerunning it.

zhaoxvwahaha commented 3 years ago

@martin-steinegger Maybe it's not the problem of sequence name, because I could get partial segment_fas.ffindex and segment_fas.ffdata file ,it can formally recognize the name, but I have no idea why it interrupt, I could support my fasta file for you if needed.

ahcm commented 3 years ago

This code diverges far from mainline and violates the license that asks to add ones name if one distributes modified versions.

Mainline is: https://github.com/ahcm/ffindex

Unlike mainline they allocate this on the stack for whatever reason: char entry[MAX_ENTRY_LENGTH];

If you make this larger than your stacksize the executable will not start.

This is also introduces a severe security bug.