UniRef30_2020_03_hhm.ffdata cannot be unpacked.

sarahshah commented 4 years ago

I was trying to modify the UniRef30 database (from the file UniRef30_2020_03_hhsuite.tar.gz at http://wwwuser.gwdg.de/~compbiol/uniclust/2020_03/) to remove all non-transposable elements. I was unpacking the UniRef30_2020_03_hhm.ffdata so that I could get the file names that contained sequences that I was interested in, but I encountered an error:

ffindex_unpack UniRef30_2020_03_hhm.ffdata UniRef30_2020_03_hhm.ffindex extracted_hhm/ .

The standard output showed "Segmentation fault" and the unpacking process was stopped after only extracting two files (100000051, 100000011). I was doing this from a PBS-based cluster, and I even used a node with 500GB memory, but got the same result.

I was wondering whether the file UniRef30_2020_03_hhsuite.tar.gz at http://wwwuser.gwdg.de/~compbiol/uniclust/2020_03/ contains an incomplete _hhm.ffdata file?

milot-mirdita commented 4 years ago

I tried to reproduce the issue but couldn't. I stopped the process after a few thousand files had been extracted.

However, I would recommend not removing the entries in this way. The hhm db only contains models for large alignments, since computing the small ones on the fly saves a lot of disk space. If a model is not found in the hhm db, it is recomputed on the fly from the a3m, so deleting an entry from the hhm db alone does not remove it from the database.
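If the goal is only to see which entries have a precomputed model, the index can be read directly without unpacking anything. A minimal sketch, assuming the usual ffindex layout of one tab-separated `name`, `offset`, `length` record per line, with each data entry terminated by a null byte; the helper names `read_ffindex` and `read_entry` are made up for illustration and are not part of hh-suite:

```python
def read_ffindex(index_path):
    """Return a list of (name, offset, length) tuples from an .ffindex file.

    Assumes tab-separated records: name, byte offset, byte length.
    """
    entries = []
    with open(index_path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            name, offset, length = line.split("\t")
            entries.append((name, int(offset), int(length)))
    return entries


def read_entry(data_path, offset, length):
    """Read a single entry from the matching .ffdata file.

    The trailing null byte that terminates each entry is stripped.
    """
    with open(data_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length).rstrip(b"\0")
```

For example, `[name for name, _, _ in read_ffindex("UniRef30_2020_03_hhm.ffindex")]` would list the identifiers that have a stored HMM; anything missing from that list but present in the a3m index is one of the models computed on the fly.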

Instead, I would recommend making a list of all accessions you want to remove and then using the mapping file (http://wwwuser.gwdg.de/~compbiol/uniclust/2020_03/uniref_mapping.tsv.gz) to find the corresponding database identifiers. Then remove only the lines that contain these identifiers in the first column of the _cs219.ffindex file. Entries that are not found there are invisible to HHblits.
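The filtering step could be sketched roughly as follows. This is only an illustration under stated assumptions: the column order of uniref_mapping.tsv.gz is not described in this thread, so the column positions are parameters (`id_col`, `acc_col`) whose defaults are a guess and should be checked against the actual file; the function names are hypothetical as well.

```python
import gzip


def ids_to_remove(mapping_path, unwanted_accessions, id_col=0, acc_col=1):
    """Collect database identifiers whose accession is in unwanted_accessions.

    ASSUMPTION: the mapping file is tab-separated, with the database
    identifier in column id_col and the accession in column acc_col.
    Verify this against the real file before use.
    """
    opener = gzip.open if str(mapping_path).endswith(".gz") else open
    ids = set()
    with opener(mapping_path, "rt") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(id_col, acc_col):
                continue  # skip malformed or short lines
            if fields[acc_col] in unwanted_accessions:
                ids.add(fields[id_col])
    return ids


def filter_ffindex(in_path, out_path, remove_ids):
    """Copy an .ffindex file, dropping lines whose first column is in remove_ids.

    Returns (kept, dropped) line counts.
    """
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            name = line.split("\t", 1)[0]
            if name in remove_ids:
                dropped += 1
            else:
                dst.write(line)
                kept += 1
    return kept, dropped
```

Writing to a new file rather than editing UniRef30_2020_03_cs219.ffindex in place keeps the original index around as a backup; the filtered file can be swapped in once the kept/dropped counts look plausible.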

sarahshah commented 4 years ago

Thank you, your method worked.

soedinglab / hh-suite

UniRef30_2020_03_hhm.ffdata cannot be unpacked. #203