gobics / uproc

Tools for ultra-fast protein sequence classification.
http://uproc.gobics.de/
GNU Lesser General Public License v3.0

Building UProC database for SEED #17

mdehollander closed this issue 8 years ago

mdehollander commented 8 years ago

Hi,

I tried to run uproc-makedb, but it fails:

uproc-makedb /data/db/uproc/model/ /data/db/SEED/ProblemSets.2015.11/all.faa /data/db/uproc/seed/
fwd.ecurve: [                    ]   0.0%uproc_idmap_family() [idmap.c:70]: error building ecurves: idmap exhausted: no such object

What is the correct command to build a custom database?

rmartinjak commented 8 years ago

uproc-makedb uses the FASTA header line (up to the first whitespace character) as the identifier for the protein family. Currently, we use a 16-bit integer to represent families, so only 65535 families are possible (for example, PFAM 28.0 has 16230). SEED/FIGfams seems to have more than 100k families. I'm not sure whether UProC is still suitable for such a fine-grained distinction of families; I'll discuss it with our group.
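To check how many families a FASTA file would produce under the rule described above, a small script along these lines may help. This is only a sketch in Python, not UProC code: `family_ids`, `count_families`, and `MAX_FAMILIES` are names made up for illustration, and the sample records are invented.

```python
import io

# Limit stated above for a 16-bit family identifier (not read from UProC itself).
MAX_FAMILIES = 65535


def family_ids(handle):
    """Yield the header text after '>' up to the first whitespace character."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            yield fields[0] if fields else ""


def count_families(handle):
    """Count the distinct family identifiers in a FASTA stream."""
    return len(set(family_ids(handle)))


# Invented sample: two records share the identifier FAM1, one uses FAM2.
sample = io.StringIO(">FAM1 seq one\nMKV\n>FAM1 seq two\nMKL\n>FAM2\nMAA\n")
n = count_families(sample)
print(n, "families;", "fits" if n <= MAX_FAMILIES else "too many")
```

For a real file, pass `open("all.faa")` instead of the in-memory sample. If every sequence has a unique header, the count will equal the number of sequences, which would explain exceeding the limit.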

If you want to try it, edit the file libuproc/include/uproc/common.h and replace all occurrences of 16 with 32 and recompile (loading old imported databases will no longer work).

mdehollander commented 8 years ago

Hi,

Thanks for the information. Please let me know whether you think UProC is suitable for use with the SEED/FIGfams database. I would like to first use KO/Pfam on our data and then run the unclassified sequences against SEED. As far as I can see, the SEED ProblemSets.2015.11 release has 23,393 families containing 95,108,372 (!) sequences. This differs from the information given on the FIGfams website (http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/FIGfamDescription#What_Are_the_FIGfams), but that page hasn't been updated since 2008... I was also thinking of TIGRFAMs, although that dataset is much smaller (4,424 families, 55,349 sequences). It would be nice to hear your (group's) thoughts on this.

rmartinjak commented 8 years ago

23,393 families should be doable. Maybe the way they are represented in the FASTA file is the problem: make sure every sequence that belongs to the same family has the same FASTA header. A large number of sequences should not cause uproc-makedb to fail.
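A rough Python sketch of such a header rewrite follows. The input layout is an assumption: it pretends each header looks like `>sequence_id family_id ...`, which may not match the actual SEED file, so `second_field` is a hypothetical extractor that would need adapting. `rewrite_headers` is likewise an illustrative name, not part of UProC.

```python
import io


def rewrite_headers(src, dst, family_of):
    """Copy a FASTA stream, replacing each header with its family identifier.

    family_of is a caller-supplied function that maps the original header
    text (without the leading '>') to the family identifier.
    """
    for line in src:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            dst.write(">" + family_of(header) + "\n")
        else:
            dst.write(line)


def second_field(header):
    """ASSUMED layout 'sequence_id family_id ...'; adapt to the real file."""
    return header.split()[1]


# Invented sample data in the assumed layout.
src = io.StringIO(">seq1 FAM1\nMKV\n>seq2 FAM1\nMKL\n>seq3 FAM2 extra\nMAA\n")
out = io.StringIO()
rewrite_headers(src, out, second_field)
print(out.getvalue(), end="")
```

After the rewrite, all sequences of a family carry an identical header, so the identifier taken up to the first whitespace is the family itself.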

mdehollander commented 8 years ago

Indeed, the problem is the formatting of the FASTA file. I will rewrite the headers and try again.