Closed mdehollander closed 8 years ago
uproc-makedb uses the FASTA header line (up to the first whitespace character) as the identifier for the protein family. Currently, we use a 16-bit integer to represent families, so only 65535 families are possible (for example, PFAM 28.0 has 16230).
SEED/FIGfams seems to have more than 100k families. I'm not sure whether UProC is still suitable for such a fine-grained distinction of families; I'll discuss it with our group.
If you want to try it, edit the file libuproc/include/uproc/common.h, replace all occurrences of 16 with 32, and recompile (loading old imported databases will no longer work).
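Before changing the integer width, it may help to count how many distinct family identifiers a FASTA file actually contains. The following is a minimal Python sketch, not part of UProC: it assumes the identifier rule described above (header text up to the first whitespace character) and the 65535 limit quoted above. The function names and the example records are made up for illustration.

```python
import io

# Family limit quoted in this thread for the 16-bit representation.
MAX_FAMILIES_16BIT = 2**16 - 1  # 65535


def family_ids(fasta_lines):
    """Yield one family identifier per FASTA record.

    The identifier is the header text up to the first whitespace character.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            yield fields[0] if fields else ""


def count_families(fasta_lines):
    """Return the number of distinct family identifiers."""
    return len(set(family_ids(fasta_lines)))


# Hypothetical three-record example; two records share one family.
example = io.StringIO(
    ">PF00001 seq1\nMKV\n>PF00001 seq2\nMKL\n>PF00002 seq3\nMAA\n"
)
n = count_families(example)
print(n, "distinct families; fits in 16 bits:", n <= MAX_FAMILIES_16BIT)
```

For a real file, pass an open file handle instead of the `io.StringIO` object. A count far above the expected number of families usually means the headers carry per-sequence identifiers rather than family identifiers.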
Hi,
Thanks for the information. Please let me know if you think UProC is suitable to use with the SEED/FIGfams database. I would like to first use KO/Pfam on our data, and then run the unclassified sequences against SEED. As far as I can see in the SEED ProblemSets.2015.11 release, there are 23,393 families containing 95,108,372 (!) sequences. This differs from the information given on the FIGfams website (http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/FIGfamDescription#What_Are_the_FIGfams), but that page hasn't been updated since 2008. I was also thinking of TIGRFAMs, although that dataset is much smaller (4,424 families, 55,349 sequences). It would be nice to hear your (group's) thoughts on this.
23,393 families should be doable. Maybe the way they are represented in the FASTA file is problematic: make sure every sequence that belongs to the same family has the same FASTA header. A large number of sequences should not cause uproc-makedb to fail.
Indeed, the problem is the formatting of the FASTA file. I will rewrite the headers and try again.
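A header rewrite of this kind could be sketched in Python as follows. This is only an illustration under stated assumptions: the original header layout of the SEED file is not shown in this thread, so the sketch assumes the first header token is a sequence ID and that a separate mapping from sequence ID to family ID is available. The function name `rewrite_headers`, the mapping, and the IDs such as `FIG000001` are hypothetical.

```python
import io


def rewrite_headers(fasta_in, fasta_out, seq_to_family):
    """Copy FASTA records, replacing each header with the record's family ID.

    Records whose sequence ID is not in seq_to_family are skipped.
    Returns the number of records written.
    """
    written = 0
    keep = False  # whether the current record is being copied
    for line in fasta_in:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            family = seq_to_family.get(seq_id)
            keep = family is not None
            if keep:
                # Every sequence of a family gets the identical header.
                fasta_out.write(f">{family}\n")
                written += 1
        elif keep:
            fasta_out.write(line)
    return written


# Hypothetical mapping and input; seq4 has no family and is dropped.
mapping = {"seq1": "FIG000001", "seq2": "FIG000001", "seq3": "FIG000002"}
source = io.StringIO(
    ">seq1 some description\nMKV\n>seq2\nMKL\n>seq3 x\nMAA\n>seq4\nMTT\n"
)
target = io.StringIO()
rewrite_headers(source, target, mapping)
print(target.getvalue(), end="")
```

Skipping unmapped records is a design choice made here for simplicity; depending on the data, it may be preferable to raise an error instead so that missing assignments are noticed.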
Hi,
I tried to run uproc-makedb, but it fails. What is the correct command to build a custom database?