rdpstaff / RDPTools

Collection of commonly used RDP Tools for easy building

RDP training using custom trainset #20

Open najoshi opened 6 years ago

najoshi commented 6 years ago

I am trying to use the "train" subcommand of classifier.jar to build a new database from custom data. The problem is that my taxonomy data is in QIIME format. I want to convert it to the RDP format, but I can't find any good documentation on the RDP taxonomy format. I can certainly write some code to do the conversion, but I need to know the details of the format. My QIIME-formatted file has lines that look like this:

AcrMAj74N231 k_Animalia; p_Nematoda; c_Chromadorea; o_Rhabditida; f_Cephalobidae; g_Acrobeles; s_maeneeneus

And RDP needs lines that look like this (I got this from https://sourceforge.net/projects/rdp-classifier/):

7*Acidimicrobiaceae*6*7*family

The only thing I've found is that the lines are in this format:

taxid*taxon name*parent taxid*depth*rank
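
To make the five fields concrete, here is a minimal Python sketch that splits one such line on `*`. The names `TaxonLine` and `parse_taxon_line` are hypothetical, and the sketch assumes that `*` never occurs inside a taxon name and that taxid, parent taxid and depth are integers, as in the sample line.

```python
from typing import NamedTuple


class TaxonLine(NamedTuple):
    """One taxonomy line: taxid*taxon name*parent taxid*depth*rank."""
    taxid: int
    name: str
    parent_taxid: int
    depth: int
    rank: str


def parse_taxon_line(line: str) -> TaxonLine:
    # Assumption: '*' is only ever a field separator, never part of a name.
    fields = line.rstrip("\n").split("*")
    if len(fields) != 5:
        raise ValueError(f"expected 5 '*'-separated fields, got {len(fields)}: {line!r}")
    taxid, name, parent_taxid, depth, rank = fields
    return TaxonLine(int(taxid), name, int(parent_taxid), int(depth), rank)


if __name__ == "__main__":
    print(parse_taxon_line("7*Acidimicrobiaceae*6*7*family"))
```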

However, there are some problems with the sample files that I have. For example, the taxids do not match the taxon: the taxid for Acidimicrobiaceae is 84994, not 7. I also have no idea what "depth" means. How do I calculate it?
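
A minimal Python sketch of one possible conversion follows. It rests on assumptions that the sample line does not contradict but that are not confirmed here: taxids are arbitrary integers that only need to be unique within the file (so they need not match any external database), "depth" is the number of levels between a taxon and the root (root = 0), and the file starts with a root line `0*Root*-1*0*rootrank`. The function name `qiime_to_taxon_lines` and the `RANKS` mapping are hypothetical, and the rank names chosen for each QIIME prefix are a guess.

```python
import re

# Assumption: QIIME one-letter prefixes mapped to rank names of my choosing.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def qiime_to_taxon_lines(qiime_lines):
    """Build 'taxid*name*parent taxid*depth*rank' lines from QIIME lineages."""
    out = ["0*Root*-1*0*rootrank"]  # assumed root entry
    ids = {}  # (parent taxid, rank, name) -> taxid assigned in this file
    for raw in qiime_lines:
        parts = raw.strip().split(None, 1)  # sequence ID, then the lineage
        if len(parts) < 2:
            continue
        parent = 0
        for depth, token in enumerate(parts[1].split(";"), start=1):
            # Accept both 'k_Name' and 'k__Name' style prefixes.
            m = re.match(r"([a-z])_+(.*)", token.strip())
            rank = RANKS.get(m.group(1)) if m else None
            if rank is None or not m.group(2):
                break  # stop at the first empty or unrecognized level
            key = (parent, rank, m.group(2))
            if key not in ids:
                ids[key] = len(ids) + 1  # sequential, file-local taxid
                out.append(f"{ids[key]}*{m.group(2)}*{parent}*{depth}*{rank}")
            parent = ids[key]
    return out


if __name__ == "__main__":
    sample = ("AcrMAj74N231 k_Animalia; p_Nematoda; c_Chromadorea; "
              "o_Rhabditida; f_Cephalobidae; g_Acrobeles; s_maeneeneus")
    print("\n".join(qiime_to_taxon_lines([sample])))
```

Keying each node on its parent as well as its name keeps two taxa with the same name under different parents from being merged into one entry.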

Any help would be highly appreciated. Thanks!