shaunpwilkinson / insect

Informatic Sequence Classification Trees

Database request ITS2 fungi+plant #1

Open Andreas-Bio opened 6 years ago

Andreas-Bio commented 6 years ago

Could you maybe train a classifier for the ITS2 database that is commonly used by a lot of folks? It has ~370,000 sequences, and some working groups lack the computational resources. That would be great. I am trying to get things running to test the pipeline with my matK database, and can make that available if it works.

Could you please clarify how the learning step is done? Is something being aligned internally? I am a little concerned about the "globally alignable" criterion, because for ITS (in plants, for example) that is not true. It has so many spontaneous indels (not derived from an ancestor) that, if you zoom out far enough (to the taxonomic level of order, maybe), it very obviously violates the homology criterion, which is the basis of every alignment. It is alignable within one family (maybe even a few), but if you try to align ITS across the whole plant kingdom, the alignment gets >10,000 bp long and is >90% gaps. So is this package perhaps not suitable for ITS (if it is based on some kind of internal alignment)?

Thanks!

PS: I see you also noticed that some GenBank queries give irreproducible errors. I usually split my query up into many fast queries (using R package rentrez functions; around 2,000 sequences per query) and wrap each one like this. There is probably a better way to do it, but it works. If you download hundreds of thousands of sequences on a weekly basis, this is a real life saver.

## Retry a function call up to max_attempt times, returning NULL if
## every attempt fails. Note: a fun() that legitimately returns NULL
## will also be retried until the attempt limit is reached.
repeat_on_fail <- function(fun, max_attempt = 3, ...)
{
  r <- NULL
  attempt <- 1
  while (is.null(r) && attempt <= max_attempt)
  {
    attempt <- attempt + 1
    try(r <- fun(...))  # on error, r stays NULL and we retry
  }
  r
}
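To illustrate, here is a self-contained demo of the wrapper (the definition is repeated so the example runs on its own). The `flaky_fetch` function is made up for illustration: it errors on its first two calls and succeeds on the third, standing in for something like a rentrez download call:

```r
## repeat_on_fail, repeated here so the example is self-contained
repeat_on_fail <- function(fun, max_attempt = 3, ...)
{
  r <- NULL
  attempt <- 1
  while (is.null(r) && attempt <= max_attempt)
  {
    attempt <- attempt + 1
    try(r <- fun(...))
  }
  r
}

## a mock "flaky" function: errors on its first two calls, then succeeds
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "SEQUENCE DATA"
}

result <- repeat_on_fail(flaky_fetch, max_attempt = 5)
## result is "SEQUENCE DATA"; flaky_fetch was called 3 times
```

The failed attempts still print their error messages (the `try()` call is not silent), which is handy for spotting queries that fail repeatedly.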
shaunpwilkinson commented 6 years ago

Hi @andzandz11 , Thanks very much for your post. I'm more than happy to train some ITS2 classifiers, do you have any particular primer sets in mind?

To train the classifier, the learn function first derives a profile hidden Markov model from the training set and then splits the training set recursively, training a new profile HMM on each subset. By default, the function uses the Viterbi training method to retrain the models at each split: it aligns the sequences using the parent profile HMM as a guide, derives a new profile HMM from the alignment, realigns the sequences using the new profile HMM as a guide, and repeats this align-and-derive cycle until the alignments stop changing between iterations. You're absolutely right that aligning ITS2 sequences across the tree of life doesn't really work for phylogenetics etc., but it seems to work well for this application. If you are concerned about the size/accuracy of the alignments, there is also an option to set refine = "BaumWelch" in the learn function, but this is considerably slower, and in my initial testing it doesn't seem to have a major effect on accuracy (I do need to test this more though).
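For reference, a training run might be invoked along these lines. This is a sketch only: the learn function and its refine argument are discussed above, but the object names (seqs, taxonomy) and the db argument are assumptions here, and the call is not runnable without a reference sequence set and taxonomy database:

```r
## sketch: train a classification tree from a reference set
## `seqs` would hold the trimmed ITS2 reference sequences and
## `taxonomy` the matching taxon database (names assumed)
tree <- learn(seqs, db = taxonomy, refine = "Viterbi")     # default, faster
# tree <- learn(seqs, db = taxonomy, refine = "BaumWelch") # slower alternative
```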

Thanks for your code snippet - I'll give that a try! Cheers, Shaun

Andreas-Bio commented 6 years ago

All ITS2 primer sets amplify the whole ITS2 region, if that is what you mean. Usually when constructing an ITS2 database you take all the ITS2 sequences you can find and retain either validated complete sequences or sequences containing the conserved overhangs. The overhangs can be removed with ITSx (HMMER), but usually that is not important, because they are so conserved that they don't play a role in identification. There is also a big chunk of sequences with unclear status, but if they have a predefined minimum length I keep them, even if a few base pairs are missing on both ends. More information on ITS primers here: https://www.ncbi.nlm.nih.gov/pubmed/26084789 and here is an example of a commonly used ITS2 database: http://its2.bioapps.biozentrum.uni-wuerzburg.de/

shaunpwilkinson commented 6 years ago

I usually leave the conserved ends on since they often play an important role in the first few recursive decisions before the noisy non-coding region kicks in towards the tips of the classification tree. I find having both the fast- and slow-evolving regions generally gives better discriminatory power overall. One option is to just include a handful of residues either side of the spacer, in order to capture as many training sequences as possible (i.e. sequences generated with a wide range of different primer sets). I'll make it a priority to generate a plant ITS2 classifier as soon as I get a chance. Thanks for those links too, will check them out.