shaunpwilkinson / insect

Informatic Sequence Classification Trees

memory efficiency #2

Open · Andreas-Bio opened this issue 6 years ago

Andreas-Bio commented 6 years ago

Hello, is it possible to optimize the memory efficiency? I have a 60 GB RAM server, but the training fails because the RAM fills up: after the message "Deriving top level model" I get "cannot allocate vector of size 4.9GB". It's a plant ITS dataset. I used "autodetect" as the cores parameter.

151,000 sequences, mean length 600 bp

Thanks!

shaunpwilkinson commented 6 years ago

It sounds like a memory overflow caused by an over-sized alignment, likely a combination of the long sequence length, the high variation between sequences (causing extra sparsity/gappiness) and the large number of training sequences. Is this classifier going to be used to identify NGS short reads? If so, it would pay to trim the training sequences down using a virtual PCR (insect::virtualPCR) with the primers used in the run. Two other options that may help (both sketched in code below):

1. Pass a profile HMM to the learn function via the model argument. You could create a small (e.g. n = 10,000) random subset of your training sequences, derive a profile HMM with model <- aphid::derivePHMM(xsub, cores = "autodetect"), and then train the classifier by running tree <- insect::learn(x, db = db, model = model, cores = "autodetect").
2. Run tree <- insect::learn(x, db = db, refine = "BaumWelch", cores = "autodetect", recursive = FALSE) to learn the first node without generating an alignment, and then tree <- expand(tree, x, refine = "Viterbi", cores = "autodetect", recursive = TRUE) to continue learning the classifier using the Viterbi training (alignment-dependent) method.
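
For concreteness, here's a minimal sketch of both options; it assumes x is your DNAbin training set and db your taxonomy database, and the subset size of 10,000 is just a starting point:

```r
library(insect)
library(aphid)

## option 1: seed the top-level model from a small random subset
set.seed(1)
xsub <- x[sample(seq_along(x), 10000)]
model <- derivePHMM(xsub, cores = "autodetect")
tree <- learn(x, db = db, model = model, cores = "autodetect")

## option 2: learn the first node alignment-free (Baum-Welch),
## then expand the tree using Viterbi training
tree <- learn(x, db = db, refine = "BaumWelch",
              cores = "autodetect", recursive = FALSE)
tree <- expand(tree, x, refine = "Viterbi",
               cores = "autodetect", recursive = TRUE)
```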

Andreas-Bio commented 6 years ago

Thank you for your quick reply. The sequences are already trimmed, and it is indeed intended for NGS, but for long NGS reads (PacBio). The primers and flanking sequences are already removed. That subset idea sounds really good, I'll try that and report back. I tried using one core only, but that doesn't help either.

Update 1: I used only one sequence per species for training. Looking good so far (screenshot attached).
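
In case it helps anyone else, this is roughly how I subset the training set; it assumes the sequence names follow the accession|taxID convention (adjust the pattern if your names differ):

```r
## keep the first sequence encountered for each taxon ID
## (names assumed to look like "AB123456|12345")
taxids <- gsub(".+\\|", "", names(x))
xsub <- x[!duplicated(taxids)]
```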

Update 2: RAM was fine this time, but another error popped up (screenshot attached).

shaunpwilkinson commented 6 years ago

That's a new error message for me; I'm not sure what's causing it. Would you be open to sharing your training dataset so I could have a go at isolating the problem? If you create a public Dropbox link to the file, I can download it onto my server and take a look.

Also, I've done some more optimizing over the last couple of weeks, namely placing a few gc() and rm() calls in the alignment and HMM-derivation functions, which has helped a bit. Are you using the development version of the package? If not, it might help to install the latest version from GitHub and try that. Cheers, Shaun
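
PS a typical way to install the development version, assuming you have the devtools package available:

```r
## install the development version of insect from GitHub
devtools::install_github("shaunpwilkinson/insect")
```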

shaunpwilkinson commented 6 years ago

It looks like it is still an issue with over-sized alignments. After each Viterbi training iteration, aphid::train digests the alignment using openssl::md5 to check whether it is identical to the previous alignment (in which case the iterations terminate and a profile HMM is returned for the final alignment). Running openssl::md5(raw(1E10)) produces the same error you received, so it looks like the Viterbi training is simply generating huge, gappy alignments, likely due to the large variation between the sequences. I'll have a play around with it, but I suspect the best solution will be to derive a model from a small subset of the sequences and remove any from the full set that don't score highly against the model (hopefully there will only be a handful); see the sketch below. Would that be an option for you? I'm more than happy to help with this.
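
Something along these lines should work; it's only a sketch, and it assumes aphid::Viterbi accepts a single DNAbin sequence and returns a list with a log-odds $score element, with the threshold a placeholder to tune by eye:

```r
library(aphid)

## derive a model from a small random subset of the training sequences
set.seed(1)
model <- derivePHMM(x[sample(seq_along(x), 10000)], cores = "autodetect")

## score each sequence in the full set against the model
scores <- sapply(seq_along(x), function(i) Viterbi(model, x[[i]])$score)

## inspect the score distribution, then drop the poorly-scoring outliers
hist(scores)
x <- x[scores > -100]  ## placeholder threshold; adjust after inspection
```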

Andreas-Bio commented 6 years ago

Thanks for all the help; just some thoughts. I am a Markov novice and still unsure how Viterbi builds the alignments internally, but in the case of ITS the ideal approach (I guess) would be to build the alignments using the available taxonomic information, no? Like first building alignments at family level, deriving HMMs from those, and then somehow aligning the HMMs to get a class-level alignment, and so forth (sketched in code below). That would prevent over-inflation, because sequences from different families would never be aligned simultaneously. It would probably lead to some loss of information at family level and above, but the chance of identifying a barcode above family level should be nigh perfect anyway. It would also reduce the memory requirements dramatically, making it possible to run almost any kind of dataset on a good desktop computer.

I was also thinking about the option of using the MAFFT alignment algorithm internally whenever possible, because it has some very handy options that could be passed through to the function, like --unalignlevel or --leavegappyregion. If MAFFT can be used internally, then it is only one step further to the jackpot: http://guidance.tau.ac.il/ver2/fig_method.server-1.png At this step a GUIDANCE2 score can be computed, which bootstraps the alignment and assigns confidence values to each alignment position; these could be used to train the model more reliably.

Finally, if one can be sure that the gaps are positioned in a meaningful way, it would be great to have some kind of gap-coding option. https://www.researchgate.net/profile/Helga_Ochoterena2/publication/11261221/figure/fig1/AS:601664843173912@1520459485443/Example-of-simple-gap-coding-a-Alignment-Circled-no-1-gap-1-from-positions-3-to-9_W640.jpg
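
To make the family-level idea concrete, here's a rough sketch of what I have in mind, where family is a hypothetical character vector giving the family of each training sequence (how the per-family models would later be merged into higher-level alignments is the open question):

```r
## hypothetical sketch: align and model each family separately, so
## sequences from different families are never aligned together.
## 'family' is an assumed character vector, one entry per sequence in x.
library(aphid)
models <- lapply(split(seq_along(x), family), function(idx) {
  derivePHMM(x[idx], cores = "autodetect")
})
```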

Making alignments in barcoding is something of a double-edged sword. By making an alignment, one automatically makes assumptions about the phylogenetic background of the sequences (homology through ancestry), which is not really something a barcoder wants to mess with. On the other hand, a meaningful alignment has the potential to yield much information that was hidden before. Ultimately, justifying the alignment of highly divergent sequences with a lot of indels is easier if one can show that the information gained from the alignment has been maximized. The worst thing that could happen is that a complicated set of sequences gets aligned in the wrong way because the alignment algorithm is too ancient or primitive. I have seen phylogenetic papers out there where lycophytes came out as a sister taxon to conifers, just because some genius used CLUSTAL to align ITS sequences across multiple classes.

PS: What would you consider the main advantage of insect over Metaxa2?

shaunpwilkinson commented 5 years ago

Thanks for your post and the links. Incorporating the MAFFT algorithm is a really good idea; I'll look further into it to see whether we can reduce the memory usage a bit that way.

The accuracy of the alignments is not as important here as it is for phylogenetic applications; they are more a means to an end, used to derive the profile HMMs that assign sequences to taxa. In fact, the classifier can be trained without internal alignments at all by using the Baum-Welch algorithm to train the models, although it's a bit slower that way. Having said that, improving the speed and accuracy of the alignments for the default Viterbi training method would certainly help!

To answer your last question, I'm pitching this as a probabilistic classification method, rather than a nearest-neighbor search or a k-mer-based heuristic. I've been doing some benchmarking using the RDP and Warcup training sets and found the false discovery rate to be several times lower than that of the RDP naive Bayesian classifier and other similar methods. I haven't tested against Metaxa2 yet, but I'll aim to include that comparison too.

Thanks again! Shaun