shaunpwilkinson / insect

Informatic Sequence Classification Trees
14 stars 4 forks source link

learn, error with 18s data #3

Closed ngeraldi closed 5 years ago

ngeraldi commented 6 years ago

I am trying to run the learn function with a custom 18s database (Silva.v132 trimmed with Euka02 primer, from Taberlet et al 2018). I have put the reference data, the error and R version info below.

More info on database creation is on my github page https://github.com/ngeraldi/insect-pipeline/blob/master/insect_ref_lib_18s.R

Any help would be much appreciated and let me know if you need any additional info. Thank you Nathan nathan.geraldi@kaust.edu.sa

Database is here : https://www.dropbox.com/s/q0rm6rcs44fa9h2/SILVA_132_SSURef_Nr99_tax_silva_trunc_DNA_insect_trim_and_format_euka02.fasta?dl=0

Code, error, version: x<-insect::readFASTA(file ="SILVA_132_SSURef_Nr99_tax_silva_trunc_DNA_insect_trim_and_format_euka02.fasta") db<-insect::taxonomy(db = "NCBI", synonyms = F) # download ncbi taxonomy database

euka02_learn<- learn(x, db, model = NULL, refine = "Viterbi", iterations = 50, nstart = 20, minK = 2, maxK = 2, minscore = 0.9, probs = 0.5, retry = TRUE, resize = TRUE, maxsize = max(sapply(x, length)), recursive = TRUE, cores = "autodetect", quiet = F)

R console output: Training classifier Converting taxon IDs to full lineage strings Initializing tree object Deriving sequence weights Making hash key for exact sequence matching Dereplicating sequences Deriving top level model Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent

version() platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.3
year 2017
month 11
day 30
svn rev 73796
language R
version.string R version 3.4.3 (2017-11-30) nickname Kite-Eating Tree

shaunpwilkinson commented 6 years ago

Hi Nathan, Thanks for the post. It looks like that fasta file contains no sequence information - only the header lines. Your code in the database curation file looks fine to me, so I'm not sure why the sequences have been omitted. Did the virtualPCR function run okay (i.e. did it print something like "retained 70960 sequences after reverse trim")? Cheers, Shaun

ngeraldi commented 6 years ago

Hi Shaun, thanks for the very quick reply and sorry for not checking the basics. I actually lost the sequences when removing those with taxID not in db (using row.names to add column to index list of sequences are characters not numbers:( ). Running now, thanks for this package!
Will let you know how it goes. I just save output of learn() with writeRDS() to use in classify?

Best Nathan

shaunpwilkinson commented 6 years ago

Hi Nathan, No problem at all, glad you got it working. Yes the classifier can be quite slow to run so it would pay to run either saveRDS(euka02_learn, "euka02_learn.rds") or save(euka02_learn, "euka02_learn.RData") to back up the tree once it has finished. For the classify function the tree just needs to be loaded in memory, so you can use either readRDS or load depending on how you saved it (I would generally recommend using saveRDS and readRDS since the files are smaller and load faster). Cheers, Shaun