gjospin / PhyloSift

Phylogenetic and taxonomic analysis for genomes and metagenomes

Fix 18S marker packages & add taxonomy #135

Closed hollybik closed 12 years ago

hollybik commented 12 years ago

The pipeline for 18S data is currently broken, and we need to add the NCBI taxonomy regardless. When pulling eukaryote taxonomy in from NCBI, the following levels are junky and uninformative for biologists, and I usually use the code below to strip them out in my own scripts. I suggest we prune these out to make the NCBI taxonomy more manageable in PhyloSift.

    # $qiime holds a semicolon-delimited NCBI lineage string.
    # Each substitution deletes one uninformative level plus its trailing ';'.
    $qiime =~ s/Fungi\/Metazoa group;//;
    $qiime =~ s/Eumetazoa;//;
    $qiime =~ s/Bilateria;//;
    $qiime =~ s/Pseudocoelomata;//;
    $qiime =~ s/Coelomata;//;
    $qiime =~ s/Acoelomata;//;
    $qiime =~ s/Protostomia;//;
    $qiime =~ s/Deuterostomia;//;
    $qiime =~ s/Panarthropoda;//;
    $qiime =~ s/Annelida\/Echiura\/Pogonophora group;//;
    $qiime =~ s/Opisthokonta;//;
    $qiime =~ s/cellular organisms;//;

So for something like C. elegans, the NCBI taxonomy is this:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Pseudocoelomata; Nematoda; Chromadorea; Rhabditida; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis

But this is the most useful pruned version that we'd want to output in PhyloSift:

Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis
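For illustration, here is a minimal sketch of the same pruning done as a filter on whole level names instead of a chain of substitutions. It assumes the lineage is a single string with levels separated by `;` (with or without spaces), as in the example above. The names `prune_lineage` and `%uninformative` are hypothetical and are not part of PhyloSift.

```perl
use strict;
use warnings;

# Levels to drop, copied from the substitution list earlier in this issue.
my %uninformative = map { $_ => 1 } (
    'cellular organisms',
    'Opisthokonta',
    'Fungi/Metazoa group',
    'Eumetazoa',
    'Bilateria',
    'Pseudocoelomata',
    'Coelomata',
    'Acoelomata',
    'Protostomia',
    'Deuterostomia',
    'Panarthropoda',
    'Annelida/Echiura/Pogonophora group',
);

# Hypothetical helper (not PhyloSift code): split the lineage on ';',
# trim whitespace, drop empty and uninformative names, rejoin with '; '.
sub prune_lineage {
    my ($lineage) = @_;
    my @kept = grep { $_ ne '' && !$uninformative{$_} }
               map  { my $name = $_; $name =~ s/^\s+|\s+$//g; $name }
               split /;/, $lineage;
    return join '; ', @kept;
}

my $ncbi = 'cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; '
         . 'Bilateria; Pseudocoelomata; Nematoda; Chromadorea; Rhabditida; '
         . 'Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis';

print prune_lineage($ncbi), "\n";
```

Comparing whole names means the order of the list does not matter, and a name can never match part of a longer level name.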

The first level should be Eukaryota, the second level should be the major group, and the third level should be the informative label that you'd probably put on pie charts. The rest of the hierarchy should be kept, but will be relevant only for a subset of users (e.g. those who want more specific details about lower-level taxonomy).
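As a rough sketch of how those three levels could be read off an already-pruned lineage, assuming the same `;`-separated string format as the example above (`top_levels` is a hypothetical helper, not an existing PhyloSift function):

```perl
use strict;
use warnings;

# Hypothetical helper (not PhyloSift code): return the first $n levels
# of an already-pruned, ';'-separated lineage string.
sub top_levels {
    my ($pruned, $n) = @_;
    my @levels = split /\s*;\s*/, $pruned;
    # Clamp so that short lineages do not yield undefined entries.
    my $last = $n < @levels ? $n - 1 : $#levels;
    return @levels[0 .. $last];
}

my $pruned = 'Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; '
           . 'Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis';

my ($domain, $major_group, $chart_label) = top_levels($pruned, 3);
print "$chart_label\n";    # prints "Nematoda"
```

The full pruned string stays available for users who want the lower levels; only the summary label is taken from position three.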

hollybik commented 12 years ago

Guillaume has opened more specific issues to cover the remaining problems with 18S data. The 18S packages are up and running on devel, and this will hopefully be pushed to master soon.