mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

Change file input and output from sanitized organism name to (sanitized) organism ID #48

Open JamesRH opened 11 years ago

JamesRH commented 11 years ago

Besides the changes we discussed in replaceOrgWithAbbrev.py, other files use organism names in their output or input.

In src/makeCoreClusterAnalysisTree.py, both the input and the output use sanitized organism names:
"The input MUST be a Newick file with organism IDs REPLACED with their names"
"WARNING: Organism name %s in the database was not found in the provided tree. It will be deleted!!\n" %(collist[ii]))

In src/db_getBlastResultsBetweenSpecificGenes.py, the description and the header comment conflict about what the script does. The description says:
description = "Given list of genes to match, returns a list of BLAST results between genes in the list only"

while the header comment says:
Provide a list of organisms to match [can match any portion of the organism so if you give it just "mazei" it will return to you a list of Methanosarcina mazei]

I think this comes from duplication between these scripts:
src/db_getBlastResultsBetweenSpecificGenes.py
src/db_getBlastResultsBetweenSpecificOrganisms.py

Other scripts to check for whether the organism name or the ID is used:
db_findClustersByOrganismList.py
db_getOrganismsInClusterRun.py
db_getOrganismsInCluster.py
db_addOrganismNameToTable.py
db_bidirectionalBestHits.py
db_TBlastN_wrapper.py

We discussed keeping the library functions, but another way to find the dependencies is to see what calls these library functions:
lib/TreeFuncs.py: '''Parse a node name into an organism ID.
lib/ClusterFuncs.py: Given an organism name, return the ID for that organism name.
lib/CoreGeneFunctions.py: The return object is a list of (runid, clusterid, organism) tuples sorted by run ID then by cluster ID.'''
lib/CoreGeneFunctions.py:def findGenesByOrganismList(orglist
lib/CoreGeneFunctions.py: The organisms in "orglist" are considered the "ingroup"
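
One rough way to do that search is a plain text scan, sketched below. This is not part of ITEP: `find_callers` is a hypothetical helper, and the only function name taken from the list above is `findGenesByOrganismList`. It matches text rather than parsing Python, so a call mentioned in a comment or string would also be reported.

```python
import re
from pathlib import Path


def find_callers(root, function_names):
    """Map each function name to the .py files under `root` that appear to call it.

    This is a text search, not a parser: it looks for "name(" and skips the
    "def name(" line that defines the function.
    """
    patterns = {
        name: re.compile(r"(?<!def )\b" + re.escape(name) + r"\s*\(")
        for name in function_names
    }
    callers = {name: [] for name in function_names}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(errors="replace")
        for name, pattern in patterns.items():
            if pattern.search(text):
                # Store paths relative to the search root, with "/" separators.
                callers[name].append(path.relative_to(root).as_posix())
    return callers


# Example (run from the repository root):
#   find_callers(".", ["findGenesByOrganismList"])
```

Any script that shows up in the result for a name-based library function is a candidate for switching to organism IDs.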

JamesRH commented 11 years ago

Currently db_getOrganismsInClusterRun.py and db_getOrganismsInCluster.py return unsanitized organism names.

mattb112885 commented 11 years ago

This is a mess but I'll take this suggestion based on our discussion. Then we will just have one script that converts to user-readable IDs at the end, correct? (Also, when we're building figures we should have the option to do that automatically since they're made to be looked at and not computed on)
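
A minimal sketch of what that single conversion step could look like, assuming organism IDs are plain strings that appear verbatim in the output (for example as Newick leaf labels). The function names, the sanitization rule, and the IDs in the example are all made up for illustration and are not taken from the ITEP code.

```python
import re


def sanitize(name):
    """Collapse runs of non-alphanumeric characters into underscores (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def ids_to_names(text, id_to_name):
    """Replace every organism ID found in `text` with its sanitized name."""
    if not id_to_name:
        return text
    # Try longer IDs first so that "1111.10" is never matched as "1111.1".
    ids = sorted(id_to_name, key=len, reverse=True)
    # The lookarounds keep an ID from matching inside a longer number or word.
    pattern = re.compile(
        r"(?<![\w.])(?:" + "|".join(re.escape(i) for i in ids) + r")(?!\w)"
    )
    return pattern.sub(lambda m: sanitize(id_to_name[m.group(0)]), text)


# Hypothetical IDs; everything upstream would compute on these.
id_to_name = {
    "1111.1": "Methanosarcina mazei",
    "2222.1": "Methanosarcina acetivorans",
}
tree = "(1111.1:0.1,2222.1:0.2);"
readable = ids_to_names(tree, id_to_name)
print(readable)
```

Keeping the conversion in one place means every other script can read and write IDs only, and the figure-building scripts could call the same function as an optional last step.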

JamesRH commented 11 years ago

That sounds good to me. I'll commit my multi-format parser /lib/ function if it is not already in the push request.

James H
