scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License
884 stars 268 forks

TreeNode.from_taxonomy should take a pandas dataframe as input #2009

Open mortonjt opened 6 months ago

mortonjt commented 6 months ago

DataFrames are easier to obtain than a list of tuples, each pairing an ID with a list of lineage names (which is the current input of TreeNode.from_taxonomy).

The current workaround (using the MetaPhlAn4 database as a reference) is as follows:


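import pandas as pd
from skbio import TreeNode

# Column 0 holds the IDs (used as the index); column 1 holds the '|'-delimited lineage string.
taxonomy = pd.read_table('MetaPhlAn4/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt', header=None, index_col=0)
# Split each lineage string into one column per rank.
taxonomy = pd.DataFrame(list(taxonomy[1].apply(lambda x: x.split('|')).values), index=taxonomy.index)
taxonomy.columns = ['k', 'p', 'c', 'o', 'f', 'g', 's']
# Build the (ID, lineage) tuples that from_taxonomy currently expects.
lineages = [(tup.Index, [tup.k, tup.p, tup.c, tup.o, tup.f, tup.g, tup.s]) for tup in taxonomy.itertuples(index=True)]
tree = TreeNode.from_taxonomy(lineages)
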
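
To make the request concrete, the conversion in the workaround could be wrapped in a small helper. The following is only a sketch: `lineages_from_dataframe` is a hypothetical name, not part of scikit-bio, and it assumes the DataFrame index holds the tip IDs and the columns hold ranks ordered from high to low. The toy table is made up for illustration.

```python
import pandas as pd


def lineages_from_dataframe(df):
    """Turn a taxonomy DataFrame into (ID, lineage) tuples.

    Hypothetical helper, not part of scikit-bio. Assumes the index holds
    the tip IDs and the columns hold ranks ordered from high to low.
    """
    lineages = []
    # name=None yields plain tuples: (index, col_0, col_1, ...)
    for tip_id, *ranks in df.itertuples(index=True, name=None):
        # Skip missing ranks so short lineages do not carry None/NaN.
        lineages.append((tip_id, [r for r in ranks if pd.notna(r)]))
    return lineages


# Toy taxonomy table (illustrative values only).
taxonomy = pd.DataFrame(
    [["k__Bacteria", "p__Firmicutes", "c__Bacilli"],
     ["k__Bacteria", "p__Proteobacteria", None]],
    index=["SGB1", "SGB2"],
    columns=["k", "p", "c"],
)
lineages = lineages_from_dataframe(taxonomy)
```

The resulting list has the same shape as `lineages` in the workaround, so it can be passed to TreeNode.from_taxonomy(lineages) unchanged.
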
wasade commented 6 months ago

It would also be nice for from_taxonomy to detect the separator used and split the lineage strings accordingly.
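
One way the separator detection could work is a simple count over a fixed set of candidates. This is a rough sketch under assumptions: `detect_separator` and `split_lineage` are hypothetical helpers, the candidate list is a guess at common delimiters, and scikit-bio does not currently expose anything like this.

```python
from collections import Counter


def detect_separator(lineage_strings, candidates=("|", ";", "\t", ",")):
    """Guess the rank separator as the most frequent candidate character.

    Hypothetical helper; the candidate list is an assumption.
    """
    counts = Counter()
    for lineage in lineage_strings:
        for sep in candidates:
            counts[sep] += lineage.count(sep)
    # Ties resolve to the earliest entry in `candidates`.
    best = max(candidates, key=lambda sep: counts[sep])
    if counts[best] == 0:
        raise ValueError("no candidate separator found in the lineage strings")
    return best


def split_lineage(lineage, sep):
    """Split one lineage string and trim whitespace around each name."""
    return [name.strip() for name in lineage.split(sep)]


mpa_style = ["k__Bacteria|p__Firmicutes", "k__Archaea|p__Euryarchaeota"]
sep = detect_separator(mpa_style)
ranks = split_lineage(mpa_style[0], sep)
```

A frequency count is easy to reason about but can be fooled when taxon names themselves contain a candidate character, so an explicit separator argument would likely still be needed as an override.
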