mortonjt opened 3 years ago
Alright, as promised, here is the solution for turning these function tables into microbe x gene counts:

    from scipy.sparse import coo_matrix
    import pandas as pd
    import numpy as np


    def get_microbe_gene_table(func_table):
        """ Obtain a genes per microbe table.

        Parameters
        ----------
        func_table : biom.Table
            Stratified biom table output from woltka

        Returns
        -------
        pd.DataFrame
            Table of microbes by gene counts
        """
        # observation IDs are formatted as 'OGU|KEGG'
        func_ids = func_table.ids(axis='observation')
        func_df = pd.DataFrame(list(map(lambda x: x.split('|'), func_ids)))
        func_df.columns = ['OGU', 'KEGG']
        func_df['count'] = 1
        # map each OGU and KEGG label to an integer row / column index
        ogus = list(set(func_df['OGU']))
        ogu_lookup = pd.Series(np.arange(0, len(ogus)), ogus)
        keggs = list(set(func_df['KEGG']))
        kegg_lookup = pd.Series(np.arange(0, len(keggs)), keggs)
        func_df['OGU_id'] = func_df['OGU'].apply(lambda x: ogu_lookup.loc[x]).astype(np.int64)
        func_df['KEGG_id'] = func_df['KEGG'].apply(lambda x: kegg_lookup.loc[x]).astype(np.int64)
        # convert to sparse matrix for convenience
        c, i, j = func_df['count'].values, func_df['OGU_id'].values, func_df['KEGG_id'].values
        data = coo_matrix((c, (i, j)))
        # pandas conversion optional. Can convert to biom if needed
        ko_ogu = pd.DataFrame(data.toarray(), index=ogus, columns=keggs)
        return ko_ogu

@mortonjt Thank you for sharing your thoughts and code! It would be great if you could clarify what a microbe x gene counts table is. From reading your code, I have the following impression. Is my understanding correct?
Before:
FeatureID | S01 | S02 | S03 |
---|---|---|---|
Ecoli\|K0123 | 2 | 0 | 5 |
Ecoli\|K0456 | 13 | 7 | 4 |
Strep\|K0123 | 0 | 3 | 8 |
After:
OGU | KEGG | S01 | S02 | S03 |
---|---|---|---|---|
Ecoli | K0123 | 2 | 0 | 5 |
Ecoli | K0456 | 13 | 7 | 4 |
Strep | K0123 | 0 | 3 | 8 |
Also pinging @droush because this question may be relevant.
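
The "After" layout above can be sketched with plain pandas. This is only an illustration on a toy table that mirrors the "Before" example, not woltka functionality: `split_feature_ids` is a hypothetical helper, and it assumes every feature ID contains the `|` separator.

```python
import pandas as pd

# Toy stratified table; IDs and counts mirror the "Before" example.
before = pd.DataFrame(
    {"S01": [2, 13, 0], "S02": [0, 7, 3], "S03": [5, 4, 8]},
    index=["Ecoli|K0123", "Ecoli|K0456", "Strep|K0123"],
)


def split_feature_ids(table, sep="|"):
    """Hypothetical helper: turn 'OGU|KEGG' row IDs into a two-level index.

    Assumes every row ID contains `sep`; only the first occurrence is split.
    """
    parts = [tuple(fid.split(sep, 1)) for fid in table.index]
    out = table.copy()
    out.index = pd.MultiIndex.from_tuples(parts, names=["OGU", "KEGG"])
    return out


after = split_feature_ids(before)
# reset_index() exposes OGU and KEGG as ordinary columns next to the samples
print(after.reset_index())
```

The sample columns are kept as they are; only the row labels change.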
Not quite. The code that I provided drops the sample information, and it also keeps track of gene copy number per microbe. So it'll look something like:
OGU | K0123 | K0456 | K0123 |
---|---|---|---|
Ecoli | 1 | 0 | 0 |
Strep | 0 | 1 | 0 |
Cdiff | 0 | 0 | 1 |
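
For a quick look at this kind of microbe x gene table without scipy, `pd.crosstab` on the split IDs gives a similar result. This is a minimal sketch under assumptions: the feature IDs below are made up for illustration, and each cell simply counts how many feature IDs carry that OGU/KEGG pair.

```python
import pandas as pd

# Made-up stratified feature IDs in 'OGU|KEGG' form (illustrative only).
feature_ids = ["Ecoli|K0123", "Ecoli|K0456", "Strep|K0123"]

# One row per feature ID, split into its OGU and KEGG parts.
pairs = pd.DataFrame(
    [fid.split("|", 1) for fid in feature_ids],
    columns=["OGU", "KEGG"],
)

# Rows are OGUs, columns are KEGG IDs, cells count matching feature IDs.
# Sample information is not used here, so it does not appear in the result.
microbe_by_gene = pd.crosstab(pairs["OGU"], pairs["KEGG"])
print(microbe_by_gene)
```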
Another useful format would be a sparse COO format, where the output would have 4 columns, like:
OGU | KEGG | Sample | Counts |
---|---|---|---|
Ecoli | K0123 | S01 | 2 |
Ecoli | K0123 | S03 | 5 |
Ecoli | K0456 | S01 | 13 |
Ecoli | K0456 | S02 | 7 |
Ecoli | K0456 | S03 | 4 |
Strep | K0123 | S02 | 3 |
Strep | K0123 | S03 | 8 |
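
One way to get this four-column layout is to melt the stratified table and drop the zero entries. The sketch below uses plain pandas on a toy table whose counts are taken from the "Before" example earlier in the thread. It assumes every feature ID contains a `|`, and the variable names are illustrative.

```python
import pandas as pd

# Toy stratified table; IDs and counts mirror the "Before" example.
before = pd.DataFrame(
    {"S01": [2, 13, 0], "S02": [0, 7, 3], "S03": [5, 4, 8]},
    index=["Ecoli|K0123", "Ecoli|K0456", "Strep|K0123"],
)

# Wide -> long: one row per (feature, sample) pair.
long = (
    before.rename_axis("FeatureID")
    .reset_index()
    .melt(id_vars="FeatureID", var_name="Sample", value_name="Counts")
)

# Keep only nonzero entries, as in a sparse COO representation.
long = long[long["Counts"] > 0]

# Split 'OGU|KEGG' into two columns, aligned on the filtered rows.
split = pd.DataFrame(
    [fid.split("|", 1) for fid in long["FeatureID"]],
    columns=["OGU", "KEGG"],
    index=long.index,
)

coo = (
    pd.concat([split, long[["Sample", "Counts"]]], axis=1)
    .sort_values(["OGU", "KEGG", "Sample"])
    .reset_index(drop=True)
)
print(coo)
```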
@mortonjt Very good idea! I edited your response a bit to make the tables render correctly.
Right now, the biom table OGU ids consist of both taxa and KEGG ids. It would be nice if there were convenience functions to allow for conversions to gene tables or tensors -- it is nontrivial to implement this from scratch.
I'm pretty close to a working solution and will post it on this thread shortly.
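
To make the tensor idea concrete, here is a rough sketch of a dense OGU x KEGG x sample array built with NumPy. It is not an existing woltka or biom function. It assumes IDs of the form `OGU|KEGG`, uses a small made-up table, and a dense array like this would only be practical for small inputs.

```python
import numpy as np
import pandas as pd

# Small made-up stratified table with 'OGU|KEGG' row IDs.
before = pd.DataFrame(
    {"S01": [2, 13, 0], "S02": [0, 7, 3], "S03": [5, 4, 8]},
    index=["Ecoli|K0123", "Ecoli|K0456", "Strep|K0123"],
)

pairs = [fid.split("|", 1) for fid in before.index]
ogus = sorted({ogu for ogu, _ in pairs})    # axis 0 labels
keggs = sorted({kegg for _, kegg in pairs})  # axis 1 labels
samples = list(before.columns)               # axis 2 labels

# Dense tensor of shape (n_ogus, n_keggs, n_samples), zero where a pair is absent.
tensor = np.zeros((len(ogus), len(keggs), len(samples)), dtype=np.int64)
for (ogu, kegg), row in zip(pairs, before.to_numpy()):
    tensor[ogus.index(ogu), keggs.index(kegg), :] += row

print(tensor.shape)
```

Summing over the last axis collapses the samples and leaves a per-microbe, per-gene total.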