knights-lab / SHOGUN

SHallow shOtGUN profiler
GNU Affero General Public License v3.0
54 stars 19 forks source link

Question: levels on functional profiling? #17

Closed tanaes closed 6 years ago

tanaes commented 6 years ago

@bhillmann @GabeAl What exactly is the 'levels' specifying for predictive functional profiling? Often I get very sparse output tables for higher levels, e.g. genus.

bhillmann commented 6 years ago

The KO table is summarized at the lowest level. This is because the mapping from reference ids (eg. NC_0001.1) to Kegg IDs (eg. KO0001) are done using Prokka. In the rep82 database, every entry in the FASTA is a full genome from RefSeq version 82 with a strain level annotation. Each of the full reference genomes that are prokaryotes is run through PROKKA to identify which KEGG ids are within that genome. Therefore we have a mapping from taxonomy at the strain level to ko's as shown here. If your taxatable is not summarized at the strain level, then we must convert this ko-strain2ko to a higher level. This is done now by collapsing the one-to-many (a genus has potentially many strains in it) by an 80% plurality vote (each KEGG ID for a genus must belong in 80% of the strains).

The code for this collapsing is shown here, where csr is a matrix with rows being strain ids and columns being kegg ids (a very sparse matrix hence we are using a column sparse row data structure instead of a full pandas dataframe). Not that a genome may contain multiple copies of a gene or copy numbers. Therefore we also take the median gene count of all genomes within that level for the actual gene count.

def summarize_at_level(csr, names, kegg_ids, level):
    s = pd.DataFrame(sorted(names, key=names.get), columns=["names"])
    s['group'] = [";".join(_.split(';')[:level]) for _ in s['names']]
    indptr = [0]
    indices = []
    row_names = {}
    data = []
    for name, df in s.groupby("group"):
        _mat = csr[df.index, :].todense()
        # Threshold to over 80%
        _mat_thresh = np.divide((_mat > 0).sum(axis=0), _mat.shape[0]) > .8
        _mat_thresh = np.asarray(_mat_thresh).reshape(-1)
        if _mat_thresh.any():
            # Get the medians
            _medians = np.asarray(np.median(_mat[:, _mat_thresh], axis=0)).reshape(-1)
            indices.extend(np.where(_mat_thresh)[0])
            data.extend(_medians)
            row_names.setdefault(name, len(row_names))
            indptr.append(len(indices))
    return csr_matrix((data, indices, np.array(indptr)), dtype=np.int8), row_names
tanaes commented 6 years ago

So just to check my intuition then, if I have a taxonomy profile that has the following lines:

k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinomyces;s__Actinomyces_odontolyticus;t__Actinomyces_odontolyticus_ATCC_17982      14
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinomyces;s__Actinomyces_oris       8
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinomyces      1

the resulting 'strain' level functional profile will have only hits derived from the first line, the 'species' level functional profile will have only hits derived from the second line, and the 'genus' level only hits from the third line?