micom-dev / micom

Python package to study microbial communities using metabolic modeling.
https://micom-dev.github.io/micom
Apache License 2.0
94 stars 18 forks source link

Possible Bug in `com.knockout_taxa()` #172

Open alejandrocs98 opened 7 months ago

alejandrocs98 commented 7 months ago

Hi! I really like MICOM! But I am having a problem when trying to perform the Taxa Knockout Analysis

Problem description

I want to calculate the log fold change of the growth rates when performing the Taxa Knockouts. Currently, one can obtain one of three possible outputs when using com.knockout_taxa().

  1. using method='raw' -> new: the new growth rates for the taxa in the community without the knocked taxon.
  2. using method=change` -> new - old: the absolute change in the growth rates for the new community without the knocked taxon with respect to the original community.
  3. using method=relative change` -> (new - old)/old: the relative change in the growth rates for the new community without the knocked taxon with respect to the original community.

What I actually want for the log fold change is something like log(new) - log(old) (or log(new/old), which is the same). So I attempted two things for trying to achieve this:

  1. Run com.knockout_taxa() with method equal to raw and change, and then calculate old (the growth rates for the original community) as follows old = change - new (now that I know both new and old I can easily calculate the log fold change using either of the expressions above). The problem is that, when I checked the values for old for all my samples, I consistently found some troubling values for some of the knocked taxa (always in one of the first three taxa in my matrix).
com = load_pickle('mouse_01'.pickle)
diet = load_qiime_medium("vmh_high_fiber.qza")
com.medium = diet.flux
ko_raw = com.knockout_taxa(fraction=0.6, method='raw', diag=False)
ko_change = com.knockout_taxa(fraction=0.6, method='change', diag=False)
old = ko_raw - ko_change
print(old)
                        B_caccae  B_cellulosilyticus_WH2  B_ovatus  \
B_caccae                     NaN                0.045237 -0.023992   
B_cellulosilyticus_WH2  0.000556                     NaN  0.000966   
B_ovatus                0.000556                0.037356       NaN   
B_thetaiotaomicron      0.000556                0.037356  0.000986   
B_uniformis             0.000556                0.037356  0.000986   
B_vulgatus              0.000556                0.037356  0.000986   
C_aerofaciens           0.000556                0.037356  0.000986   
C_scindens              0.000556                0.037356  0.000986   
C_spiroforme            0.000556                0.037356  0.000986   
P_distasonis            0.000556                0.037356  0.000986   
R_obeum                 0.000556                0.037356  0.000986   

                        B_thetaiotaomicron  B_uniformis  B_vulgatus  \
B_caccae                          0.066630    -0.002823    0.106982   
B_cellulosilyticus_WH2            0.067358     0.000809    0.106279   
B_ovatus                          0.067358     0.000795    0.106279   
B_thetaiotaomicron                     NaN     0.000795    0.106279   
B_uniformis                       0.067358          NaN    0.106279   
B_vulgatus                        0.067358     0.000795         NaN   
C_aerofaciens                     0.067358     0.000795    0.106279   
C_scindens                        0.067358     0.000795    0.106279   
C_spiroforme                      0.067358     0.000795    0.106279   
P_distasonis                      0.067358     0.000795    0.106279   
R_obeum                           0.067358     0.000795    0.106279   
...
C_scindens              0.001470  
C_spiroforme            0.001470  
P_distasonis            0.001470  
R_obeum                      NaN  

As you can see, the results for old should be a matrix where the values for all the rows in a column should be the same, and it it true for most of the taxa but it fails for the first 3 taxa. This happens consistently across my 15 samples and even when using different media.

  1. So what I tried to do to verify what was happening was to clone the repo and modify it locally, so that when I ran com.knockout_taxa() it returns me not only the new growth rates (when using method=raw) but also the old values. The and, something that is also strange for me is that the values I am obtaining are quite different to the previous calculation (1.).
def knockout_taxa(community, taxa, fraction, method, progress, diag=True):
    """Knockout a taxon from the community."""
    with community as com:
        check_modification(com)
        min_growth = _format_min_growth(0.0, com.taxa)
        _apply_min_growth(com, min_growth)

        com.objective = com.scale * com.variables.community_objective
        community_min_growth = (
            optimize_with_retry(com, "could not get community growth rate.") / com.scale
        )
        regularize_l2_norm(com, fraction * community_min_growth)
        old = com.optimize().members["growth_rate"]
        results = []

        iter = track(taxa, description="Knockouts") if progress else taxa
        for sp in iter:
            with com:
                logger.info("getting growth rates for %s knockout." % sp)
                [
                    r.knock_out()
                    for r in com.reactions.query(lambda ri: ri.community_id == sp)
                ]

                sol = optimize_with_fraction(com, fraction)
                new = sol.members["growth_rate"]
                if "change" in method:
                    new = new - old
                if "relative" in method:
                    new /= old
                results.append(new)

        old = old.drop("medium", axis=1)
        ko = pd.DataFrame(results, index=taxa).drop("medium", axis=1)
        ko = ko.loc[ko.index.sort_values(), ko.columns.sort_values()]
        if not diag:
            np.fill_diagonal(ko.values, np.NaN)

        return ko, old
com = load_pickle('mouse_01'.pickle)
diet = load_qiime_medium("vmh_high_fiber.qza")
com.medium = diet.flux
new, old = com.knockout_taxa(fraction=0.6, method='raw', diag=False)
print(old)
                                          growth_rate
B_caccae                          0.0135695707228535
B_cellulosilyticus_WH2        0.0272505030167579
B_ovatus                                  0.0114618034691539
B_thetaiotaomicron                0.119902654224808
B_uniformis                       0.0132031895100882
B_vulgatus                        0.105075269656302
C_aerofaciens                         0.00680013487305695
C_scindens                        0.00386220655742489
C_spiroforme                          0.0250118398731469
P_distasonis                          0.058633920577937
R_obeum                       0.0226188024707752

Context

``` Package Information ------------------- micom 0.33.2 Dependency Information ---------------------- cobra 0.29.0 highspy missing jinja2 3.1.3 osqp 0.6.5 scikit-learn 1.2.2 scipy 1.11.4 symengine 0.11.0 Build Tools Information ----------------------- pip 23.3 setuptools 68.0.0 wheel 0.41.2 Platform Information -------------------- Linux 6.5.0-26-generic-x86_64 CPython 3.10.13 ```
cdiener commented 7 months ago

Hi, thanks for the report. I was not able to reproduce it on a smaller example. The "old" values are simply the growth rates from a cooperative tradeoff run, so they should be the same as running com.cooperative_tradeoff(fraction=0.6). It looks like in your case 2 successive runs yield different growth rates which is buggy for sure. Maybe you could verify if that is the case.

What solver are you using (output of com.solver)? It looks like you don't have HIGHSpy installed which might mean you are still running the old OSQP solver which can have problems with accuracy. So updating to the latest MICOM version might help already, but it might also be something else.

alejandrocs98 commented 7 months ago

Hi! Thank you for your response. I was running the simulations on my home institution's HPC cluster and then doing the further analysis on my PC. I decided to run everything on my PC and the consistency problem between runs was solved. Also, here is the output for com.solver (I am using GUROBI as a solver):

<optlang.gurobi_interface.Model at 0x706ed8ba3ee0>

Nevertheless, now I have a new doubt. Now when I perform exactly what I mentioned in my initial issue report, the results for both methods (old = change - new, and modyfing the code for directly retrieving the old values) I get exactly the same results, which is great! However, I ran com.cooperative_tradeoff(fraction=0.6) to check if I retrieved the same results and got different ones. Also, I checked the results that I got from running the growth workflow for that sample and also obtained different results. The repo in which I have been working in this project is a mess and I haven't push anything to the remote in a while, but I created a test repo with the data and code (in a jupyter notebook) that I have used to test this : https://github.com/alejandrocs98/my-micom-test.

cdiener commented 7 months ago

So the output from your KO modification and the growth results you loaded look the same to me. The differences there are within the solver tolerance which is 1e-6 but in practice can be slightly larger for individual variables. So a difference between 1e-20 and 1e-14 does not mean much, those are both essentially zero.

Your first cooperative tradeoff definitely looks different though. I will try to investigate a bit with the data you uploaded. Thanks for that!

alejandrocs98 commented 7 months ago

It makes perfect sense to me! Thank you so much :)