Explanations Needed with Quantifications

amcomeau commented 6 months ago

Hello again! Thanks for the quick responsiveness on my previous question. Now that I have the software working, I'm playing around with the interpretation of MGS data coming out of our MicrobiomeHelper pipeline (https://github.com/LangilleLab/microbiome_helper/wiki) that we use for our own data + offer to clients of our core - if I can get KEGGCharter to work well, we might like to include this in our new MH ver2.0 that might be coming along in 2024. We already are developing a tool for visualizing the stratified output (JarrVis: https://github.com/dhwanidesai/JarrVis) using interactive Sankey diagrams and KEGGCharter could be a nice complement to that for the metabolic maps part, since we don't have a good visualizer for that now (plus we could write some nice scripts to convert our pipeline data to "talk" between the two).

That being said, I have a few questions regarding how the quantifications are handled (PS: there also seem to be some legacy references to --genomic-columns when I think you mean -qcol) - I checked the paper and your wiki here, but there are a few things I wanted to ask and thought would be nice to have them here for other people to see. I initially started playing around with my full data file for input (only using the first two samples to start) when I encountered an issue that the color scale in the "MT" mode didn't seem to match the input RPKM values and so I made a little mock-up example file instead to be able to test and ask the below questions:

EC  N1  N2
1.7.1.4 4000    2000
1.9.6.1 400 200
1.7.2.5 40  20

...for this small test example, I've simply restricted to 3 of the EC numbers corresponding to the Nitrogen Metabolism pathway, which were in our original dataset and are of particular interest to us. After I run this through KC (keggcharter -f TestInput-forKEGGCharter.txt -o KC_test_run -it 'MirallesMGS-CEMEX2018' --map-all -t 40 -ecc 'EC' -qcol 'N1,N2' -mm 00910), I get the following output in the KEGGCharter_results.tsv:

Function | N1 | N2 | Taxon (KEGGCharter) | KO (ec-column) | EC (ec-column) | KO (KEGGCharter) | EC number (KEGGCharter) -- | -- | -- | -- | -- | -- | -- | -- 1.7.1.4 | 4000 | 2000 | MirallesMGS-CEMEX2018 | K00361,K17877,K26138,K26139 | 1.7.1.4 | K00361,K17877,K26138,K26139 | 1.7.1.4 1.7.2.5 | 400 | 200 | MirallesMGS-CEMEX2018 | | | | 1.7.2.5 1.9.6.1 | 40 | 20 | MirallesMGS-CEMEX2018 | | | | 1.9.6.1 | | | | K02567 | 1.9.6.1 | | | | | | K04561 | 1.7.2.5 | |

..and then the following for the data_for_charting.tsv:

Function | N1 | N2 | Taxon (KEGGCharter) | KO (ec-column) | EC (ec-column) | KO (KEGGCharter) | EC number (KEGGCharter) -- | -- | -- | -- | -- | -- | -- | -- 1.7.1.4 | 4000 | 2000 | MirallesMGS-CEMEX2018 | K00361,K17877,K26138,K26139 | 1.7.1.4 | K00361 | 1.7.1.4 1.7.1.4 | 4000 | 2000 | MirallesMGS-CEMEX2018 | K00361,K17877,K26138,K26139 | 1.7.1.4 | K17877 | 1.7.1.4 1.7.1.4 | 4000 | 2000 | MirallesMGS-CEMEX2018 | K00361,K17877,K26138,K26139 | 1.7.1.4 | K26138 | 1.7.1.4 1.7.1.4 | 4000 | 2000 | MirallesMGS-CEMEX2018 | K00361,K17877,K26138,K26139 | 1.7.1.4 | K26139 | 1.7.1.4 1.7.2.5 | 400 | 200 | MirallesMGS-CEMEX2018 | | | | 1.7.2.5 1.9.6.1 | 40 | 20 | MirallesMGS-CEMEX2018 | | | | 1.9.6.1

...this test data then results in the following map:

differential_Nitrogen metabolism

Therefore, a few questions/problems are coming up here:

Why are there two lines (for K02567+4561) down below the lines of the actual data in the KEGGCharter_results file above, instead of having those lines in-line with their corresponding 1.7.2.5+1.9.6.1 entries in the two lines above that to which they seemingly belong? My actual file of the full dataset has 2880 lines of actual data, whereas the KC-processed file then has an extra 1130 lines after the main data in the same way as above (ie: only in the "KO (ec-column)" and "EC (ec-column)" columns).
In the data_for_charting, there are data cells missing info and I suspect this is related to your having a minimum score cutoff for inclusion, but I can't seem to find that written anywhere on the Github, nor in the paper - what is it and can it be modified? We express our MGS gene counts normalized as RPKM and so the scale of numbers can look comparatively very small and so could be a problem (would prefer not to have to transform them and then redraw all the scales). This is also then reflected in the final map which does not have the last two "low count" ECs included. Similarly, in my full dataset file processed by KC (inflated from the 2880 original lines to 5123), the "KO (ec-column)" and "EC (ec-column)" columns stop at line 3819 (as does the next KO (KC) line, but the EC (KC) column goes to the end of the 5123).
Also in the data_for_charting, which obviously then the basis for the coloring of the final maps, you can see the 1.7.1.4 is associated with 4 KOs, therefore the lines are repeated 4 times, however the mapping is then summing those supposed 4,000 counts into 16,000 which then massively overinflates the total counts. If that EC is associated with four KOs, then at a maximum those 4,000 counts could be distributed among those 4 KOs, but not present 4,000 times in each KO - I would think the idea here, given no a priori info about which KOs they might be, would have been to evenly divide those 4,000 counts by the number of KOs to make avg. 1,000 for each, so that the final summed value would match the original, no?
Finally, since many of us would be using the "MT" mode for MGS data (gene counts instead of expression read depth) and the values are then absolute/comparative and not differential (ie: there is no 0 being average), the use of white as a color in the middle of the scale is problematic as it looks like "no counts" or "gene not present" for those boxes that happen to fall in that middle range. Is there a way to change the color scale used?

It is a shame about these current issues, as I had expected KC to basically be a fully automated way to get the same map coloring as doing KEGGMapper manually...however, when I use the above mock/test data in KM manually, this is what I get:

OnlineKEGGMapper

That map then faithfully reproduces the right relationships between the 4,000/2,000 counts (max out their values in each instance of 1.7.1.4 instead of multiplying), and then displays the lower 400/200 + 40/20 mock numbers. Of course, there we can also switch the color scale to avoid the white.

At the moment, I'd be forced to go this way of inputting the values manually into the KM pop-up box online instead of resorting to using KC...however, it would be nice to see if we can get it working! I don't have a ton of maps to do with my current paper I'm working on (just a few of the C/N/P ones), but still nice to see for here and eventually being able to offer to all clients/users of our pipeline. Thanks in advance for answering this long post!

iquasere commented 6 months ago

Thank you for exposing these issues in such a structured way. It really helps with development, having users report on these issues. And there are indeed a lot of them. Since I am the sole developer, tester and maintainer of KEGGCharter, there are many problems that have slipped by me. I hope more users can report on them as you have, so KEGGCharter can be improved. To answer the problems one by one:

I hadn't seen this before. I'll test with the example you provided, and report on here. If it's reproduced, it will be fixed next version.
No minimum cutoff is present in KEGGCharter, although it isn't a bad idea. This stems from the bug reported in 1), KEGGCharter filters the input dataset to remove lines with no IDs, here it's considering the absence of value in the Function column as evidence that the row has no information to offer. Once 1) is fixed, this will not be important.
Since version 1.0, KEGGCharter has the parameter --distribute-quantification, to divide quantification values by the different K numbers. The division of quantification was the only option before, but now it allows to not do it, and is the default behavior I left it as default as it's easier to interpret directly from the input datasets. Maybe the default behavior should be dividing, nevertheless, the behavior depends on the parameter.
This logic makes sense. I'll look into, at minimum, put a more appropriate colormap for those maps. But will try to set it as an option, if I see many different options are available and appropriate, with the default always being a different colour scheme than the one currently implemented, as it is indeed not the best option.

I'm going to work on this either this week or the next. The next KEGGCharter will have this sorted out. The comment about the -qcols and -gcols is also correct, and I'll check in the documentation where it's still not updated.

iquasere commented 6 months ago

I have fixed everything, except for the quantification values, which KEGGCharter is reporting a bit differently from the result you showed.

differential_Nitrogen metabolism

Still trying to figure out why. Should be something with the mapping of IDs, since the quantification values are being distributed now.

amcomeau commented 6 months ago

This has been very quick! Sorry I missed the -dq parameter - this seems to be what I want it to do. The new color scale is better - yellow on white is sometimes a bit hard to see, but if it stays that darker shade of it then should be OK. Generally a good idea to avoid shades of green (with close yellow) due to Red-Green colorblindness. Some of the better scales are Red-to-Blue or Yellow-to-Blue. What do you mean above that the quantification values in KC are not working still? The pic you have above seems to be exactly matching the online KEGGMapper version I showed - maxing out at 4000 annd properly showing the 2000 value and then the smaller 20-400 values in darker green. What are you thinking is wrong?

iquasere commented 6 months ago

About the colormap: I was experimenting with viridis, which seems to be consensually a good colormap, avoiding the green-red colorblindness. However, I felt the darker spectrum of the colors makes it very hard to see the EC numbers/box labels.

differential_Nitrogen metabolism

Therefore, I decided to keep summer as default, and if people don't mind darker maps, they can now pick viridis or another colormap.

Actually, I would have prefered a lighter map like this but in the colors you mentioned, so as to avoid the colorblindness problem. If you know of such a matplotlib colormap, please let me know. I also tried tinkering with adjusting brightness, spent a couple hours on that, but couldn't make it work for both the map and the side label (only for the map).

About the erroneous quantification values, it was my human eye confusing a three color system with a two color system. Seems good, and thank you again so much for your help in testing KEGGCharter. If you do find another problem, please don't hesitate in contacting again.

I have just released KEGGCharter 1.1, which fixes everything mentioned here. Hope it is to your liking!

amcomeau commented 6 months ago

OK thanks again for your work! In general, I do like the viridis palette and it is quite recommended - it is a shame we can't make the labels in the boxes white, since that would solve the dark box problem (but then be a pain for the yellow ones). I was probably going to to over the labels in a drawing program for the final publication figures, if need be.

I will download the latest version and continue my testing - I'll open a new issue if anything else comes up!

amcomeau commented 6 months ago

PS: I'm trying the new version now (software --help says viridis is default)...one language thing I forgot to mention is that you need to change your references to "input" in the software and wiki to "impute" where you are talking about using a placeholder value (ie: imputation of quantifications or taxonomies). It can be confusing for people especially in the context of software as we are often looking for "inputs" in program packages, and you have input files, but you mean impute missing values in this context.

iquasere / KEGGCharter

Explanations Needed with Quantifications #14