immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0
311 stars 66 forks source link

Error with geneUsage on IGH #31

Closed majorkazer closed 4 years ago

majorkazer commented 4 years ago

🐛 Bug

I am trying to run geneUsage on a MiXCR dataset that I imported consisting only of IGH chains. I can run it to generate a histogram of v gene usage (using "hs.ighv") but when I try to look at other genes, it only can pull the ighv genes.

To Reproduce

The following will produce a histogram of v genes. v_usage = geneUsage(BCRdata.coding$data, "hs.ighv", .norm = TRUE, .ambig = "mag") vis(v_usage, .plot = "hist")

However, the following attempts to generate d and j gene histograms just return the same v gene histogram. d_usage = geneUsage(BCRdata.coding$data, "hs.ighd", .norm = TRUE, .ambig = "mag") vis(d_usage, .plot = "hist") or j_usage = geneUsage(BCRdata.coding$data, "hs.ighj", .norm = TRUE, .ambig = "mag") vis(j_usage, .plot = "hist")

Looking at d_usage or j_usage themselves, they are actually just the same as v_usage, a table of the v genes in my dataset. So it seems like geneUsage is having trouble pulling the correct genes from my dataset.

Additional context

sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.3 LTS

other attached packages: [1] cowplot_1.0.0 openxlsx_4.1.3 immunarch_0.5.2 gridExtra_2.3
[5] data.table_1.12.6 dtplyr_1.0.0 dplyr_0.8.3 ggplot2_3.2.1

vadimnazarov commented 4 years ago

Hello @majorkazer ,

can you email us a sample of your data so we can reproduce the bug please? support@immunomind.io

Can you try running the code

j_usage = geneUsage(BCRdata.coding$data, "hs.ighj", .ambig = "inc")

and see what changed?

majorkazer commented 4 years ago

Hi @vadimnazarov,

Thanks for your help. My dataset is huge, so I have attached a downsampled dataset here.

I tried running the code you suggested, but still get the same result as described above.

BCRdata_downsampled.RData.zip

vadimnazarov commented 4 years ago

Hi @majorkazer ,

it seems that everything works OK on that version of immunarch:

screenshot

Do you have the tcR package installed? Can you remove both tcR and immunarch packages and install immunarch (only!) again, and try that again?

vadimnazarov commented 4 years ago

ping @majorkazer

majorkazer commented 4 years ago

Apologies for the delay.

I am running this analysis in a jupyter notebook runtime environment; I'm not sure if this may be part of the problem. I created a new notebook runtime environment and re-installed immunarch from github using:

devtools::install_url("https://github.com/immunomind/immunarch/raw/master/immunarch.tar.gz")

I still get the same result as I reported before (see attached image from notebook).

Screen Shot 2019-12-16 at 1 38 24 PM
vadimnazarov commented 4 years ago

Hi @majorkazer

We run the exact code you just wrote us on your data, and got correct results for all three gene segments. Can you please run the following code and check the output:

table(BCRdata.coding$data$J.name)
table(BCRdata.coding$data$D.name)
majorkazer commented 4 years ago

Hi,

Running either table function returns this: < table of extent 0 >.

However, if I run table(BCRdata.coding$data$clones$D.name), I get what we would expect:

Screen Shot 2019-12-16 at 2 46 13 PM

I don't have a way to test this locally as of right now, everything is being run on the runtime environment... I think there may be something weird going on with the environment and/or jupyter notebooks. Nevertheless, for the time being I can just use the outputs from table to generate my own frequency plots.

Thanks for your help with this! Happy to close this issue for the time being unless you want to explore further.

vadimnazarov commented 4 years ago

@majorkazer

after brainstorming with almost the whole team we managed to reproduce this tricky bug. Can you please try

j_usage = geneUsage(BCRdata.coding$data[[1]], "hs.ighj", .norm=T, .ambig="inc")
vis(j_usage, .plot = "hist")

Note the [[1]] for the data.

Alternatively, you can try

j_usage = geneUsage(BCRdata.coding$data[["clones"]], "hs.ighj", .norm=T, .ambig="inc")
vis(j_usage, .plot = "hist")
majorkazer commented 4 years ago

Success! I am able to generate the J and D frequency plots using your code.

So, out of curiosity, is this something weird with how MIXCR objects are brought in? Or a difference in the version of immunarch that I'm using?

Thanks for your help!

vadimnazarov commented 4 years ago

Perfect!

No, this is an issue in our code. geneUsage incorrectly processes the case when there is only one repertoire in the input list of repertoires. When you pass BCRdata.coding$data you pass a list with only one repertoire (i.e., the problematic case). When you pass BCRdata.coding$data[["clones"]] you pass this exact repertoire, and this is the case where everything works OK. Interestingly enough, when you pass a list with two or more repertoires, everything works OK, because we have a separate branch of code that processes the case with only one repertoire, and we definitely need to improve it.

@majorkazer , thank you for bringing this very tricky and non-obvious issue to us! We will fix it in our next release, and I will close the issue then.

We are conducting short interviews with specialists about pain points in AIRR analysis. We would like to learn more about the full pipeline from laboratory work to clinical insights. The interview is also a wonderful opportunity to discuss anything related to immune repertoire analysis and how to make it in immunarch and R. Should you find it appealing, here is a link to schedule our appointment: https://calendly.com/immunomind/interview

EugeneRumynskiy commented 4 years ago

The issue was solved.

talban14 commented 4 years ago

Sorry to open this back up but i'm having the same problem and the [[1]] does not fix it. I don't have tcR installed and am using immunarch directly in R. My data was loaded in from mixcr and it seems that everything else works great except this function.

specifying hs.trab. or hstrbj ect all returns the same object with all clones including IGH, IGK.

using the [[1]] selects clones but it seems to be doing it in a strange way. for example hs.trav will return any clone that has a V in the name IGHV3, TRAV11, IGKV2D. Same thing with hs.trab and hs.traj