immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0

Unknown group "NA" #86

Open decenwang opened 4 years ago

decenwang commented 4 years ago

Hi Dr. Nazarov,

When I plot with grouping (.by = "group"), an unknown group "NA" appears in my graph; please see the attachment. I checked my metadata, and there is no grouping information other than the three groups. I also checked the length of the metadata to make sure it matches the data. If possible, please help check it out. Thanks a lot!

One more question: how can I turn off the p-value bars when I plot the diversity by groups, for example on a graph like your Chao1 plot at https://immunarch.com/articles/web_only/v6_diversity.html?

Best,

Decen

imm.data_count_abundancea.txt metadata new.txt

imm data_count_abundance2

Also, another problem. When I filtered the sequences (aa sequences) with prop >= 0.005, I got another dataset. When I run imm.data_0.005_freq_chao1 <- repDiversity(imm.data_0.005_freq$data, "chao1"), I get an error, shown in the following screenshot: [Image 1]. In general, each individual sample contains several sequences. But if I select the top 10 aa sequences to compose a new dataset and run the same command, there is no error. I could not work out what is wrong from the error message. There is one difference between the "imm.data_0.005_freq" and "imm.data_top10" datasets: some samples in "imm.data_0.005_freq" contain zero sequences (sometimes even the top 1 sequence is below 0.005), which cannot happen in the latter. Thanks!
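One way to test whether the zero-sequence samples are what triggers the error is to drop them before computing diversity. The following is a minimal base-R sketch under stated assumptions: `toy_data` is an invented stand-in for `imm.data$data` (a named list of per-sample tables with `Clones` and `Proportion` columns), and the thread does not confirm that empty samples are the actual cause.

```r
# Toy stand-in for imm.data$data: a named list of per-sample clonotype tables.
# The values and sequence labels are invented for illustration only.
toy_data <- list(
  S1 = data.frame(CDR3.aa = c("CASSLGF", "CASSPGF", "CASRTGF"),
                  Clones = c(90, 9, 1), stringsAsFactors = FALSE),
  S2 = data.frame(CDR3.aa = sprintf("clonotype_%04d", 1:1000),
                  Clones = rep(1, 1000), stringsAsFactors = FALSE)
)

# Compute per-sample proportions from the clone counts.
toy_data <- lapply(toy_data, function(df) {
  df$Proportion <- df$Clones / sum(df$Clones)
  df
})

# Keep only clonotypes with Proportion >= 0.005 in each sample.
filtered <- lapply(toy_data, function(df) {
  df[df$Proportion >= 0.005, , drop = FALSE]
})

# Count the remaining rows and separate empty samples from non-empty ones.
n_left <- vapply(filtered, nrow, integer(1))
empty_samples <- names(n_left)[n_left == 0]
non_empty <- filtered[n_left > 0]
```

If the error disappears when only `non_empty` is passed on to the diversity step, the empty samples are a likely culprit; the matching metadata rows would then need to be dropped as well.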

decenwang commented 4 years ago

I also did not find the solution in issue 7 (https://github.com/immunomind/immunarch/issues/7). The command I used is very similar to the one you posted. Although I could re-plot with ggplot2, I think the vis() function is also a good choice. I don't know where the bug is. Thanks!

vadimnazarov commented 4 years ago

@decenwang

Thank you SO much for such a detailed issue ticket, with all the plots and data attached! I hope you won't mind if we use this ticket as an example of a good GitHub issue :-)

We will have a bug-fixing sprint soon, and we will be able to resolve this and other issues. Thank you so much for your patience, and let our team know if you encounter any other issues!

I checked your metadata and I see the following rows:

C39 5.19    Medium  PCa 5.1 5.5 16.1    129 2.3 1.9 0.5 3.8 1.211   67.895  0.605   78.079  8.575   0.0665  77
C44 8.3 Medium  PCa                                                         64
C45 8.72    Medium  PCa 6.15    5.35    15.8    160 4.1 1.4 0.3 4.667   2.929   114.286 0.879   140.571 9.925   0.062   75

The C44 sample is missing some metadata fields. Can you remove such samples (C44, L20, L33) from the data and the metadata, and try again? I see that these samples have their metadata field filled ("PCa"), but R may still behave oddly when fields are missing. This small check will greatly help us fix the bug faster.
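A quick check along these lines can be sketched in base R. Everything below is illustrative: `meta` is a toy table, the `PSA` column name and the `data_names` vector (standing in for `names(imm.data$data)`) are hypothetical, and comparing sample names between data and metadata is an extra assumption about where an "NA" group could come from, not something confirmed in this thread.

```r
# Toy metadata; "Sample" holds the sample names, the other columns are invented.
meta <- data.frame(
  Sample = c("C39", "C44", "C45"),
  Status = c("PCa", "PCa", "PCa"),
  PSA    = c("5.1", "", "6.15"),  # hypothetical column; a blank cell read as ""
  stringsAsFactors = FALSE
)

# Stand-in for names(imm.data$data).
data_names <- c("C39", "C44", "C45", "L20")

# Treat NA and blank or whitespace-only strings as missing.
is_missing <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Samples with at least one missing metadata field.
row_has_missing <- Reduce(`|`, lapply(meta, is_missing))
incomplete <- meta$Sample[row_has_missing]

# Samples present on one side only.
not_in_meta <- setdiff(data_names, meta$Sample)
not_in_data <- setdiff(meta$Sample, data_names)
```

Here `incomplete` lists the rows to fill in or remove, while `not_in_meta` and `not_in_data` list samples that have no counterpart on the other side.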

decenwang commented 4 years ago

Hi Dr. Nazarov,

That's right. I noticed that too, and I filled in the blanks with similar data for the three samples, but the problem is still there. What about the other two questions: turning off the p-value bars at the top, and the plot trouble described above?

Many thanks!

Decen

vadimnazarov commented 4 years ago

I see, thank you, Decen! Can you provide the code that you executed to plot the length distribution graph, please?

RE: p-value. Sadly, we haven't implemented it yet. However, this is an important suggestion, and we will implement it next. Thank you! Does it obstruct or stop you from doing something? I'm very interested in learning why you think it's important to remove them.

RE: plot trouble. Will look into it more!

decenwang commented 4 years ago

Hi Dr. Nazarov,

Thanks a lot for the quick reply. Yes, the code for the length distribution is:

imm.data_0.001_freq_distribution_aa <- repExplore(imm.data_0.001_freq$data, .method = "len", .col = "aa")
vis(imm.data_0.001_freq_distribution_aa, .by = "Status", .meta = imm.data_0.001_freq$meta)

In general, if I compare two groups in length or gene usage, it is OK; the p-value bars cover only a small area. But if I compare more than five groups (for example .by = c("Status", "blood pressure"), this kind of cross-grouping), the p-value bars at the top "eat" more space, so the actual graph/histogram/bars are squeezed into a corner, which makes the figure meaningless for a reviewer. No offence. Anyway, immunarch is really easy to use and friendly, and it saves us a lot of time. Of course I can re-plot with ggplot2 and get some gorgeous graphs, but rearranging the dataset takes more time.

There are some points I am not sure about. I also noticed that, for the V and J usage statistics, the argument .quant = "count" might not work: the graph doesn't change. And when I output the results to CSV files, the statistics for "segment" and "allele" are identical; nothing differs between the two files.

In fact, I found there are no V-J combination statistics so far. In some papers, the authors mention that certain diseases show a bias in V-J combinations. Hopefully you will add this to the module.

Lastly, I use the top() function to select the most frequent sequences and generate a new dataset. However, I found that the "Proportion" in each sample table (immdata_top100$data$case$Proportion) is the same as in the original, unfiltered dataset, whereas if I analyze the new dataset, the proportion is re-calculated and displayed in the graph. It is amazing! Surely, this removes the heavy tail of sequences and keeps the top ones to show the real profile. But the output CSV file should contain the re-calculated Proportion, not the original. Do you agree? I remember you mentioned "re-calculating" the data somewhere, but I cannot find it again.
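The difference between the stored and the re-calculated Proportion can be illustrated with a small base-R sketch. It does not call top() itself: `toy_data` is an invented stand-in for `immdata$data`, and the top-N selection is imitated with `order()` and `head()`, so treat it as an illustration of the arithmetic rather than of what immunarch does internally.

```r
# Toy stand-in for immdata$data with one sample named "case".
toy_data <- list(
  case = data.frame(CDR3.aa = c("CASSLGF", "CASSPGF", "CASRTGF", "CASSQGF"),
                    Clones = c(50, 30, 15, 5), stringsAsFactors = FALSE)
)
toy_data$case$Proportion <- toy_data$case$Clones / sum(toy_data$case$Clones)

# Imitate a top-2 selection: keep the two most abundant clonotypes.
# The Proportion column is carried over unchanged from the full sample.
top2 <- lapply(toy_data, function(df) {
  head(df[order(-df$Clones), , drop = FALSE], 2)
})

# Re-calculate Proportion within the reduced sample so it sums to 1.
renorm <- lapply(top2, function(df) {
  df$Proportion <- df$Clones / sum(df$Clones)
  df
})
```

In this toy case `top2$case$Proportion` still holds the original values (0.5 and 0.3), while `renorm$case$Proportion` holds the re-calculated ones (0.625 and 0.375). Writing `renorm$case` out with `write.csv()` would give a CSV with the re-calculated column.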

I cannot guarantee that the observations above are correct. If possible, please re-check. Many thanks! Looking forward to an R script to analyze the data, like mixcr! Best,

Decen

decenwang commented 4 years ago

plot_zoom_png

I made this figure on purpose as an example: the p-value bars at the top cover more than half of it.