larssnip / micropan

R package for microbial pangenomics
21 stars 0 forks source link

rarefaction() produces the same maximum value for all the permutations #9

Closed matrs closed 3 years ago

matrs commented 3 years ago

Hello, thanks for this library, it's been very useful to me. I'm working with a table of orthougroups from orthofinder, which looks like this when converted into a matrix to use it with micropan v2.1

                                   OG0000000 OG0000001 OG0000002 OG0000003 OG0000004
GCF_000153905.1_ASM15390v1_genomic         7         5         5         5         4
GCF_000157975.1_ASM15797v1_genomic         6         6         3         0         2
GCF_000424085.1_ASM42408v1_genomic         7         8         3         6         4
GCF_000466565.1_ASM46656v1_genomic         9         7         6         5         4
GCF_000484655.1_ASM48465v1_genomic        17        13        14        11         8

When I compute the rarefaction curves with rar_tbl_all <- rarefaction(m_all, n.perm = 1000) , I noticed that the boxplots always have the same maximum value for every permutation. When I checked the rar_table_all, I see that for every permutation, It has multiple values equal to 14596:

> colSums(rar_tbl_all ==  14596)[1:10]
Genome  Perm1  Perm2  Perm3  Perm4  Perm5  Perm6  Perm7  Perm8  Perm9 
     0      2     14    117     80     52    186    206    150     71 

rarification_dist_sm

What could be a cause for this behaviour?

larssnip commented 3 years ago

Hello,

I think you have been mislead by the name of this function. Despite its name, it is not the same as a rarefaction from ecology. I will change its name!

This function will only consider if genes are present (1) or absent (0) in a genome, not their copy number. Next, it will permute the ordering of the genomes, and then compute the cumulative sum of genes being present after including 0,1,2,... genomes. Since it always ends by including all genomes, all curves end at the total number of genes each time.

I would have a look at the rarefy() function of the vegan-package, could this be what you are seeking?

matrs commented 3 years ago

Hello Lars, thanks for your clear explanation. Indeed, I was mislead by the name of the function, although it's my fault for not checking the details of the code. I'm a biologist but I'm new to this field, so I'm clueless about how commonly this confusion could arise. The function you mention from vegan looks like the one I was looking for. Thanks for your help.

Jose Luis