lentendu / NetworkNullHPC

OTU co-occurrence network inference based on a null model for HPC
MIT License

OTU Table Read Abundance vs. Presence/Absence Data #21

Open timz0605 opened 4 months ago

timz0605 commented 4 months ago

Hello,

First of all, thanks for the code and package! It is something I have been thinking about and trying to do myself, and I am glad to see that work has already been done on it.

For the input OTU table, I was wondering if it only considers read count data? We all know that many potential biases can be introduced during the PCR process and the bioinformatics pipeline. Therefore, in many metazoan metabarcoding studies, people convert the read counts to presence/absence data (1 vs. 0) for downstream analyses. So, I am curious which approach this code takes.
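For context, the conversion I mean is simply thresholding the counts; in R it would be something like the following (where `otu` is a hypothetical samples-by-OTUs read count matrix, just for illustration):

```r
# Hypothetical samples-by-OTUs read count matrix
set.seed(42)
otu <- matrix(rpois(5 * 10, lambda = 2), nrow = 5,
              dimnames = list(paste0("sample", 1:5), paste0("OTU", 1:10)))

# Convert read counts to presence/absence (1 vs. 0)
otu_pa <- (otu > 0) * 1L
```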

lentendu commented 4 months ago

Hi,

there is no special implementation in the code to handle 1/0 data. If you use presence/absence data, you probably want to skip the normalization of read counts by using the option -n no. The rest is based on Spearman's rank correlation and randomized matrices, so you still need to choose the null model that suits your data.

I have not tested the analysis of 1/0 data. In microbiology we also have the sequencing depth bias, but we consider that the relative abundance is still valuable information. A log or square-root transformation of the relative abundances is then recommended to reduce the weight of hyper-abundant taxa, which are sometimes due to PCR amplification bias (i.e. using option -n ratio_log or -n ratio_sqrt). So, you might want to run NetworkNullHPC on a test dataset for which you are sure about the counts, to investigate the potential impact of the 1/0 transformation on the co-occurrence and co-exclusion results.
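To illustrate the general principle only, here is a toy R sketch of Spearman correlation compared against a column-permutation null; it is not the actual null models or thresholding implemented in NetworkNullHPC:

```r
set.seed(1)
# Toy samples-by-OTUs matrix (here already presence/absence)
otu <- matrix(rbinom(20 * 50, size = 1, prob = 0.4), nrow = 20)

# Observed Spearman's rank correlations between all OTU pairs
obs_cor <- cor(otu, method = "spearman")

# Null distribution: correlations after shuffling each OTU column independently,
# which breaks associations between OTUs while keeping their prevalence
null_cor <- replicate(99, {
  shuffled <- apply(otu, 2, sample)
  cor(shuffled, method = "spearman")
})

# The observed correlations can then be compared to this null distribution to
# decide which pairs co-occur (or co-exclude) more often than expected by chance
```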

timz0605 commented 4 months ago

Hello @lentendu,

Thank you for the quick response!

I am relatively new to the Linux system and to running programs that use a combination of different languages. I was wondering if you could help me with the process? I am trying to run this locally on my computer using WSL. I have installed R in WSL along with all the required packages.

lentendu commented 4 months ago

As mentioned in the readme, this tool is only for Linux servers with a SLURM job scheduler.

The individual R scripts are available in the rscripts directory if you want to re-implement it as a single script, but I cannot invest time in that.

Alternatives are the original code of Connor, Barberán and Clauset (2017) in MATLAB, or a different way to produce networks, e.g. using the RMThreshold R package to detect the correct Spearman's rank correlation threshold; see for example Bunick et al. (2021).
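A rough sketch of that alternative route, assuming the RMThreshold package's rm.get.threshold()/rm.denoise.mat() functions and a samples-by-OTUs table `otu` (please check the package documentation for the exact arguments):

```r
library(RMThreshold)

# Spearman's rank correlation matrix from a samples-by-OTUs table `otu`
cor_mat <- cor(otu, method = "spearman")
diag(cor_mat) <- 0

# Use random matrix theory to suggest a signal/noise threshold
# (produces diagnostic plots; inspect them to pick the value)
res <- rm.get.threshold(cor_mat)

# Zero out correlations below the chosen threshold before building the network
adj <- rm.denoise.mat(cor_mat, threshold = 0.7)  # 0.7 is just a placeholder
```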

timz0605 commented 4 months ago

Hello @lentendu,

I have had some preliminary success running the whole program (after some debugging and editing the script to fit the HPC I use), and I guess the next step for me will be to play around with the parameters and see how they affect my results.

Meanwhile, I want to double-check that I have the format of the OTU table correct. You mentioned in the readme that rows are samples and columns are OTUs, correct? I ask because the OTU table output by a bioinformatics pipeline (say vsearch) usually has OTUs as rows and samples/locations as columns.
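Just so my question is concrete, this is roughly how I would flip a vsearch-style table (OTUs as rows) into the samples-as-rows orientation; the file names and column layout are placeholders:

```r
# Read a vsearch-style OTU table: first column = OTU IDs, remaining columns = samples
tab <- read.table("otutab.txt", header = TRUE, sep = "\t",
                  row.names = 1, check.names = FALSE)

# Transpose so that rows become samples and columns become OTUs
otu <- t(as.matrix(tab))
write.table(otu, "otutab_samples_by_otus.txt",
            sep = "\t", quote = FALSE, col.names = NA)
```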

timz0605 commented 4 months ago

Besides, I am also curious about how you visualize the network after you obtain the edge list as the final output. In the paper, you plotted the network where each node represents one OTU and an edge between two nodes represents a significant co-occurrence. I was wondering if you had other thoughts or intuitions while exploring the data?

Right now, using all default options, I am only able to obtain approx. 10 pairs of OTUs which show significant co-occurrence patterns (not ideal for visualizing with network methods). However, the median Spearman's rank correlation values for those pairs are all above 0.9. I was wondering if it is possible to select/filter/adjust the threshold, e.g. so that all pairs with a correlation value above 0.5 or 0.8 are retained.
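For reference, the kind of post-processing I have in mind looks roughly like this (igraph-based, with made-up file and column names for the edge list):

```r
library(igraph)

# Edge list with one row per significant OTU pair; names here are placeholders
edges <- read.table("network_edges.txt", header = TRUE, sep = "\t")

# Optional extra filter on the correlation value
edges <- edges[edges$correlation >= 0.8, ]

# Nodes are OTUs, edges are significant co-occurrences
g <- graph_from_data_frame(edges, directed = FALSE)
plot(g, vertex.size = 5, vertex.label = NA)
```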

lentendu commented 4 months ago

Hi @timz0605, here are my replies to your last questions: