Validation with shotgun data

EricRaes commented 4 years ago

Hi Gavin,

Hope you are having a good day.

I was hoping I could bug you with a question. My aim is to validate the KO outputs from PICRUSt2 against my shotgun data, as you outlined nicely in the Nature paper and in the GitHub repo: https://github.com/gavinmdouglas/picrust2_manuscript/blob/master/scripts/analyses/validations/16S_vs_MGS/16S_vs_MGS_KO_validations.R

I have 12 samples for which I have 16S rRNA and shotgun data. I have been going through your script on github but unfortunately I am stuck.

a. In R I am loading the functions you listed in https://github.com/gavinmdouglas/picrust2_manuscript/blob/master/scripts/picrust2_ms_functions.R

b. I then execute the function 'read_in_ko_predictions' for my two KO files (pointing to my local paths)

c. and load the 'compute_ko_validation_metrics' function you listed in https://github.com/gavinmdouglas/picrust2_manuscript/blob/master/scripts/analyses/validations/16S_vs_MGS/16S_vs_MGS_KO_validations.R

At the end, unfortunately, I can't follow your script anymore. When you wrote:

# Loop over all dataset names: read in predictions (restrict to overlapping samples only,
# and get subsets with all possible KOs that overlap across tools filled in) and compute performance metrics.

datasets <- c("hmp", "mammal", "ocean", "blueberry", "indian", "cameroon", "primate")

At which stage did you define your dataset names, e.g., c("hmp", "mammal", "ocean", "blueberry", "indian", "cameroon", "primate")? I just have one project (two files: one for 16S rRNA and one for my shotgun data).

I am a bit confused as to how I can get to the validation (Spearman correlation) outputs.

Many thanks in advance for your help and suggestions!!

Best regards, Eric

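MetaG_ALL_Stations_November_KO.txt https://github.com/picrust/picrust2/files/4957881/MetaG_ALL_Stations_November_KO.txt

PICRUSt2_ALL_Stations_November_KO.txt https://github.com/picrust/picrust2/files/4957882/PICRUSt2_ALL_Stations_November_KO.txt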

gavinmdouglas commented 4 years ago

Hi there,

That code may take a little work to alter for use with other data.

However, you can run the correlations on just those two files with a little custom R code. First, restrict to KOs that could have been predicted as present by both approaches. Then fill in 0s for any of these KOs that could have been predicted but weren't (the function you cited shows which files I used for this purpose for PICRUSt2 and HUMAnN2). Finally, sort and subset the tables so they have the same samples and the same KO ordering, then loop over every sample name (i.e. column name) and calculate the Spearman correlation for each with cor.test.
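A minimal sketch of these steps in R, under stated assumptions: the two small tables and the two "possible KO" vectors below are made-up stand-ins (in practice the tables would be read from the two KO files and the possible-KO lists from the files used in the cited function), and the helper name fill_missing_kos is hypothetical.

```r
# Toy stand-ins for the two KO tables (rows = KOs, columns = samples).
# In practice, read the real files instead, e.g. with
# read.table(path, header = TRUE, sep = "\t", row.names = 1).
# All values below are made up purely for illustration.
picrust2 <- data.frame(S1 = c(10, 0, 5), S2 = c(3, 7, 1), S3 = c(2, 2, 9),
                       row.names = c("K00001", "K00002", "K00003"))
mgs <- data.frame(S2 = c(4, 6, 2), S1 = c(8, 1, 0), S4 = c(1, 1, 1),
                  row.names = c("K00002", "K00003", "K00004"))

# Hypothetical lists of KOs that each approach could in principle output.
possible_picrust2 <- c("K00001", "K00002", "K00003", "K00004")
possible_mgs <- c("K00002", "K00003", "K00004", "K00005")

# 1. Restrict to KOs that both approaches could have predicted.
shared_kos <- sort(intersect(possible_picrust2, possible_mgs))

# 2. Build a table with one row per shared KO, using 0 where a KO
#    could have been predicted but is absent from the table.
fill_missing_kos <- function(tab, kos) {
  out <- matrix(0, nrow = length(kos), ncol = ncol(tab),
                dimnames = list(kos, colnames(tab)))
  present <- intersect(kos, rownames(tab))
  out[present, ] <- as.matrix(tab[present, , drop = FALSE])
  out
}

# 3. Same samples and same KO ordering in both tables.
shared_samples <- sort(intersect(colnames(picrust2), colnames(mgs)))
p2 <- fill_missing_kos(picrust2, shared_kos)[, shared_samples, drop = FALSE]
mg <- fill_missing_kos(mgs, shared_kos)[, shared_samples, drop = FALSE]

# 4. Spearman correlation per sample with cor.test.
#    exact = FALSE avoids warnings about ties, which are common in KO tables.
results <- do.call(rbind, lapply(shared_samples, function(s) {
  ct <- cor.test(p2[, s], mg[, s], method = "spearman", exact = FALSE)
  data.frame(sample = s, rho = unname(ct$estimate), p_value = ct$p.value)
}))
print(results)
```

The result is one row per shared sample with its Spearman rho and p-value. With real data each table would hold thousands of KOs rather than three.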

Hopefully that helps!

Gavin

EricRaes commented 4 years ago

Hi Gavin,

Many thanks for your swift reply!

I quickly wanted to do a sanity check and ask whether I am doing the right thing here.

As you said I restricted to KOs that could have been predicted as present by both approaches and then filled in 0s for any of these KOs that could have been predicted but weren't => this results in my final file "KOs_which_overlap_with_MetaG_and_Picrust2"

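library(ggpubr)  # provides ggscatter()

# Combined table: rows are KOs, columns are per-sample MetaG and PICRUSt2 values
KOs_which_overlap_with_MetaG_and_Picrust2 <- read.csv("KOs_which_overlap_with_MetaG_and_Picrust2.csv", header = TRUE, row.names = 1)

# Keep only the two columns for sample 1
sample_cols <- c("Sample_1_MetaG_KO", "Sample_1_Picrust2_KO")
newdata <- KOs_which_overlap_with_MetaG_and_Picrust2[sample_cols]

# MetaG on the x axis, PICRUSt2 on the y axis; axis labels match the plotted columns
ggscatter(newdata, x = "Sample_1_MetaG_KO", y = "Sample_1_Picrust2_KO", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "spearman", xlab = "Sample_1_KO_MetaG", ylab = "Sample_1_KO_PICRUST2")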

and then ran a Spearman correlation; I get a correlation coefficient (R) back, and that's it? :-)

[Image: Spearman correlation for one sample, PICRUSt2 vs. MetaG KOs]

gavinmdouglas commented 4 years ago

Hi @EricRaes, yes that looks like the approach!
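For anyone extending this from one sample to all of them, here is a rough sketch that loops over every sample in a combined table. It assumes the column naming used above (<sample>_MetaG_KO and <sample>_Picrust2_KO); the toy table and its values are made up for illustration and would be replaced by the table read from the CSV.

```r
# Toy combined table (rows = KOs); values are made up for illustration.
combined <- data.frame(
  Sample_1_MetaG_KO    = c(5, 0, 3, 8, 1),
  Sample_1_Picrust2_KO = c(4, 1, 2, 9, 0),
  Sample_2_MetaG_KO    = c(0, 2, 7, 1, 6),
  Sample_2_Picrust2_KO = c(1, 0, 6, 3, 5),
  row.names = paste0("K0000", 1:5)
)

# Recover the sample names from the MetaG column names.
metag_cols <- grep("_MetaG_KO$", colnames(combined), value = TRUE)
samples <- sub("_MetaG_KO$", "", metag_cols)

# Spearman correlation between the two columns of each sample.
rho <- sapply(samples, function(s) {
  cor(combined[[paste0(s, "_MetaG_KO")]],
      combined[[paste0(s, "_Picrust2_KO")]],
      method = "spearman")
})
print(rho)
```

The named vector rho holds one Spearman coefficient per sample, which can then be summarised (for example with summary(rho)) across the 12 samples.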

EricRaes commented 4 years ago

Thanks heaps!

picrust / picrust2

Validation with shotgun data #129
