mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
109 stars 27 forks source link

Question about lineage/loci association #195

Closed smb20200615 closed 2 years ago

smb20200615 commented 2 years ago

Hello,

I am interested to know whether there are sequence types and loci that are associated with my phenotype of interest. Is there a command you would recommend for this question? Also, I am not sure how to provide the distances parameter - is that from the phylogenetic tree similarity matrix.

Here is what I came up with. I get an error that the distances parameter has not been provided properly. I input --similarity based on the phylogenetic tree

        pyseer --lmm --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --min-af 0.02 --max-af 0.98 --no-distances --phenotypes {input.pheno} --similarity {input.similarity} --uncompressed --kmers {input.unitigs} --phenotype-column {params.pheno} --covariates {input.metadata} --use-covariates 2 3 --output-patterns {output.patterns} --cpu 30 > {output.significance}

Best wishes,

johnlees commented 2 years ago

I would recommend that you work through the GWAS tutorial first: https://pyseer.readthedocs.io/en/master/tutorial.html

Then have a look at the best practises guide: https://pyseer.readthedocs.io/en/master/best_practices.html

For an association with sequence types, you'll want to do use no distances: https://pyseer.readthedocs.io/en/master/usage.html#no-population-structure-correction

johnlees commented 2 years ago

In your given command, try removing no distances

pyseer --lmm --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --min-af 0.02 --max-af 0.98  --phenotypes {input.pheno} --similarity {input.similarity} --uncompressed --kmers {input.unitigs} --phenotype-column {params.pheno} --covariates {input.metadata} --use-covariates 2 3 --output-patterns {output.patterns} --cpu 30 > {output.significance}

Similarity is usually generated from a tree: https://pyseer.readthedocs.io/en/master/usage.html#phylogeny-based

smb20200615 commented 2 years ago

Thank you so much for your prompt reply. I actually don't have --no-distances in my command. Do you mean I should remove the --similarity and add --no-distances?

pyseer --lmm --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --min-af 0.02 --max-af 0.98  --phenotypes {input.pheno} --no-distances --uncompressed --kmers {input.unitigs} --phenotype-column {params.pheno} --covariates {input.metadata} --use-covariates 2 3 --output-patterns {output.patterns} --cpu 30 > {output.significance}
smb20200615 commented 2 years ago

Also a more general question. Does a unitig based GWAS also give information on gene presence absence? I know in the tutorial you recommend running both unitig and presence-absence of genes GWAS but intuitively I would expect that running both would be redundant.

Also thank you so much for the amazing resources. The brief introduction to GWAS, the best practices, and the tutorials have been an amazing resource for someone unfamiliar with the field

johnlees commented 2 years ago

Thank you so much for your prompt reply. I actually don't have --no-distances in my command. Do you mean I should remove the --similarity and add --no-distances?

pyseer --lmm --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --min-af 0.02 --max-af 0.98  --phenotypes {input.pheno} --no-distances --uncompressed --kmers {input.unitigs} --phenotype-column {params.pheno} --covariates {input.metadata} --use-covariates 2 3 --output-patterns {output.patterns} --cpu 30 > {output.significance}

In the command in the first post, no distances was included. Please can you give the output error message, and the exact command used.

Also a more general question. Does a unitig based GWAS also give information on gene presence absence? I know in the tutorial you recommend running both unitig and presence-absence of genes GWAS but intuitively I would expect that running both would be redundant.

Indeed it does, but sometimes associating the genes can be useful too (and easier to interpret)

Also thank you so much for the amazing resources. The brief introduction to GWAS, the best practices, and the tutorials have been an amazing resource for someone unfamiliar with the field

Thank you, glad you find them useful

smb20200615 commented 2 years ago

The command was:

        pyseer --lmm --similarity {input.similarity} --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --phenotypes {input.pheno} --uncompressed --kmers {input.unitigs} --covariates {input.metadata} --use-covariates 2 4 --output-patterns {output.patterns} --cpu {threads} > {output.significance}

The error was "Must also provide a distance matrix to report lineage effects"

I consulted this repo and found this command https://github.com/mgalardini/2018_ecoli_pathogenicity/blob/master/Snakefile

      'pyseer --phenotypes {input.phenotype} --phenotype-column killed --kmers {input.kmers} --uncompressed --max-dimensions {params.dimensions} --lineage --lineage-file {output.lineage} --cpu {threads} --output-patterns {output.patterns} --distance {input.dist} --lmm --similarity {input.sim} --lineage-clusters {input.baps} 2>&1 > {output.associations} | grep \'h^2\' > {output.h2}'

Does this mean that I need both a mash distance and a phylogeny similarity matrix?

Also can we have categorical variables for the phenotype to the GWAS? From the manual it seems like pyseer only accepts binary and continuous phenotypes.

Sorry to add more questions but I tried to use enet to compare with the GWAS output. I get an R^2 of -2.7 in the prediction stage for an external dataset but 0.8 in the training. I am guessing the perfect correlation would be an R^2 of 0.1, Does this mean that the genomic analysis on the training is not generalizable to other datasets? Also, does this mean I shouldn't trust the GWAS results as there may not be a strong genetic component to the phenotype since it doesn't generalize across studies?

johnlees commented 2 years ago

The command was:

        pyseer --lmm --similarity {input.similarity} --lineage --lineage-file {output.lineage} --lineage-clusters {input.clusters} --phenotypes {input.pheno} --uncompressed --kmers {input.unitigs} --covariates {input.metadata} --use-covariates 2 4 --output-patterns {output.patterns} --cpu {threads} > {output.significance}

The error was "Must also provide a distance matrix to report lineage effects"

I consulted this repo and found this command https://github.com/mgalardini/2018_ecoli_pathogenicity/blob/master/Snakefile

    'pyseer --phenotypes {input.phenotype} --phenotype-column killed --kmers {input.kmers} --uncompressed --max-dimensions {params.dimensions} --lineage --lineage-file {output.lineage} --cpu {threads} --output-patterns {output.patterns} --distance {input.dist} --lmm --similarity {input.sim} --lineage-clusters {input.baps} 2>&1 > {output.associations} | grep \'h^2\' > {output.h2}'

Does this mean that I need both a mash distance and a phylogeny similarity matrix?

Ah in this case, yes. The similarity matrix is used for the actual GWAS, and the distance matrix is used for the lineage effect p-values. You will therefore need both here. If you just want to run the association, remove the two lineage options and max dimensions and it should work.

Also can we have categorical variables for the phenotype to the GWAS? From the manual it seems like pyseer only accepts binary and continuous phenotypes.

This is correct. In pyseer, you can divide a categorical variable into multiple binary outcomes if you wish to analyse it.

Sorry to add more questions but I tried to use enet to compare with the GWAS output. I get an R^2 of -2.7 in the prediction stage for an external dataset but 0.8 in the training. I am guessing the perfect correlation would be an R^2 of 0.1, Does this mean that the genomic analysis on the training is not generalizable to other datasets? Also, does this mean I shouldn't trust the GWAS results as there may not be a strong genetic component to the phenotype since it doesn't generalize across studies?

I suppose this would be with a binary phenotype, you can sometimes get unintuitive R^2 values. Perfect would be R^2 = 1.

As for the inference here: maybe, it doesn't look like a good predictor outside of the dataset in this case. It can be useful to plot predicted and actual phenotype values to get more insight. But this doesn't necessarily mean GWAS findings aren't replicable, as these may only form a small part of R^2

smb20200615 commented 2 years ago

Thank you for your thorough explanation! I was wondering what you mean by "But this doesn't necessarily mean GWAS findings aren't replicable, as these may only form a small part of R^2".

Also, I couldn't find in the manual where we could find the predicted and actual phenotype. Would this be in the selected.txt file?

johnlees commented 2 years ago

Thank you for your thorough explanation! I was wondering what you mean by "But this doesn't necessarily mean GWAS findings aren't replicable, as these may only form a small part of R^2".

As an example: you may have a very significant p-value, and strong effect size, but low allele frequency. This may be causal in the isolates it is present in, but is unlikely to predict the phenotype well for all the isolates (which is what R^2 measures). It may not even be present in another population – but the effect and its cause are still real.

Also, I couldn't find in the manual where we could find the predicted and actual phenotype. Would this be in the selected.txt file?

The actual phenotype would be from your own data. Predicted phenotype see https://pyseer.readthedocs.io/en/master/predict.html. You'll need to run enet_predict on your data.

smb20200615 commented 2 years ago

Thank you so much for your explanations. Final question on this thread but I am using panaroo's presence absence matrix for a presence-absence GWAS. I was wondering how to differentiate ssl5_1~ssl5_2~ssl5_3~ssl5 and ssl5_2~ssl5_3~ssl5_1. I see one is positively associated with the phenotype and the other is negatively associated. Are they both different variants of the ssl gene? What about psuK~psuK_2~RBKS_1~psuK_1 where it seems like there are multiple genes separated by ~~. Also, since I am only interested in the presence and absence of genes, should I somehow remove these ~~ separated genes?

johnlees commented 2 years ago

I'm afraid this is a panroo question, not a pyseer question 🙂

smb20200615 commented 2 years ago

Agreed. Thanks again for your prompt and through explanations