mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
110 stars 27 forks source link

Reversing the phenotype.tsv and outputs #279

Open goodolgab opened 6 days ago

goodolgab commented 6 days ago

Hi!

Just a small question about the phenotypes.tsv.

I'm looking at associations between two different hosts in my dataset and I've created two separate files where the binary traits are reversed (With the key goal to see associated genes with either host).

host1.pheno.tsv Genome host host1 0 host2 1

host2.pheno.tsv Genome host host2 1 host1 0

However, the final output files for both are exactly the same. Is this what I should expect? Or should it be two different files with genes that are associated with either host?

I hope you can help!

johnlees commented 6 days ago

Usually I'd expect the same p-value, but opposite direction of effect (sign of beta). Is this what you see? It's the effect direction that tells you which host the association is with

goodolgab commented 5 days ago

Thanks for the speedy reply!

I've looked at both gene_hits.txt files and every file after the initial pyseer command:

pyseer --phenotypes host1.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host1.assoc

pyseer --phenotypes host2.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host2.assoc

And both of these .assoc files are exactly the same.

mgalardini commented 5 days ago

I have tried reproducing this behavior with the data we use to run the CI tests and I get opposite betas if the binary phenotype is reversed? could you please double check your inputs?

goodolgab commented 5 days ago

Thanks for helping out!

I did some digging and the initial pyseer.host1.assoc and pyseer.host1.assoc files I created using pyseer --phenotypes host1.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host1.assoc had opposite betas.

I wanted to see which genes these unitigs mapped onto, so I continued on with the annotate_hits_pyseer and summarise_annotations.py functions and it creates the same gene_hits_host1.txt and gene_hits_host2.txt.

I looked at the avg_beta for both outputs and they're both the same value. Is this what I should be expecting? I understand that higher beta values suggest that the presence of the k-mer has a stronger effect or association with one of the hosts. But does a lower beta values suggest a weaker association or possibly an association with the other host? Should I have also expected negative beta values?

I hope you can help!

johnlees commented 4 days ago

This is as expected. The sign of the beta tells you which host the unitig is associated with. It may help to review some literature in the area e.g. https://figshare.com/articles/thesis/The_background_of_bacterial_GWAS/5550037?file=9624721 and linked references if this is unclear.

I would also recommend following: https://pyseer.readthedocs.io/en/latest/best_practices.html Particularly using the llm, which gives a much better false positive rate

goodolgab commented 2 days ago

Hi! Thanks for the doc links and the recommendations!

I've looked at the code + outputs and I think I've found out why the final outputs provide the same beta values (and not reversed as expected).

In the summarise_annotations.py script, there's a line that takes the absolute value beta = abs(float(anot_fields[4]))

removing the abs and replacing it with beta = float(anot_fields[4])

keeps the direction of beta.