Open goodolgab opened 6 days ago
Usually I'd expect the same p-value, but opposite direction of effect (sign of beta). Is this what you see? It's the effect direction that tells you which host the association is with
Thanks for the speedy reply!
I've looked at both gene_hits.txt files and every file after the initial pyseer command:
pyseer --phenotypes host1.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host1.assoc
pyseer --phenotypes host2.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host2.assoc
And both of these .assoc files are exactly the same.
I have tried reproducing this behavior with the data we use to run the CI tests and I get opposite betas if the binary phenotype is reversed? could you please double check your inputs?
Thanks for helping out!
I did some digging and the initial pyseer.host1.assoc and pyseer.host1.assoc files I created using
pyseer --phenotypes host1.pheno.txt --kmers unitigs.txt.gz --distances phylonium_distance_matrix_done.tsv.txt --covariates covariates.txt --filter-pvalue 1E-5 --output-patterns patterns.txt --cpu 8 > pyseer.host1.assoc
had opposite betas.
I wanted to see which genes these unitigs mapped onto, so I continued on with the
annotate_hits_pyseer
and
summarise_annotations.py
functions and it creates the same gene_hits_host1.txt and gene_hits_host2.txt.
I looked at the avg_beta for both outputs and they're both the same value. Is this what I should be expecting? I understand that higher beta values suggest that the presence of the k-mer has a stronger effect or association with one of the hosts. But does a lower beta values suggest a weaker association or possibly an association with the other host? Should I have also expected negative beta values?
I hope you can help!
This is as expected. The sign of the beta tells you which host the unitig is associated with. It may help to review some literature in the area e.g. https://figshare.com/articles/thesis/The_background_of_bacterial_GWAS/5550037?file=9624721 and linked references if this is unclear.
I would also recommend following: https://pyseer.readthedocs.io/en/latest/best_practices.html Particularly using the llm, which gives a much better false positive rate
Hi! Thanks for the doc links and the recommendations!
I've looked at the code + outputs and I think I've found out why the final outputs provide the same beta values (and not reversed as expected).
In the summarise_annotations.py script, there's a line that takes the absolute value
beta = abs(float(anot_fields[4]))
removing the abs and replacing it with
beta = float(anot_fields[4])
keeps the direction of beta.
Hi!
Just a small question about the phenotypes.tsv.
I'm looking at associations between two different hosts in my dataset and I've created two separate files where the binary traits are reversed (With the key goal to see associated genes with either host).
However, the final output files for both are exactly the same. Is this what I should expect? Or should it be two different files with genes that are associated with either host?
I hope you can help!