mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
104 stars 25 forks source link

lineage specific unitigs #238

Closed snackens closed 1 year ago

snackens commented 1 year ago

Hello, thank you for this excellent tool.

I'd like to investigate genetic differences among the populations defined by population structure analysis. (pop1 n=67, pop2 n=250) But I got an error while creating qqplot.

At first, to count unitigs from all strains sequences, I used unitig-caller with following code, then obtained out_unitig.pyseer. unitig-caller --call --refs input.txt --out out_unitig

Then I performed pyseer with --no-distances option pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --no-distances --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-out.txt This pyseer run finished within 1 hour.

pyseer-out.txt included a lot of bad-chisq.

Then, I tried to create qqplot, python qq_plot.py pyseer-out.txt but I got an error and qqplot was something strange.

pyseer/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: invalid value encountered in log10 result = getattr(ufunc, method)(*inputs, **kwargs)

qq_plot

Could you please how should I interpret?

Best regard,

mgalardini commented 1 year ago

Could you please paste here your pyseer-out.txt table? from the looks of it you have negative p-values that have been used to make the qq-plot, which may indicate some problems with the table

snackens commented 1 year ago

Thank you for your reply. Here is the part of pyseer-out.txt. pyseer-out2.txt

In addition, actually I have 3 populations, so when I tried to investigate pop1 specific unitigs, I specified pop1 "0", pop2&3 "1" in trait file. Does it affect the results? And the number of samples was enough to analysis(n=67 vs n=250)?

Best regards,

mgalardini commented 1 year ago

Ah I see, this is a bug on formatting the output table when using --no-distances (I think). For now you can solve the issue by changing the header of the output table manually so that there is an additional column name (e.g. X) between intercept and notes. I'll folow up later with a fix

mgalardini commented 1 year ago

Found the error, and will merge as soon as tests are done. Will likely make a new release after that so the fix is available through conda. For now the manual fix should work.

snackens commented 1 year ago

Thank you so much. I tried to fix manually using following code. cat pyseer-out.txt | awk 'BEGIN{ OFS="\t" }{print $1,$2,$3,$4,$5,$6,$7,"X",$8}' > pyseer-out2.txt qq_plot2 Then I got this qqplot. This means p-values were inflated?

snackens commented 1 year ago

I think in this case, there are a lot of false negative unitings. Are there any suggestions to improve?

Thanks,

mgalardini commented 1 year ago

I believe that the relatively low sample size and the absence of population structure correction might be to blame. Are at least the lowest p-values given to things you were expecting?

snackens commented 1 year ago

Thank you for reply.

I tried to correct population structure using two ways.

When running the fixed effects mode, this qqplot was obtained.(fixed effects model) pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --distances mash.tsv --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-u-mash.txt qq_plot-mash2

When running FaST-LMM, this qqplot was obtained. pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --lmm --similarity genmotype_kinship.tsv --output-patterns unitigs_pattern.txt --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-u-llm.txt qq_plot-lmm3

I think the latter setting is better for my dataset, could you please give me your opinion?

Then I checked the gene which includes significant unitigs, the gene was one of interested genes detected by another analysis(not gwas).

Best regards,

mgalardini commented 1 year ago

They both look fine to me (with the caveat that I'm not familiar with the phenotypic data you have), and the second one looks like a textbook example of a qq-plot, and so you are possibly right.

Closing the issue for now but do let me know if anything else looks unclear.

snackens commented 1 year ago

Thank you for your kindness.