False positive results caused by multicopy sequences such as rRNA or Transposase

Daikuang commented 3 years ago

I ran pyseer to look for genes/SNPs associated with isolation source (e.g. blood vs feces). I get some significant hits with extremely low p-values (high -log10(p-value)), most of which fall in rRNA or transposase regions. However, I suspect these are false positives, since almost all of the k-mers in rRNA or transposase regions have extremely low p-values. So I wonder whether there could be a bug when significant k-mers occur in multicopy regions. Another question: when I use phandango_mapper to map the significant k-mers to a reference genome, can the plot distinguish k-mers at different locations if those k-mer sequences are present in multiple copies? Thanks, Dai

johnlees commented 3 years ago

It would help if you could share your QQ plot, sample size, species, model used, and command that you ran. But, this could be caused by a number of things:

Poor mapping of the k-mers. The results are real, but the mapping cannot distinguish between copies of sequence shorter than the k-mer length (as in rRNA and transposases).
Poorly controlled population structure. Similarly, if you have transposons which appear independent of genetic background in different strains, but strains have different proportions of phenotype positive, this can result in a synthetic association.
Low frequencies of the k-mers, which should be filtered out.
A true association.

Therefore I think the reason for this is biological/genetic, rather than in the software.

With respect to plotting all mapping positions for k-mers, the script should also plot any secondary mappings observed for each sequence. Could you elaborate on the issue you are having with this?
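
As a rough way to check whether a significant k-mer is multicopy in the reference, the idea of listing every mapping position can be sketched in plain Python. This is only an illustration under stated assumptions: exact string matching on both strands stands in for a real aligner, the function names are hypothetical, and the reference and k-mers are toy sequences, not pyseer or phandango_mapper output.

```python
# Illustrative sketch: exact matching only, not how phandango_mapper works internally.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack, needle):
    """Yield the 0-based start of every (possibly overlapping) exact match."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)


def kmer_hits(reference, kmers):
    """Map each k-mer to a sorted list of (position, strand) exact hits."""
    ref = reference.upper()
    hits = {}
    for kmer in kmers:
        fwd = kmer.upper()
        rev = reverse_complement(fwd)
        positions = [(pos, "+") for pos in find_all(ref, fwd)]
        if rev != fwd:  # avoid double-counting palindromic k-mers
            positions += [(pos, "-") for pos in find_all(ref, rev)]
        hits[kmer] = sorted(positions)
    return hits


# Toy reference (hypothetical): GATTACA occurs twice forward and once reversed.
reference = "CCGATTACATTTGATTACAGGTGTAATCAA"
hits = kmer_hits(reference, ["GATTACA", "ACAGGTG"])
multicopy = [kmer for kmer, pos in hits.items() if len(pos) > 1]

for kmer, pos in hits.items():
    print(kmer, pos)
print("multicopy:", multicopy)
```

K-mers reported in `multicopy` are the ones whose plotted position is ambiguous, so a peak at any one of their locations should be read as "one of these copies" rather than as a specific locus.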

Daikuang commented 3 years ago

Thanks a lot.

My QQ plot is attached. The species is Klebsiella pneumoniae, and I used about 500 genomes from one source and about 600 genomes from another source.

Basically, I ran the command and did the analysis following the GWAS tutorial that you used for the penicillin resistance GWAS analysis.

I used the source of each strain as the phenotype input, the k-mer file generated by fsm-lite as the k-mer input, and the mash distance file as the population structure input:

pyseer --phenotypes phenotypes.tsv --kmers kmers.gz --distances structure.tsv --min-af 0.01 --max-af 0.99 --cpu 15 --filter-pvalue 1E-8 > pyseer.assoc
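
For reference, here is a minimal sketch of what the `--min-af 0.01` and `--max-af 0.99` thresholds mean in sample counts for roughly 1100 genomes. It assumes the frequency is the fraction of samples carrying the k-mer and that both bounds are inclusive; the exact comparison pyseer uses is not confirmed here, and the helper name is hypothetical.

```python
from fractions import Fraction
import math


def af_bounds_as_counts(n_samples, min_af, max_af):
    """Convert frequency thresholds (given as strings) to sample counts.

    Assumes inclusive bounds on the fraction of samples carrying the k-mer.
    Fractions are used so that 0.01 * 1100 is exact rather than a float.
    """
    lowest = math.ceil(Fraction(min_af) * n_samples)
    highest = math.floor(Fraction(max_af) * n_samples)
    return lowest, highest


n_samples = 500 + 600  # approximate sample sizes stated above
lowest, highest = af_bounds_as_counts(n_samples, "0.01", "0.99")
print(f"k-mers kept if present in {lowest} to {highest} of {n_samples} samples")
```

Under these assumptions a k-mer present in only about a dozen genomes still passes the filter, which is worth keeping in mind when judging whether low-frequency k-mers are driving the signal.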

Best, Dai

johnlees commented 3 years ago

I can't see the QQ plot in the message, but might I suggest using the LMM mode as well, following the best practices.

mgalardini commented 3 years ago

Closing for lack of follow-up messages

mgalardini / pyseer

False positive results caused by multicopy sequences such as rRNA or Transposase #144