Mapping hits using a pangenome

mgalardini / pyseer

SEER, reimplemented in python 🐍🔮

http://pyseer.readthedocs.io

Apache License 2.0

111 stars 27 forks source link

Mapping hits using a pangenome #183

Open LeonardosMageiros opened 3 years ago

LeonardosMageiros commented 3 years ago

Hi,

I have executed pyseer and I would like to map my unitigs not to the list of my input files but back to just one fasta/gff file that represents my pangenome.

This file contains representative sequences of all the genes in my dataset (produced using roary). So if I understand things correctly using it as a ref in your annotate script should map all the hits in my results back to a gene. Nonetheless I have many unmapped unitigs in my output.

Am I missing something? Any help would be much appreciated.

Best Leonardos

johnlees commented 3 years ago

For a single reference, you probably want to use this method rather than the annotate script: https://pyseer.readthedocs.io/en/master/usage.html#mapping-to-references-phandango

There will be many unitigs from assemblies that are not in a roary pangenome - those in intergenic regions, those not annotated correctly in the input etc. This also sounds like a mapping issue if you're using representatives. Some genes are very diverse and you may need to use different settings for the mapping, or a more sensitive aligner such as bowtie2.

This isn't something we directly support in pyseer however

LeonardosMageiros commented 3 years ago

Thank you very much for your quick response. Things are clearer now.

Just one last question to see if my understanding is correct.

I have a dataset of ~1500 strains and I have performed 2 GWAS runs on two different subsets of isolates. My goal is to find the overlap of these two runs on the gene level.

If I use the annotate script with my pangenome on top as ref and then all the strains participating in the two GWAS runs below as draft sequences, would the mapping of the unitigs be consistent in the two cases so I can calculate the overlapping genes?

Do you see any reason for the above approach not to work and if yes do you suggest another approach to tackle my problem?

Thank you very much in advance Leonardos

johnlees commented 3 years ago

Annotation is all done through mapping, so is with respect to whichever reference you use. If you use the same sequences to map to, they will have the same co-ordinate system so can be compared directly.

LeonardosMageiros commented 2 years ago

Can I also ask if there is a way to map the results of a gene presence absence analysis to the same sequences as I map my unitigs GWAS?