moshi4 / pyGenomeViz

A genome visualization python package for comparative genomics
https://moshi4.github.io/pyGenomeViz
MIT License
280 stars 23 forks source link

Pseudogene arrows are not plotted in pgv-mummer #7

Closed kneubehl closed 1 year ago

kneubehl commented 1 year ago

Hello,

Awesome tool! I am using pgv-mummer v0.3.0 and I am noticing that the pseudogenes are not being plotted with arrows. The alignments for those regions are being plotted but there are no arrows present to represent where that "CDS" would be. Not sure if there is a setting I need to change but I suspect this would be with how the genbank files are read in and handled. Any guidance would be appreciated. Thanks!

-Alex

moshi4 commented 1 year ago

Hi, @kneubehl

pgv-mummer (and pgv-mmseqs) only plots CDS features that have a translation qualifier in the Genbank file. Therefore, as you pointed out, CDS features of pseudogenes without translation qualifiers are not plotted.

I will consider how to handle the plotting of pseudogenes. Please wait a moment.

moshi4 commented 1 year ago

Note for myself

Pseudogene of the CDS feature is excluded in the extract_features() method of the Genbank parser class.

https://github.com/moshi4/pyGenomeViz/blob/617caa6d0cf0defed4bf6faf957c1fd990e825d1/src/pygenomeviz/parser/genbank.py#L253-L257

kneubehl commented 1 year ago

If MUMmer is using protein sequences it will be kinda messy to use the pseudogene translations. I was initially operating under the assumption that the pgv-mummer was using nucleotide sequence. But I am noticing 2 issues (1) differences between the nucleotide and protein-based results and (2) linkages between sequences that do not match up with a CDS (there is a pseudogene there so no arrow but still a linkage). (1) Seems to be at the edges of the genbank sequence. I included in the example picture the command I used and indicated where I see the two issues. This is an example of how the 5' and 3' terminal CDSs in the top sequence should have links between both sequences but there are no links in the protein default seqtype but there is when nucleotide is used as the seqtype option as well as linkages between sequences to areas where there are no CDSs plotted. pygenomeviz_issue

moshi4 commented 1 year ago

Sorry for the delay in answering.

(1) differences between nucleotide and protein-based results

MUMMer uses the nucmer program for nucleotide-level sequence comparison and the promer program for protein-level (6 reading frame translation) comparison. Since nucmer and promer use different algorithms for comparison, there are naturally some differences in comparison results.

This problem comes from MUMmer, so it is honestly difficult for me to solve the problem. It may be possible to reduce the differences in comparison results by adjusting the detailed parameters of MUMmer, but it is a bit troublesome to implement such a function.

(2) linkages between sequences that do not match up with a CDS (there is a pseudogene there so no arrow but still a linkage)

The newly released pyGenomeViz v0.3.1 solves this problem by using the --pseudo option. It is also possible to change the color with the --pseudo_color option.

Below is an example of comparison visualization with and without pseudogene in E.coli using pgv-mummer.

1. Disable --pseudo option

pgv-download-dataset -n escherichia_coli
pgv-mummer --gbk_resources NC_000913.gbk:235000-265000 NC_002695.gbk:270000-300000 \
-o ecoli_no_pseudo --feature_plotstyle arrow --link_track_ratio 3.0

result

2. Enable --pseudo option

pgv-download-dataset -n escherichia_coli
pgv-mummer --gbk_resources NC_000913.gbk:235000-265000 NC_002695.gbk:270000-300000 \
-o ecoli_pseudo --feature_plotstyle arrow --link_track_ratio 3.0 --pseudo

result

kneubehl commented 1 year ago

Thanks for adding the --pseudo option! I looked into those genes that are different between the nucleotide and protein (issue 1 in the previous message). Those proteins on baBA2lp27 are 88% similar to a protein on bhDAHlp59 but they don't show up on the visualization or in the align_coords.csv. Do you think this is a problem with mummer? I can send you the genbank files if you want take a look.

moshi4 commented 1 year ago

Do you think this is a problem with mummer?

Without seeing the data, it is difficult to determine the correct cause, but I think there are two possibilities.

  1. Simply not finding a matching sequence region
    The sequence comparison by MUMmer(promer) 6 reading frame translation may not have been done properly in the target sequence region, and the link may not have been output.

  2. Target sequence region is excluded in the filtering stage
    MUMmer provides a delta-filter tool to filter redundant comparison results with multiple overlaps, which is also used in pyGenomeViz. This delta-filter may be unexpectedly excluding target sequence regions.