Closed kneubehl closed 1 year ago
Hi, @kneubehl
pgv-mummer
(and pgv-mmseqs
) only plots CDS features that have a translation qualifier in the Genbank file.
Therefore, as you pointed out, CDS features of pseudogenes without translation qualifiers are not plotted.
I will consider how to handle the plotting of pseudogenes. Please wait a moment.
Pseudogene of the CDS feature is excluded in the extract_features()
method of the Genbank
parser class.
If MUMmer is using protein sequences it will be kinda messy to use the pseudogene translations. I was initially operating under the assumption that the pgv-mummer was using nucleotide sequence. But I am noticing 2 issues (1) differences between the nucleotide and protein-based results and (2) linkages between sequences that do not match up with a CDS (there is a pseudogene there so no arrow but still a linkage). (1) Seems to be at the edges of the genbank sequence. I included in the example picture the command I used and indicated where I see the two issues. This is an example of how the 5' and 3' terminal CDSs in the top sequence should have links between both sequences but there are no links in the protein default seqtype but there is when nucleotide is used as the seqtype option as well as linkages between sequences to areas where there are no CDSs plotted.
Sorry for the delay in answering.
(1) differences between nucleotide and protein-based results
MUMMer uses the nucmer
program for nucleotide-level sequence comparison and the promer
program for protein-level (6 reading frame translation) comparison. Since nucmer
and promer
use different algorithms for comparison, there are naturally some differences in comparison results.
This problem comes from MUMmer, so it is honestly difficult for me to solve the problem. It may be possible to reduce the differences in comparison results by adjusting the detailed parameters of MUMmer, but it is a bit troublesome to implement such a function.
(2) linkages between sequences that do not match up with a CDS (there is a pseudogene there so no arrow but still a linkage)
The newly released pyGenomeViz v0.3.1 solves this problem by using the --pseudo
option. It is also possible to change the color with the --pseudo_color
option.
Below is an example of comparison visualization with and without pseudogene in E.coli using pgv-mummer.
1. Disable --pseudo option
pgv-download-dataset -n escherichia_coli
pgv-mummer --gbk_resources NC_000913.gbk:235000-265000 NC_002695.gbk:270000-300000 \
-o ecoli_no_pseudo --feature_plotstyle arrow --link_track_ratio 3.0
2. Enable --pseudo option
pgv-download-dataset -n escherichia_coli
pgv-mummer --gbk_resources NC_000913.gbk:235000-265000 NC_002695.gbk:270000-300000 \
-o ecoli_pseudo --feature_plotstyle arrow --link_track_ratio 3.0 --pseudo
Thanks for adding the --pseudo option! I looked into those genes that are different between the nucleotide and protein (issue 1 in the previous message). Those proteins on baBA2lp27 are 88% similar to a protein on bhDAHlp59 but they don't show up on the visualization or in the align_coords.csv. Do you think this is a problem with mummer? I can send you the genbank files if you want take a look.
Do you think this is a problem with mummer?
Without seeing the data, it is difficult to determine the correct cause, but I think there are two possibilities.
Simply not finding a matching sequence region
The sequence comparison by MUMmer(promer) 6 reading frame translation may not have been done properly in the target sequence region, and the link may not have been output.
Target sequence region is excluded in the filtering stage
MUMmer provides a delta-filter
tool to filter redundant comparison results with multiple overlaps, which is also used in pyGenomeViz. This delta-filter
may be unexpectedly excluding target sequence regions.
Hello,
Awesome tool! I am using pgv-mummer v0.3.0 and I am noticing that the pseudogenes are not being plotted with arrows. The alignments for those regions are being plotted but there are no arrows present to represent where that "CDS" would be. Not sure if there is a setting I need to change but I suspect this would be with how the genbank files are read in and handled. Any guidance would be appreciated. Thanks!
-Alex