Closed LFelipe-B closed 1 year ago
Since pyCirclize is a flexible circular visualization tool, I think what you want to do would be possible. As for how to do it, I have no knowledge of the data content you are assuming, so it is difficult for me to answer your question.
If you could ask more specific questions with simplified data examples and output assumptions diagrams, I may be able to answer.
Hi moshi4, thanks for your quick reply. Sorry, I should have posted one example at least. Since I am not sure how the actual input for this would look like, although I guess it could be something as a bed file with a custom feature according to the best hit (eukarya, bacteria and so on). I was thinking in something as here: https://moshi4.github.io/pyCirclize/circos_plot/ the example "2. Escherichia coli", having a track with each colored line according to the best hit taxa (for ie. green eukarya, purple bacteria and so on) mapped to the coordinate of each CDS. Sorry if I cannot be more specific, but thank you anyway!
If you are able to link the protein_id in GFF's CDS to the best hit information for each taxon, then a plot like the example below is possible. Sorry if this is not what you are assuming.
from pycirclize import Circos
from pycirclize.parser import Gff
from pycirclize.utils import load_prokaryote_example_file
from matplotlib.patches import Patch
# Load GFF file
# https://github.com/moshi4/pycirclize-data/blob/main/prokaryote/enterobacteria_phage.gff
gff_file = load_prokaryote_example_file("enterobacteria_phage.gff")
gff = Gff(gff_file)
# Setup circos sector & track
circos = Circos(sectors={gff.name: gff.range_size})
sector = circos.sectors[0]
cds_track = sector.add_track((90, 100))
cds_track.axis(fc="#EEEEEE", ec="none")
# Best hit `protein_id` dummy results
eukaryote_best_hits = ["NP_050510.1", "NP_050507.1", "NP_050518.1", "NP_050526.1", "NP_050550.1", "NP_050579.1", "NP_050532.1", "NP_050503.1", "NP_050561.1", "NP_050574.1"]
prokaryote_best_hits = ["NP_050501.1", "NP_050546.1", "NP_050569.1", "NP_050506.1", "NP_050578.1", "NP_050544.1", "NP_050558.1", "NP_050547.1", "NP_050540.1", "NP_050534.1"]
for feature in gff.extract_features("CDS"):
# Assign color based on best hit `protein_id` dummy results
protein_id = feature.qualifiers.get("protein_id", [None])[0]
if protein_id in eukaryote_best_hits:
facecolor = "green"
elif protein_id in prokaryote_best_hits:
facecolor = "purple"
else:
facecolor = "lightgrey"
# Plot CDS
r_lim = (95, 100) if feature.strand == 1 else (90, 95)
cds_track.genomic_features([feature], plotstyle="arrow", r_lim=r_lim, fc=facecolor, lw=0.5)
# Save figure
fig = circos.plotfig()
_ = circos.ax.legend(
handles=[
Patch(color="green", label="Eukaryote Best Hit"),
Patch(color="purple", label="Prokaryote Best Hit"),
],
bbox_to_anchor=(0.5, 0.5),
loc="center",
)
fig.savefig("best_hit_example.png")
best_hit_example.png
Thank you looks great! I tried to customize a plot with your example code mixed with your codes from other examples. It's almost there. Thank you so much again! Here is my code:
# Load GFF file
from pycirclize import Circos
from pycirclize.utils import load_prokaryote_example_file
from pycirclize.parser import Gff
import matplotlib # to plot figures
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
gff_file = load_prokaryote_example_file("enterobacteria_phage.gff")
gff = Gff(gff_file)
circos = Circos(sectors={gff.name: gff.range_size})
circos.text("Enterobacteria phage\n(NC_000902)", size=15)
sector = circos.sectors[0]
outer_track = sector.add_track((99.6, 100)) # thickness of outer ring "99.7"
outer_track.axis(fc="black")
outer_track.xticks_by_interval(5000, label_formatter=lambda v: f"{v / 1000:.0f} Kb")
outer_track.xticks_by_interval(1000, tick_length=1, show_label=False)
f_cds_track = sector.add_track((88, 98), r_pad_ratio=0.1) # size of line changed to 98 before was 95
f_cds_track.genomic_features(gff.extract_features("CDS", target_strand=1), fc="#579bc4")
r_cds_track = sector.add_track((78, 88), r_pad_ratio=0.1)
r_cds_track.genomic_features(gff.extract_features("CDS", target_strand=-1), fc="red")
cds_track = sector.add_track((68,78))
cds_track.axis(fc="white", ec="none")
eukaryote_best_hits = ["NP_050510.1", "NP_050507.1", "NP_050518.1", "NP_050526.1", "NP_050550.1", "NP_050579.1", "NP_050532.1", "NP_050503.1", "NP_050561.1", "NP_050574.1"]
prokaryote_best_hits = ["NP_050501.1", "NP_050546.1", "NP_050569.1", "NP_050506.1", "NP_050578.1", "NP_050544.1", "NP_050558.1", "NP_050547.1", "NP_050540.1", "NP_050534.1"]
for feature in gff.extract_features("CDS"):
# Assign color based on best hit `protein_id` dummy results
protein_id = feature.qualifiers.get("protein_id", [None])[0]
if protein_id in eukaryote_best_hits:
facecolor = "green"
elif protein_id in prokaryote_best_hits:
facecolor = "purple"
else:
facecolor = "lightgrey"
r_lim = (68,78) if feature.strand == 1 else (68,78)
cds_track.genomic_features([feature], r_lim=r_lim, fc=facecolor)
fig = circos.plotfig()
_ = circos.ax.legend(
handles=[
Patch(color="green", label="Eukaryote Best Hit"),
Patch(color="purple", label="Prokaryote Best Hit"),
Patch(color="gray", label="Virus Best Hit"),
],
bbox_to_anchor=(0.5, 0.5),
loc="lower left",
)
fig.savefig("best_hit_test_example.png")
Thanks again, I am able to customize my own gff file, thanks for the fast response! However, so far I am using closed genomes from genbank, and my own data is made of MAGs (so basically bins with several contigs per bin). I tried to adapt some of your code, using the "sectors" command however without success. Is it possible to plot different sectors with different scaffolds and perform the previous code per scaffold/sector to plot forward, reverse and best hits? Should I open another issue or can we follow up here?
this is one portion of the my real gff:
scaffold_14990_c1 Prodigal_v2.6.3 CDS 3 170 -1.2 - 0 ID=101_1;partial=10;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.232;conf=50.28;score=0.05;cscore=-4.16;sscore=4.21;rscore=5.34;uscore=-1.29;tscore=3.25;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 163 291 1.7 - 0 ID=101_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=6bp;gc_cont=0.171;conf=55.78;score=1.01;cscore=-1.58;sscore=2.59;rscore=1.28;uscore=0.83;tscore=1.64;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 694 984 7.3 - 0 ID=101_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.230;conf=82.09;score=6.62;cscore=5.92;sscore=0.70;rscore=-3.80;uscore=1.90;tscore=3.25;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 1421 1879 4.1 - 0 ID=101_4;partial=00;start_type=TTG;rbs_motif=TAA;rbs_spacer=8bp;gc_cont=0.187;conf=72.11;score=4.13;cscore=6.71;sscore=-2.58;rscore=2.54;uscore=-1.04;tscore=-4.08;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 2792 2980 8.3 - 0 ID=101_5;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.339;conf=89.21;score=9.19;cscore=-0.70;sscore=9.89;rscore=3.97;uscore=3.13;tscore=2.42;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 3018 3404 60.2 - 0 ID=101_6;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=14bp;gc_cont=0.266;conf=100.00;score=59.55;cscore=53.15;sscore=6.39;rscore=0.25;uscore=3.54;tscore=3.25;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 4205 4294 3.4 + 0 ID=101_7;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.144;conf=68.62;score=3.40;cscore=0.60;sscore=2.80;rscore=1.16;uscore=0.51;tscore=1.13;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 4284 4397 5.5 + 0 ID=101_8;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.149;conf=82.39;score=6.71;cscore=1.92;sscore=4.79;rscore=1.49;uscore=0.65;tscore=1.44;
scaffold_14990_c1 Prodigal_v2.6.3 CDS 4387 4464 6.7 + 0 ID=101_9;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.141;conf=86.14;score=7.95;cscore=0.28;sscore=7.66;rscore=3.35;uscore=-0.15;tscore=3.25;
scaffold_4679_c1 Prodigal_v2.6.3 CDS 759 1394 62.1 - 0 ID=102_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.256;conf=100.00;score=61.44;cscore=59.55;sscore=1.88;rscore=-4.87;uscore=3.94;tscore=3.47;
scaffold_4679_c1 Prodigal_v2.6.3 CDS 1805 2611 94.1 - 0 ID=102_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.318;conf=100.00;score=93.45;cscore=85.77;sscore=7.68;rscore=1.85;uscore=3.02;tscore=3.47;
scaffold_4679_c1 Prodigal_v2.6.3 CDS 3252 3431 0.9 + 0 ID=102_3;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.261;conf=55.26;score=0.92;cscore=-2.02;sscore=2.94;rscore=6.07;uscore=-2.25;tscore=3.47;
scaffold_23691_c1 Prodigal_v2.6.3 CDS 1009 1407 36.1 - 0 ID=103_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.341;conf=99.99;score=38.47;cscore=34.28;sscore=4.19;rscore=-0.99;uscore=-0.10;tscore=2.90;
scaffold_23691_c1 Prodigal_v2.6.3 CDS 1404 2333 107.0 - 0 ID=103_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.349;conf=100.00;score=108.24;cscore=105.82;sscore=2.42;rscore=-0.99;uscore=-0.75;tscore=2.90;
scaffold_23691_c1 Prodigal_v2.6.3 CDS 2335 2802 41.3 - 0 ID=103_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.310;conf=99.99;score=40.66;cscore=44.45;sscore=-3.79;rscore=-0.99;uscore=-5.04;tscore=2.90;
scaffold_23691_c1 Prodigal_v2.6.3 CDS 3006 3239 30.3 - 0 ID=103_4;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.359;conf=99.91;score=30.35;cscore=27.13;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
scaffold_48783_c1 Prodigal_v2.6.3 CDS 53 1150 116.9 - 0 ID=104_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.389;conf=100.00;score=116.29;cscore=110.82;sscore=5.47;rscore=-1.19;uscore=3.10;tscore=4.21;
scaffold_48783_c1 Prodigal_v2.6.3 CDS 1450 1875 25.8 + 0 ID=104_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.275;conf=99.73;score=25.79;cscore=20.07;sscore=5.72;rscore=-1.19;uscore=2.70;tscore=4.21;
scaffold_61140_c1 Prodigal_v2.6.3 CDS 2 700 83.2 - 0 ID=105_1;partial=10;start_type=ATG;rbs_motif=AATAA;rbs_spacer=14bp;gc_cont=0.328;conf=100.00;score=82.56;cscore=73.67;sscore=8.89;rscore=6.54;uscore=-0.46;tscore=3.47;
scaffold_61140_c1 Prodigal_v2.6.3 CDS 1172 1585 1.1 + 0 ID=105_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.208;conf=56.57;score=1.15;cscore=6.44;sscore=-5.29;rscore=-4.87;uscore=-3.89;tscore=3.47;
Thank you again!
Perhaps you want to plot CDS, rRNA, tRNA, etc. by assigning contigs/scaffolds to each sector as shown in the figure below?
So far, I have not been able to provide API functionality for users to easily perform this type of plot from Gff/Genbank files. I have been aware for some time that there would be user demand for this functionality, so I am going to take this opportunity to implement new API functionality. If I can implement it soon, I will release pyCirclize v0.3.1 with the new functionality at the end of this week. Please wait a little.
I have newly released pyCirclize v0.3.1.
Please refer to the documentation link below for details on how to plot genomic features of each contig/scaffold in the Gff/Genbank file. https://moshi4.github.io/pyCirclize/circos_plot/#1-3-mycoplasma-alvi
Thank you very much for this very promising tool! I'm managing to plot the examples, and also use gffs downloaded from genbank so far! I would like to know if it would be possible and how to include a custom track, from another content file (for instance a .txt or .tsv file), for example, to show best blast hits for each CDS with other organisms (one color line for bacteria, another color line for virus eg). or would I need to edit the gff directly to do that? Thank you very much