moshi4 / pyCirclize

Circular visualization in Python (Circos Plot, Chord Diagram, Radar Chart)
https://moshi4.github.io/pyCirclize/
MIT License
762 stars 48 forks source link

Customize track genomic features #17

Closed LFelipe-B closed 1 year ago

LFelipe-B commented 1 year ago

Thank you very much for this very promising tool! I'm managing to plot the examples, and also use gffs downloaded from genbank so far! I would like to know if it would be possible and how to include a custom track, from another content file (for instance a .txt or .tsv file), for example, to show best blast hits for each CDS with other organisms (one color line for bacteria, another color line for virus eg). or would I need to edit the gff directly to do that? Thank you very much

moshi4 commented 1 year ago

Since pyCirclize is a flexible circular visualization tool, I think what you want to do would be possible. As for how to do it, I have no knowledge of the data content you are assuming, so it is difficult for me to answer your question.

If you could ask more specific questions with simplified data examples and output assumptions diagrams, I may be able to answer.

LFelipe-B commented 1 year ago

Hi moshi4, thanks for your quick reply. Sorry, I should have posted one example at least. Since I am not sure how the actual input for this would look like, although I guess it could be something as a bed file with a custom feature according to the best hit (eukarya, bacteria and so on). I was thinking in something as here: https://moshi4.github.io/pyCirclize/circos_plot/ the example "2. Escherichia coli", having a track with each colored line according to the best hit taxa (for ie. green eukarya, purple bacteria and so on) mapped to the coordinate of each CDS. Sorry if I cannot be more specific, but thank you anyway!

moshi4 commented 1 year ago

If you are able to link the protein_id in GFF's CDS to the best hit information for each taxon, then a plot like the example below is possible. Sorry if this is not what you are assuming.

from pycirclize import Circos
from pycirclize.parser import Gff
from pycirclize.utils import load_prokaryote_example_file
from matplotlib.patches import Patch

# Load GFF file
# https://github.com/moshi4/pycirclize-data/blob/main/prokaryote/enterobacteria_phage.gff
gff_file = load_prokaryote_example_file("enterobacteria_phage.gff")
gff = Gff(gff_file)

# Setup circos sector & track
circos = Circos(sectors={gff.name: gff.range_size})
sector = circos.sectors[0]
cds_track = sector.add_track((90, 100))
cds_track.axis(fc="#EEEEEE", ec="none")

# Best hit `protein_id` dummy results
eukaryote_best_hits = ["NP_050510.1", "NP_050507.1", "NP_050518.1", "NP_050526.1", "NP_050550.1", "NP_050579.1", "NP_050532.1", "NP_050503.1", "NP_050561.1", "NP_050574.1"]
prokaryote_best_hits = ["NP_050501.1", "NP_050546.1", "NP_050569.1", "NP_050506.1", "NP_050578.1", "NP_050544.1", "NP_050558.1", "NP_050547.1", "NP_050540.1", "NP_050534.1"]

for feature in gff.extract_features("CDS"):
    # Assign color based on best hit `protein_id` dummy results
    protein_id = feature.qualifiers.get("protein_id", [None])[0]
    if protein_id in eukaryote_best_hits:
        facecolor = "green"
    elif protein_id in prokaryote_best_hits:
        facecolor = "purple"
    else:
        facecolor = "lightgrey"
    # Plot CDS
    r_lim = (95, 100) if feature.strand == 1 else (90, 95)
    cds_track.genomic_features([feature], plotstyle="arrow", r_lim=r_lim, fc=facecolor, lw=0.5)

# Save figure
fig = circos.plotfig()
_ = circos.ax.legend(
    handles=[
        Patch(color="green", label="Eukaryote Best Hit"),
        Patch(color="purple", label="Prokaryote Best Hit"),
    ],
    bbox_to_anchor=(0.5, 0.5),
    loc="center",
)
fig.savefig("best_hit_example.png")

best_hit_example.png best_hit_example

LFelipe-B commented 1 year ago

Thank you looks great! I tried to customize a plot with your example code mixed with your codes from other examples. It's almost there. Thank you so much again! Here is my code:

# Load GFF file
from pycirclize import Circos
from pycirclize.utils import load_prokaryote_example_file
from pycirclize.parser import Gff
import matplotlib # to plot figures
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

gff_file = load_prokaryote_example_file("enterobacteria_phage.gff")
gff = Gff(gff_file)

circos = Circos(sectors={gff.name: gff.range_size})
circos.text("Enterobacteria phage\n(NC_000902)", size=15)
sector = circos.sectors[0]

outer_track = sector.add_track((99.6, 100)) # thickness of outer ring "99.7"
outer_track.axis(fc="black")

outer_track.xticks_by_interval(5000, label_formatter=lambda v: f"{v / 1000:.0f} Kb")
outer_track.xticks_by_interval(1000, tick_length=1, show_label=False)

f_cds_track = sector.add_track((88, 98), r_pad_ratio=0.1) # size of line changed to 98 before was 95
f_cds_track.genomic_features(gff.extract_features("CDS", target_strand=1),  fc="#579bc4")

r_cds_track = sector.add_track((78, 88), r_pad_ratio=0.1)
r_cds_track.genomic_features(gff.extract_features("CDS", target_strand=-1), fc="red")

cds_track = sector.add_track((68,78))
cds_track.axis(fc="white", ec="none")

eukaryote_best_hits = ["NP_050510.1", "NP_050507.1", "NP_050518.1", "NP_050526.1", "NP_050550.1", "NP_050579.1", "NP_050532.1", "NP_050503.1", "NP_050561.1", "NP_050574.1"]
prokaryote_best_hits = ["NP_050501.1", "NP_050546.1", "NP_050569.1", "NP_050506.1", "NP_050578.1", "NP_050544.1", "NP_050558.1", "NP_050547.1", "NP_050540.1", "NP_050534.1"]

for feature in gff.extract_features("CDS"):
    # Assign color based on best hit `protein_id` dummy results
    protein_id = feature.qualifiers.get("protein_id", [None])[0]
    if protein_id in eukaryote_best_hits:
        facecolor = "green"
    elif protein_id in prokaryote_best_hits:
        facecolor = "purple"
    else:
        facecolor = "lightgrey"

    r_lim = (68,78) if feature.strand == 1 else (68,78)
    cds_track.genomic_features([feature], r_lim=r_lim, fc=facecolor)

fig = circos.plotfig()
_ = circos.ax.legend(
    handles=[
        Patch(color="green", label="Eukaryote Best Hit"),
        Patch(color="purple", label="Prokaryote Best Hit"),
        Patch(color="gray", label="Virus Best Hit"),
    ],
    bbox_to_anchor=(0.5, 0.5),
    loc="lower left",
)
fig.savefig("best_hit_test_example.png")

best_hit_test_example

LFelipe-B commented 1 year ago

Thanks again, I am able to customize my own gff file, thanks for the fast response! However, so far I am using closed genomes from genbank, and my own data is made of MAGs (so basically bins with several contigs per bin). I tried to adapt some of your code, using the "sectors" command however without success. Is it possible to plot different sectors with different scaffolds and perform the previous code per scaffold/sector to plot forward, reverse and best hits? Should I open another issue or can we follow up here?

this is one portion of the my real gff:

scaffold_14990_c1   Prodigal_v2.6.3 CDS 3   170 -1.2    -   0   ID=101_1;partial=10;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.232;conf=50.28;score=0.05;cscore=-4.16;sscore=4.21;rscore=5.34;uscore=-1.29;tscore=3.25;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 163 291 1.7 -   0   ID=101_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=6bp;gc_cont=0.171;conf=55.78;score=1.01;cscore=-1.58;sscore=2.59;rscore=1.28;uscore=0.83;tscore=1.64;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 694 984 7.3 -   0   ID=101_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.230;conf=82.09;score=6.62;cscore=5.92;sscore=0.70;rscore=-3.80;uscore=1.90;tscore=3.25;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 1421    1879    4.1 -   0   ID=101_4;partial=00;start_type=TTG;rbs_motif=TAA;rbs_spacer=8bp;gc_cont=0.187;conf=72.11;score=4.13;cscore=6.71;sscore=-2.58;rscore=2.54;uscore=-1.04;tscore=-4.08;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 2792    2980    8.3 -   0   ID=101_5;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.339;conf=89.21;score=9.19;cscore=-0.70;sscore=9.89;rscore=3.97;uscore=3.13;tscore=2.42;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 3018    3404    60.2    -   0   ID=101_6;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=14bp;gc_cont=0.266;conf=100.00;score=59.55;cscore=53.15;sscore=6.39;rscore=0.25;uscore=3.54;tscore=3.25;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 4205    4294    3.4 +   0   ID=101_7;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.144;conf=68.62;score=3.40;cscore=0.60;sscore=2.80;rscore=1.16;uscore=0.51;tscore=1.13;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 4284    4397    5.5 +   0   ID=101_8;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.149;conf=82.39;score=6.71;cscore=1.92;sscore=4.79;rscore=1.49;uscore=0.65;tscore=1.44;
scaffold_14990_c1   Prodigal_v2.6.3 CDS 4387    4464    6.7 +   0   ID=101_9;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.141;conf=86.14;score=7.95;cscore=0.28;sscore=7.66;rscore=3.35;uscore=-0.15;tscore=3.25;
scaffold_4679_c1    Prodigal_v2.6.3 CDS 759 1394    62.1    -   0   ID=102_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.256;conf=100.00;score=61.44;cscore=59.55;sscore=1.88;rscore=-4.87;uscore=3.94;tscore=3.47;
scaffold_4679_c1    Prodigal_v2.6.3 CDS 1805    2611    94.1    -   0   ID=102_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.318;conf=100.00;score=93.45;cscore=85.77;sscore=7.68;rscore=1.85;uscore=3.02;tscore=3.47;
scaffold_4679_c1    Prodigal_v2.6.3 CDS 3252    3431    0.9 +   0   ID=102_3;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.261;conf=55.26;score=0.92;cscore=-2.02;sscore=2.94;rscore=6.07;uscore=-2.25;tscore=3.47;
scaffold_23691_c1   Prodigal_v2.6.3 CDS 1009    1407    36.1    -   0   ID=103_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.341;conf=99.99;score=38.47;cscore=34.28;sscore=4.19;rscore=-0.99;uscore=-0.10;tscore=2.90;
scaffold_23691_c1   Prodigal_v2.6.3 CDS 1404    2333    107.0   -   0   ID=103_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.349;conf=100.00;score=108.24;cscore=105.82;sscore=2.42;rscore=-0.99;uscore=-0.75;tscore=2.90;
scaffold_23691_c1   Prodigal_v2.6.3 CDS 2335    2802    41.3    -   0   ID=103_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.310;conf=99.99;score=40.66;cscore=44.45;sscore=-3.79;rscore=-0.99;uscore=-5.04;tscore=2.90;
scaffold_23691_c1   Prodigal_v2.6.3 CDS 3006    3239    30.3    -   0   ID=103_4;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.359;conf=99.91;score=30.35;cscore=27.13;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
scaffold_48783_c1   Prodigal_v2.6.3 CDS 53  1150    116.9   -   0   ID=104_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.389;conf=100.00;score=116.29;cscore=110.82;sscore=5.47;rscore=-1.19;uscore=3.10;tscore=4.21;
scaffold_48783_c1   Prodigal_v2.6.3 CDS 1450    1875    25.8    +   0   ID=104_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.275;conf=99.73;score=25.79;cscore=20.07;sscore=5.72;rscore=-1.19;uscore=2.70;tscore=4.21;
scaffold_61140_c1   Prodigal_v2.6.3 CDS 2   700 83.2    -   0   ID=105_1;partial=10;start_type=ATG;rbs_motif=AATAA;rbs_spacer=14bp;gc_cont=0.328;conf=100.00;score=82.56;cscore=73.67;sscore=8.89;rscore=6.54;uscore=-0.46;tscore=3.47;
scaffold_61140_c1   Prodigal_v2.6.3 CDS 1172    1585    1.1 +   0   ID=105_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.208;conf=56.57;score=1.15;cscore=6.44;sscore=-5.29;rscore=-4.87;uscore=-3.89;tscore=3.47;

Thank you again!

moshi4 commented 1 year ago

Perhaps you want to plot CDS, rRNA, tRNA, etc. by assigning contigs/scaffolds to each sector as shown in the figure below?

contigs_plot

So far, I have not been able to provide API functionality for users to easily perform this type of plot from Gff/Genbank files. I have been aware for some time that there would be user demand for this functionality, so I am going to take this opportunity to implement new API functionality. If I can implement it soon, I will release pyCirclize v0.3.1 with the new functionality at the end of this week. Please wait a little.

moshi4 commented 1 year ago

I have newly released pyCirclize v0.3.1.

Please refer to the documentation link below for details on how to plot genomic features of each contig/scaffold in the Gff/Genbank file. https://moshi4.github.io/pyCirclize/circos_plot/#1-3-mycoplasma-alvi