moshi4 / pyGenomeViz

A genome visualization python package for comparative genomics
https://moshi4.github.io/pyGenomeViz
MIT License
271 stars 22 forks source link

Python API changing plot coordinates #32

Closed mcn3159 closed 7 months ago

mcn3159 commented 7 months ago

Hi,

Thanks for writing such a useful for comparative genomics.

I'm running into a strange issue with plotting a range in a gbk file, despite given coordinates to plot, the figure shows genes on a completely different position on the chromosome than the ones given through the dictionary (this happens specifically for strain S107-48). I can't seem to figure out why this is happening, any suggestions? I attached the gbks I'm using for this along with the code below.

gbks.zip

track_size = 40000
location_of_hits={'S47-18': [2583778, 2585941, -1],
 'S107-48': [982415, 984578, 1],
 'RG1': [78420, 80541, -1],
 'S107-86': [1066248, 1068411, -1]}

gv = GenomeViz(
    fig_track_height=0.7,
    link_track_ratio=2.0,
    align_type="center",
    tick_style="bar",
)

gbk_list = [Genbank(f,min_range=location_of_hits[f.split('/')[-1].split('.gbk')[0]][0]-(track_size/2),
                    max_range=location_of_hits[f.split('/')[-1].split('.gbk')[0]][1]+(track_size/2)) 
            for f in gbk_paths]
for gbk in gbk_list:
    track = gv.add_feature_track(gbk.name,size=gbk.range_size,start_pos=gbk.min_range)
    track.add_genbank_features(gbk, label_type="product" if gbk.name==gbk_paths[0].split('/')[-1].split('.gbk')[0] else None,
                               label_handle_func=lambda s : "" if s.startswith("hypothetical") else s,
                               labelvpos="top", labelsize=10,
                               facecolor="skyblue", linewidth=0.5)

with TemporaryDirectory() as tmpdir:
    # Run MMseqs
    align_coords = MMseqs(gbk_list, tmpdir, quiet=True).run()
    print(align_coords)
    # Filter alignment results by 'min_length', 'min_identity' threshold
    align_coords = AlignCoord.filter(align_coords, min_length=0, min_identity=0.9)
    print("After filter", len(align_coords))

min_identity = int(min([ac.identity for ac in align_coords]))

for ac in align_coords:
    link1 = (ac.ref_name, ac.ref_start, ac.ref_end)
    link2 = (ac.query_name, ac.query_start, ac.query_end)
    gv.add_link(link1, link2, curve=True, v=ac.identity, vmin=min_identity)

fig = gv.plotfig()

#gv.set_colorbar(fig, vmin=min_identity)

# Add Legends (Maybe there is a better way)
handles = [
    Line2D([], [], marker=">", color="skyblue", label="CDS", ms=10, ls="none"),
    Patch(color="grey", label="Normal Link"),
    Patch(color="red", label="Inverted Link"),
]
legend = fig.legend(handles=handles, bbox_to_anchor=(1, 1))

Which should output

image

phyto commented 7 months ago

I have not tried to run your code, but S107-48 stand out by being the only one with strand 1 whereas others are strand -1 in your dict location_of_hits. Also in your code you do not seem to use that information (i.e. I do not see in your code sthg like location_of_hits[gbk_short][2].

moshi4 commented 7 months ago

Hi @mcn3159,

Looking at the code and result figure, I don't really understand what is strange. On what data basis did you think pyGenomeViz was plotting completely different chromosome positions? If you can specifically indicate the apparent discrepancies between your assumed results and the actual results, I might be able to suggest some advice.

mcn3159 commented 7 months ago

Hi thanks for getting back to me so soon!

@phyto The problem might be with the lack of strand information given. Is there a way to include that in add_feature_track or add_genbank_features? I didn't see it in the documentation, but I may have missed that.

Hopefully the following indicates where the discrepancies are. I printed the the protein product names I should be seeing in the range of the gbk file given, which is different than what I'm seeing in the figure.

print(gbk_list[0])
for record in SeqIO.parse('mmseq_query_acetylhexosamine/gbk_2_12_24_analyses/S107-48.gbk','genbank'):
  for feature in record.features:
    if (feature.type == 'CDS') and (feature.location.start > int(gbk_list[0].min_range)) and (feature.location.end < int(gbk_list[0].min_range)+int(gbk_list[0].range_size)):
      print(feature.qualifiers.get('product'))

Which outputs

S107-48 (962,415.0 - 1,004,578.0 bp)
['acyltransferase family protein']
['VanZ family protein']
['ABC transporter ATP-binding protein']
['ABC transporter permease']
['cytidylate kinase-like family protein']
['FprA family A-type flavoprotein']
['lantibiotic protection ABC transporter ATP-binding protein']
['lantibiotic immunity ABC transporter MutE/EpiE family permease subunit']
['lantibiotic immunity ABC transporter MutG family permease subunit']
['response regulator transcription factor']
['HAMP domain-containing histidine kinase']
['arsenate reductase family protein']
['ABC-F type ribosomal protection protein']
['ABC transporter ATP-binding protein']
['ABC transporter ATP-binding protein']
['DUF4097 family beta strand repeat protein']
['DUF1700 domain-containing protein']
['PadR family transcriptional regulator']
['response regulator transcription factor']
['sensor histidine kinase']
['1,3-beta-galactosyl-N-acetylhexosamine phosphorylase']
['helix-turn-helix transcriptional regulator']
['LacI family DNA-binding transcriptional regulator']
['sugar ABC transporter permease']
['carbohydrate ABC transporter permease']
['extracellular solute-binding protein']
['glycoside hydrolase family 32 protein']
['chromate transporter']
['chromate transporter']
['Gfo/Idh/MocA family oxidoreductase']
['hypothetical protein']
['dipicolinate synthase subunit B']
['asparagine synthase (glutamine-hydrolyzing)']
['hydroxyethylthiazole kinase']
['thiamine phosphate synthase']
['HAD family phosphatase']
['bifunctional hydroxymethylpyrimidine kinase/phosphomethylpyrimidine kinase']
['Na+/H+ antiporter NhaC family protein']
['C39 family peptidase']
['aspartate--ammonia ligase']
mcn3159 commented 7 months ago

Is there a way to specify what contig/record to use in the Genbank class? I think this would solve my problem

moshi4 commented 7 months ago

There is probably no good solution to your problem in the pyGenomeViz API. It is best to remove non-target records in the Genbank file directly.

mcn3159 commented 7 months ago

Thanks, here's the workaround I found to filter for a specific contig in the pyGenomeViz gbk object.

contig_with_gene_of_interest = 'ctgS1000000F'
records_to_keep = list(filter(lambda x: x.id == contig_with_gene_of_interest,gbk.records)) # filter records in gbk object
gbk._records = records_to_keep
peradastra commented 6 months ago

Also ran into this issue, thanks @mcn3159 for the solution.