pycogent / pycogent

PyCogent: Official repository for software and unit tests
http://www.pycogent.org/

cogent ensembl, exon.Location/transcript.Location represents wrong data #90

Closed vladie0 closed 3 years ago

vladie0 commented 9 years ago

Dear,

While using your cogent plugin to fetch data from the Ensembl database I encountered an error. When retrieving exon locations through exon.Location, the locations fetched do not correspond to the exon coordinates on Ensembl (this occurred using Ensembl version 81).

Does anyone have a solution for this problem? Thanks in advance.

vladie0 commented 9 years ago

Upon further investigation I noticed that the same error occurs when retrieving the transcript location through transcript.Location. I also think I found the source of the problem: the source code appears to subtract the transcript's 5' UTR length from the start and end positions of exons/transcripts, which results in wrongly represented start/stop positions. I'll start looking into the source code for a fix.

GavinHuttley commented 9 years ago

Please provide an example, with the coordinates shown on Ensembl's web page.

On 9 Sep 2015, at 2:15 am, vladie0 notifications@github.com wrote:

Upon further investigation I noticed that the same error occurs when retrieving the transcript location through transcript.Location. I also found the source of the problem: the source code automatically subtracts the transcript's 5' UTR length from the start and end positions, which results in wrongly represented start/stop positions. I'll start looking into the source code for a fix

vladie0 commented 9 years ago

Below are the transcript and exon locations as fetched by PyCogent (Ensembl version 81). At the end of this post there are two links to the Ensembl web pages for these transcripts and their exons.

ENST00000374316 Homo sapiens:chromosome:6:33620364-33696574:1
Exon(StableId=ENSE00003696763, Rank=1) Homo sapiens:chromosome:6:33620364-33620436:1
Exon(StableId=ENSE00001463444, Rank=2) Homo sapiens:chromosome:6:33620614-33621691:1
Exon(StableId=ENSE00001625407, Rank=3) Homo sapiens:chromosome:6:33640483-33640554:1
Exon(StableId=ENSE00000741325, Rank=4) Homo sapiens:chromosome:6:33655765-33655887:1
Exon(StableId=ENSE00000741289, Rank=5) Homo sapiens:chromosome:6:33657931-33658018:1
Exon(StableId=ENSE00000741249, Rank=6) Homo sapiens:chromosome:6:33658669-33658828:1
Exon(StableId=ENSE00000849403, Rank=7) Homo sapiens:chromosome:6:33659020-33659119:1
Exon(StableId=ENSE00000741171, Rank=8) Homo sapiens:chromosome:6:33659465-33659549:1
Exon(StableId=ENSE00000849405, Rank=9) Homo sapiens:chromosome:6:33662527-33662674:1
Exon(StableId=ENSE00000741076, Rank=10) Homo sapiens:chromosome:6:33662910-33663006:1
Exon(StableId=ENSE00000741032, Rank=11) Homo sapiens:chromosome:6:33663499-33663550:1
Exon(StableId=ENSE00000740961, Rank=12) Homo sapiens:chromosome:6:33663737-33663880:1
Exon(StableId=ENSE00000849409, Rank=13) Homo sapiens:chromosome:6:33664869-33664969:1
Exon(StableId=ENSE00000740872, Rank=14) Homo sapiens:chromosome:6:33665052-33665213:1
Exon(StableId=ENSE00000849411, Rank=15) Homo sapiens:chromosome:6:33665834-33665976:1
Exon(StableId=ENSE00000740768, Rank=16) Homo sapiens:chromosome:6:33667128-33667290:1
Exon(StableId=ENSE00001902953, Rank=17) Homo sapiens:chromosome:6:33667791-33667964:1
Exon(StableId=ENSE00000740692, Rank=18) Homo sapiens:chromosome:6:33668514-33668634:1
Exon(StableId=ENSE00000740661, Rank=19) Homo sapiens:chromosome:6:33668973-33669156:1
Exon(StableId=ENSE00000740608, Rank=20) Homo sapiens:chromosome:6:33670324-33670576:1
Exon(StableId=ENSE00000849417, Rank=21) Homo sapiens:chromosome:6:33670670-33670815:1
Exon(StableId=ENSE00000849418, Rank=22) Homo sapiens:chromosome:6:33671164-33671306:1
Exon(StableId=ENSE00000849419, Rank=23) Homo sapiens:chromosome:6:33672028-33672228:1
Exon(StableId=ENSE00000740502, Rank=24) Homo sapiens:chromosome:6:33673590-33673720:1
Exon(StableId=ENSE00000740467, Rank=25) Homo sapiens:chromosome:6:33674207-33674265:1
Exon(StableId=ENSE00000740429, Rank=26) Homo sapiens:chromosome:6:33675690-33675856:1
Exon(StableId=ENSE00000740405, Rank=27) Homo sapiens:chromosome:6:33676767-33676932:1
Exon(StableId=ENSE00000740371, Rank=28) Homo sapiens:chromosome:6:33677014-33677089:1
Exon(StableId=ENSE00000740349, Rank=29) Homo sapiens:chromosome:6:33677503-33677629:1
Exon(StableId=ENSE00000849425, Rank=30) Homo sapiens:chromosome:6:33678420-33678543:1
Exon(StableId=ENSE00000740286, Rank=31) Homo sapiens:chromosome:6:33678638-33678839:1
Exon(StableId=ENSE00000740255, Rank=32) Homo sapiens:chromosome:6:33679881-33680133:1
Exon(StableId=ENSE00000849428, Rank=33) Homo sapiens:chromosome:6:33680328-33680454:1
Exon(StableId=ENSE00000849429, Rank=34) Homo sapiens:chromosome:6:33680554-33680680:1
Exon(StableId=ENSE00000849430, Rank=35) Homo sapiens:chromosome:6:33682523-33682644:1
Exon(StableId=ENSE00000740102, Rank=36) Homo sapiens:chromosome:6:33683206-33683397:1
Exon(StableId=ENSE00000849432, Rank=37) Homo sapiens:chromosome:6:33684019-33684168:1
Exon(StableId=ENSE00000739995, Rank=38) Homo sapiens:chromosome:6:33684356-33684465:1
Exon(StableId=ENSE00000849434, Rank=39) Homo sapiens:chromosome:6:33684597-33684688:1
Exon(StableId=ENSE00000849435, Rank=40) Homo sapiens:chromosome:6:33684773-33684943:1
Exon(StableId=ENSE00000739793, Rank=41) Homo sapiens:chromosome:6:33685358-33685533:1
Exon(StableId=ENSE00000739740, Rank=42) Homo sapiens:chromosome:6:33685642-33685827:1
Exon(StableId=ENSE00000739679, Rank=43) Homo sapiens:chromosome:6:33686052-33686253:1
Exon(StableId=ENSE00000849439, Rank=44) Homo sapiens:chromosome:6:33686408-33686519:1
Exon(StableId=ENSE00000739573, Rank=45) Homo sapiens:chromosome:6:33687008-33687104:1
Exon(StableId=ENSE00000739476, Rank=46) Homo sapiens:chromosome:6:33687225-33687327:1
Exon(StableId=ENSE00000739410, Rank=47) Homo sapiens:chromosome:6:33687477-33687564:1
Exon(StableId=ENSE00000739360, Rank=48) Homo sapiens:chromosome:6:33688056-33688167:1
Exon(StableId=ENSE00000849443, Rank=49) Homo sapiens:chromosome:6:33688238-33688431:1
Exon(StableId=ENSE00000849444, Rank=50) Homo sapiens:chromosome:6:33688655-33688781:1
Exon(StableId=ENSE00000739123, Rank=51) Homo sapiens:chromosome:6:33689237-33689410:1
Exon(StableId=ENSE00000849446, Rank=52) Homo sapiens:chromosome:6:33690033-33690198:1
Exon(StableId=ENSE00000849447, Rank=53) Homo sapiens:chromosome:6:33690916-33691109:1
Exon(StableId=ENSE00000739023, Rank=54) Homo sapiens:chromosome:6:33691614-33691719:1
Exon(StableId=ENSE00000849449, Rank=55) Homo sapiens:chromosome:6:33691800-33691928:1
Exon(StableId=ENSE00001892279, Rank=56) Homo sapiens:chromosome:6:33692727-33692893:1
Exon(StableId=ENSE00000849451, Rank=57) Homo sapiens:chromosome:6:33693544-33693705:1
Exon(StableId=ENSE00000738920, Rank=58) Homo sapiens:chromosome:6:33694923-33695085:1
Exon(StableId=ENSE00001463151, Rank=59) Homo sapiens:chromosome:6:33695711-33696574:1

ENST00000281543 Homo sapiens:chromosome:4:44678428-44700926:1
Exon(StableId=ENSE00001001895, Rank=1) Homo sapiens:chromosome:4:44678428-44678787:1
Exon(StableId=ENSE00001001898, Rank=2) Homo sapiens:chromosome:4:44680440-44680552:1
Exon(StableId=ENSE00003548079, Rank=3) Homo sapiens:chromosome:4:44680693-44680842:1
Exon(StableId=ENSE00003539262, Rank=4) Homo sapiens:chromosome:4:44681122-44681203:1
Exon(StableId=ENSE00003654760, Rank=5) Homo sapiens:chromosome:4:44682333-44682411:1
Exon(StableId=ENSE00003534248, Rank=6) Homo sapiens:chromosome:4:44683234-44683318:1
Exon(StableId=ENSE00003640763, Rank=7) Homo sapiens:chromosome:4:44685958-44686023:1
Exon(StableId=ENSE00003563457, Rank=8) Homo sapiens:chromosome:4:44686509-44686713:1
Exon(StableId=ENSE00003613581, Rank=9) Homo sapiens:chromosome:4:44688006-44688146:1
Exon(StableId=ENSE00003484611, Rank=10) Homo sapiens:chromosome:4:44689285-44689409:1
Exon(StableId=ENSE00003562070, Rank=11) Homo sapiens:chromosome:4:44689842-44689975:1
Exon(StableId=ENSE00003554628, Rank=12) Homo sapiens:chromosome:4:44690716-44690860:1
Exon(StableId=ENSE00003498615, Rank=13) Homo sapiens:chromosome:4:44691665-44691799:1
Exon(StableId=ENSE00003605714, Rank=14) Homo sapiens:chromosome:4:44694411-44694513:1
Exon(StableId=ENSE00003628882, Rank=15) Homo sapiens:chromosome:4:44695614-44695734:1
Exon(StableId=ENSE00003507099, Rank=16) Homo sapiens:chromosome:4:44697407-44697444:1
Exon(StableId=ENSE00001326320, Rank=17) Homo sapiens:chromosome:4:44698543-44700926:1

ENST00000374316 ENSEMBL webpage: http://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000096433;r=6:33620365-33696574;t=ENST00000374316

ENST00000281543 ENSEMBL webpage: http://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44678429-44700926;t=ENST00000281543

vladie0 commented 9 years ago

I realized that PyCogent uses a different coordinate system than the Ensembl MySQL database. Is there a way to automatically fetch the Ensembl MySQL coordinates instead of PyCogent's internal coordinates?

GavinHuttley commented 9 years ago

The Location attribute is a cogent Coordinate. This class has EnsemblStart and EnsemblEnd attributes.

On 9 Sep 2015, at 6:30 pm, vladie0 notifications@github.com wrote:

I realized that PyCogent uses a different coordinate system than the Ensembl MySQL database. Is there a way to automatically fetch the Ensembl MySQL coordinates instead of PyCogent's internal coordinates?

vladie0 commented 9 years ago

When I use transcript.Location.EnsemblStart, the PyCogent start coordinate is returned; however, this does not correspond to the start site shown on the Ensembl website. Is there a way to directly fetch the Ensembl coordinates as represented on their website (their MySQL database)? Currently I'm creating a workaround in which the coordinates are fetched directly from the Ensembl core DB. An example is given below:

ENST00000281543:
Location.EnsemblStart: 44678429
Location.EnsemblEnd: 44700926
Coordinates on the Ensembl webpage: Chromosome 4: 44,680,446-44,702,943
Link: http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44680446-44702943;t=ENST00000281543

GavinHuttley commented 9 years ago

I've asked a colleague to look at this. Their comment is:

The website link provided points to an archived version of Ensembl database, while the pycogent query fetched the latest dataset.

Attached is a simple script that queries the latest human Ensembl database through PyCogent. It gives the following output:

"""
Lasted Ensembl release = 81
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='GUF1 GTPase homolog...'; StableId='ENSG00000151806'; Status='KNOWN'; Symbol='GUF1')
Location: Homo sapiens:chromosome:4:44678426-44700926:1
"""

This gives the same coordinates as the Ensembl website: http://asia.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000151806;r=4:44678427-44700926
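The archived-release explanation can be sanity-checked with the numbers already quoted in this thread. The sketch below only does arithmetic on those numbers; the variable and function names are made up for illustration. An indexing difference would normally move a coordinate by one, whereas an identical shift of thousands of bases at both ends looks more like the region sitting at a different position in a different assembly.

```python
# Coordinates quoted earlier in this thread for transcript ENST00000281543.
pycogent_release_81 = (44678429, 44700926)  # Location.EnsemblStart / Location.EnsemblEnd
grch37_archive_page = (44680446, 44702943)  # region shown on grch37.ensembl.org


def coordinate_shift(a, b):
    """Return (start_shift, end_shift) going from interval a to interval b."""
    return (b[0] - a[0], b[1] - a[1])


start_shift, end_shift = coordinate_shift(pycogent_release_81, grch37_archive_page)
print(start_shift, end_shift)

# The same shift at both ends means the transcript has the same length in both
# sources and has only moved along the chromosome.
same_length = start_shift == end_shift
print(same_length)
```

Both ends move by the same amount here, so the transcript length is unchanged between the two sources. That is consistent with the two sources using different assemblies, although it does not prove it.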

vladie0 commented 9 years ago

You are correct that PyCogent provides the correct gene location with the code specified above. However, the issue lies in the locations of transcripts and exons, which are incorrect even with the newest Ensembl version. For now I have circumvented the issue by connecting manually to the Ensembl core DB exon/transcript tables and importing the locations from there.

HuaEmilyYing commented 9 years ago

Can you give some specific examples of the inconsistency between PyCogent transcript/exon locations and the corresponding Ensembl links, together with the PyCogent code used?

Attached is a script that follows up on your question about transcript and exon locations. In this example there are four transcripts from the GUF1 gene, and each of them has multiple exons. The output file lists all the transcripts and their exons. The Ensembl web pages for those transcripts are:

1) GUF1-001, 17 exons http://asia.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44678427-44700926;t=ENST00000281543
2) GUF1-002, 16 exons http://asia.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44678427-44700926;t=ENST00000506793
3) GUF1-005, 9 exons http://asia.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44678427-44700926;t=ENST00000513775
4) GUF1-010, 3 exons http://asia.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000151806;r=4:44678427-44700926;t=ENST00000511493

As you've noticed, PyCogent uses 0-based indexing, while Ensembl uses 1-based indexing. Other than that, the coordinates are the same.
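As a rough illustration of the indexing difference, here is a small Python sketch that turns the printed form of a location into the 1-based region string used in the Ensembl URLs. The helper `to_ensembl_region` and its regular expression are hypothetical and not part of PyCogent. It assumes the printed start is 0-based and the printed end needs no adjustment, which is what the numbers in this thread suggest.

```python
import re

# Hypothetical helper, not part of PyCogent. It parses the printed form
# "Species:coord_type:name:start-end:strand" seen in the output in this thread.
LOCATION_RE = re.compile(
    r"^(?P<species>[^:]+):(?P<coord_type>[^:]+):(?P<name>[^:]+):"
    r"(?P<start>\d+)-(?P<end>\d+):(?P<strand>-?1)$"
)


def to_ensembl_region(location_text):
    """Convert a printed location into an Ensembl-style 'name:start-end' region."""
    match = LOCATION_RE.match(location_text.strip())
    if match is None:
        raise ValueError(f"unrecognised location: {location_text!r}")
    # Assumption: the printed start is 0-based, so add 1 for Ensembl's 1-based start.
    start = int(match.group("start")) + 1
    # Assumption: the printed end already equals Ensembl's inclusive end.
    end = int(match.group("end"))
    return f"{match.group('name')}:{start}-{end}"


print(to_ensembl_region("Homo sapiens:chromosome:4:44678428-44700926:1"))
# → 4:44678429-44700926
```

The result matches the `r=4:44678429-44700926` region in the ENST00000281543 link posted earlier in the thread. Within PyCogent itself, the EnsemblStart and EnsemblEnd attributes mentioned above are the intended way to get these values.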

test_coords py output
GavinHuttley commented 3 years ago

cogent3 is the port of PyCogent to Python 3. Please see the cogent3 docs for a description of how it differs from PyCogent.

For the ensembl querying, please see https://github.com/cogent3/ensembldb3