legumeinfo / glycinemine

An InterMine for Glycine species
GNU Lesser General Public License v3.0

missing syntenic regions? #17

Closed adf-ncgr closed 5 years ago

adf-ncgr commented 5 years ago

This seems to be true in more mines than just soymine. In particular, syntenic regions are missing from legumemine, where you'd expect them to be if anywhere. I think an argument can be made for having soymine at least contain the syntenic regions from the self-comparison, even if storing comparisons to non-soy genomes doesn't prove sensible in the species mine. Note that this came up from @cann0010's list of desired templates.

sammyjava commented 5 years ago

I've only been loading synteny on LegumeMine. Sort of a value-added feature. Happy to load it where appropriate on other mines.

sammyjava commented 5 years ago

The lack of synteny on LegumeMine is a loading bug. Like, I forgot. :)

adf-ncgr commented 5 years ago

OK, I guess you can load them into legumemine when you are back, and we can discuss the potential utility of loading at least self-synteny blocks for the species mines; soybean is the most interesting case due to its more recent whole genome duplication. Also, we shouldn't forget cases of multi-genome species like chickpea, which will be increasingly common.

It might be fun and instructive to see whether we could make use of GCV services to simply compute them at load time with a fixed set of macro-synteny parameters. The only downside I can see is that we wouldn't get the "added value" of the median Ks values; a possible upside would be the theoretical ability to actually get the gene pairings underlying each block (i.e. the "syntelogs"). If you feel shy about relying on some external service for this, I could imagine adapting the relevant GCV service code to be callable against genomes provided from an InterMine-backed data source. @alancleary is all about getting away from total reliance on chado as the back end.

sammyjava commented 5 years ago

I'm totally down with relying on external REST APIs and such. I'm not down with having to install local server-side apps, because that makes installing a mine harder, i.e. more than just installing the mine.

I don't think median Ks has much particular value. I doubt anyone really cares.

adf-ncgr commented 5 years ago

OK, REST assured I won't force you to do anything you're not down with; I was actually thinking less about a local server-side app than about a bit of code that would be used as part of the loader (more complicated than determining spanned genes, but in principle just some code that runs as a post-processor). I guess that might mean implementing it in Java or something. But in your case (since the community has already set up a GCV, loaded all your genomes, and adhered faithfully to all your naming conventions), it might not provide any advantage over calling the web service.
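For comparison, the simpler "determining spanned genes" post-processing step can be sketched as an interval-containment check. This is only a minimal illustration in Python, assuming 1-based inclusive coordinates and assuming "spanned" means fully contained in the block; `Feature` and `spanned_genes` are hypothetical names, not anything taken from the mine's loader code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A named interval on a chromosome (hypothetical type for illustration)."""
    name: str
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def spanned_genes(block, genes):
    """Return the names of genes lying entirely inside the block.

    A gene counts only if it is on the same chromosome as the block and
    both of its ends fall within the block's coordinates.
    """
    return [
        gene.name
        for gene in genes
        if gene.chromosome == block.chromosome
        and gene.start >= block.start
        and gene.end <= block.end
    ]


block = Feature("block1", "Gm01", 100, 500)
genes = [
    Feature("g1", "Gm01", 120, 200),  # inside the block
    Feature("g2", "Gm01", 450, 600),  # runs past the block end
    Feature("g3", "Gm02", 120, 200),  # different chromosome
]
print(spanned_genes(block, genes))
```

If partially overlapping genes should also count, the two coordinate comparisons would become an overlap test (`gene.start <= block.end and gene.end >= block.start`) instead.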

Regarding median Ks, I agree that it takes a special person to be interested, but I would note that in the case of soybean it is basically how synteny blocks from the recent whole genome duplication are discriminated from those of the old WGD that is shared by most of the legumes. So it has some interest, but better still might be to get access to the per-gene-pair Ks values and load those explicitly.
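To make the Ks point concrete, here is a rough sketch of how a block could be labelled from its per-gene-pair Ks values. The `classify_block` name, the label strings, and especially the `cutoff=0.3` default are assumptions for illustration only; the cutoff is a placeholder that would need to be chosen from the actual Ks distribution, not a value agreed anywhere in this thread.

```python
from statistics import median


def classify_block(pair_ks, cutoff=0.3):
    """Label a synteny block by the median Ks of its gene pairs.

    pair_ks: non-empty list of Ks values, one per gene pair in the block.
    cutoff:  illustrative placeholder separating "recent" from "old".
    Returns (label, median_ks).
    """
    block_median = median(pair_ks)  # raises StatisticsError on an empty list
    label = "recent_duplication" if block_median < cutoff else "old_duplication"
    return label, block_median


print(classify_block([0.10, 0.12, 0.50]))
print(classify_block([0.5, 0.6, 0.7, 0.8]))
```

Keeping the per-gene-pair values as the input means nothing is lost by also storing the median; the reverse is not true.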

Are you running out of things to keep you busy in Sweden? I keep thinking your vacation is the perfect opportunity for me to have the last word on something (until you return) ;)

alancleary commented 5 years ago

I like this idea, though I think you're right @sammyjava that requiring the installation of another server-side app is too much to ask. Perhaps this is an opportunity to explore a micro-services architecture that utilizes service discovery, e.g., you tell the mine about the service registry when you spin it up, it registers itself, and then it leverages relevant services, such as the macro-synteny service, if available.
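As a very rough picture of that flow, here is an in-memory sketch in Python. Everything in it is hypothetical: `ServiceRegistry`, `Mine`, the `"macro-synteny"` service name, and the example.org URLs are made up for illustration, and a real registry would be a networked service rather than a dictionary.

```python
class ServiceRegistry:
    """Toy registry mapping service names to URLs (stand-in for a real one)."""

    def __init__(self):
        self._services = {}

    def register(self, name, url):
        self._services[name] = url

    def lookup(self, name):
        """Return the URL for a service, or None if it is not registered."""
        return self._services.get(name)


class Mine:
    """A mine that registers itself at start-up and uses optional services."""

    def __init__(self, name, url, registry=None):
        self.name = name
        self.registry = registry
        if registry is not None:
            registry.register(name, url)  # the mine announces itself

    def synteny_source(self):
        """Use the macro-synteny service if available, else preloaded data."""
        url = self.registry.lookup("macro-synteny") if self.registry else None
        return url if url is not None else "preloaded GFF"


registry = ServiceRegistry()
mine = Mine("soymine", "https://example.org/soymine", registry)
print(mine.synteny_source())  # no synteny service registered yet
registry.register("macro-synteny", "https://example.org/macro-synteny")
print(mine.synteny_source())  # now the service is discovered
```

The point of the fallback is that the mine still installs and runs on its own; the service is an optional extra rather than a dependency.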

It's also worth noting that I've been meaning to encapsulate the macro-synteny algorithm itself in a more performant / portable program. So if you do decide to leverage it, via micro-services or otherwise, it might be worth combining our efforts on that front.

adf-ncgr commented 5 years ago

You had me at service discovery (but obviously we'll have to discuss further what we all mean by this).

sammyjava commented 5 years ago

Yeah, sounds a little fancy for my crude education. But if I can code a JSP plus controller to get the data and do something with it, sure. Beware of overdesign.

sammyjava commented 5 years ago

I don't have any GFFs for same-genome synteny. Do they exist? On the data store? I'll bump this over to you / The Synteny Team to make those available, at least for soybean.

adf-ncgr commented 5 years ago

For soybean the data is here:
https://legumeinfo.org/data/public/Glycine_max/Wm82.gnm2.ann1.syn1.HXNY/glyma.Wm82.gnm2.ann1.x.glyma.Wm82.gnm2.ann1.recent_duplication.gff.gz
https://legumeinfo.org/data/public/Glycine_max/Wm82.gnm2.ann1.syn1.HXNY/glyma.Wm82.gnm2.ann1.x.glyma.Wm82.gnm2.ann1.old_duplication.gff.gz

These appear to be closer to the current naming conventions than what I had used way-back-when to load into chado, but there may be some other aspects of the structure that differ from what you are used to, so if you encounter trouble let's discuss further. Reassignment ball is back in your court...
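A quick way to check the structure before loading is to read the files into plain records and look at them. The sketch below assumes only the standard nine tab-separated GFF columns with `key=value` attributes; the function names and the sample line (feature type, IDs, attribute keys) are made up for illustration and may not match what these particular files contain.

```python
import gzip


def parse_gff_lines(lines):
    """Parse GFF lines into dicts, skipping comments and malformed rows."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard nine-column feature line
        attributes = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        records.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attributes,
        })
    return records


def read_gff_gz(path):
    """Read a gzip-compressed GFF file from a local path."""
    with gzip.open(path, "rt") as handle:
        return parse_gff_lines(handle)


# Made-up sample line, only to show the shape of the output.
sample = [
    "##gff-version 3\n",
    "Gm01\texample\tsyntenic_region\t100\t500\t.\t+\t.\tID=b1;Name=block1\n",
]
print(parse_gff_lines(sample))
```

After downloading one of the files, `read_gff_gz` on the local copy would show which feature types and attribute keys are actually present.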