Doc about loading data?

abretaud commented 6 years ago

Hi, I'd like to use this module, but I have trouble understanding how to load data into chado. Is this done by the module (where?) or does it need to be done from command line (how?)? Any help would be greatly appreciated!

adf-ncgr commented 6 years ago

Hi, thanks for your interest! The scripts that we've been using for data loading in our project were started as part of an earlier effort (prior to our adopting tripal); nevertheless, they may still be useful to you so I just added them to this repository (see the scripts directory). They aren't particularly well documented, but are probably reasonably straightforward- if not, please feel free to let us know if you get stuck and we'd be happy to help.

I should note that the version on the default branch (lis_master) is what we're using for our mainline project development. There is another branch (master) that contains code that @spficklin wrote for a pure-PHP loader, more in keeping with the way that most core tripal modules work, but we haven't really done much with it since, and I'm not entirely clear on whether a version of it has been included in the tripal core in the new v3 release (we haven't tackled that upgrade yet). Sorry that this situation is a bit confusing; if it turns out that you have trouble using the lis_master branch due to project-specific code that may have crept in, do let us know and we can try to help resolve it.

thanks again

abretaud commented 6 years ago

Hi,

Thanks for the scripts. If possible I think I'll use the tripal web loaders, but the scripts could be useful anyway, at least for testing.

I've looked at the tripal code, and I found some data loaders that seem to be what I'm looking for. From what I understood:

In tripal2, there's a tripal_phylogeny module distributed in the official tripal code (https://github.com/tripal/tripal/tree/7.x-2.x/tripal_phylogeny).
In tripal3, this tripal_phylogeny module is available as a legacy module (https://github.com/tripal/tripal/tree/7.x-3.x/legacy/tripal_phylogeny) and is also included in the tripal3 code (https://github.com/tripal/tripal/blob/7.x-3.x/tripal_chado/api/modules/tripal_chado.phylotree.api.inc).

I guess there is some overlap between this tripal_phylogeny code and your phylotree module, but I'm not sure of what's specific to each. Is it just the way data is displayed or are they completely unrelated?

From a very quick test it looks like it's not possible to enable both tripal_phylogeny and tripal_phylotree at the same time.

adf-ncgr commented 6 years ago

Hi again- sorry for the slow response. The tripal_phylogeny module is derived from this one; I think what our README.md says about it is fairly accurate: "The Tripal Phylogeny module, http://tripal.info/extensions/modules/phylogeny, was initially based on this code. Tripal Phylogeny is more generic, and supports taxonomy trees as well. This LIS phylotree module is more oriented towards client side javascript features, for example using the BioJS MSA viewer." I'm not sure when the next tripal developer call is supposed to take place, but it might be a good topic for the agenda; if I recall correctly, we had basically decided we'd defer deciding how best to deal with the divergence between the two versions of the code until it became clear that some other group was interested in trying to use it. I guess now's the time...

abretaud commented 6 years ago

Hi, Sorry for my slow response too, I keep on getting distracted by other urgent things :/ Anyway, this phylogeny thing is coming at the top of my todo list for the coming weeks

I'm starting to do some tests with a dockerized tripal to see how each module works and how they interact. As I'm also trying to switch to tripal3 (and update python-tripal accordingly), it takes a bit of time, but I'll let you know when I have some news.

My goal is to use this phylogeny module together with the context viewer for insect genomes at http://bipaa.genouest.org

thanks

adf-ncgr commented 6 years ago

Sounds good- FYI I have done a little bit of work using the context viewer on insect genomes. In those cases I just used TreeFam families to do gene family assignments. There's one example here using Drosophila: https://github.com/legumeinfo/lis_context_viewer/wiki/Examples#segmental-rearrangements. I can give you some more details if you want to try to go that route for starters.

You may also be interested in another project that connects up to tripal_phylotree, which is here: https://github.com/legumefederation/lorax. This service basically allows users to add new sequences to pre-existing gene families and re-make the trees, using tripal_phylotree to display the results. We make use of the "lorax" service from (yet another module) https://github.com/legumeinfo/tripal_funnotate, which gives a front end that also does some functional annotation (e.g. running iprscan), but in principle the tree-building doesn't depend on that.

in any case, if you run into trouble or see room for improvements, you know where to find us, and regardless of which repository your issue gets placed in we'll be happy to try to help ;)

bradfordcondon commented 6 years ago

I'm considering adopting the visualizations in this module for HWG. I'd be curious how this has worked out for you @abretaud

I'm really interested in adapting the phytozome family trees you have at for example https://legumeinfo.org/chado_phylotree/phytozome_10_2.59026828

The gene families at LegumeInfo were built on the Phytozome 10.2 Angiosperm-level gene family models. Sequences from each species were placed in families based on best Hidden Markov Model match (using hmmsearch from the Hmmer package, v 3.1b2), with a minimum E-value match threshold of 0.1. Sequences in each family were realigned to the family's HMM using hmmalign, and then trimmed to include only match-state characters. Trees were generated using FastTree, and descriptors for the families were created using AHRD (Automatic assignment of human readable descriptions) on the consensus representation of the family generated with hmmemit.
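To make the family-assignment and trimming steps in that description concrete, here is a minimal sketch, not the actual LegumeInfo pipeline code. It assumes the hmmsearch results were written with `--tblout` (sequence ID in column 1, HMM/family name in column 3, full-sequence E-value in column 5), that "best match" means lowest E-value, and that the hmmalign rows mark insert states as lowercase residues and `.` characters. The function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_family_per_sequence(
    tblout_lines: Iterable[str], max_evalue: float = 0.1
) -> Dict[str, Tuple[str, float]]:
    """Map each sequence to its best family from hmmsearch --tblout rows.

    Assumption: "best" is taken to mean the lowest full-sequence E-value,
    and hits above max_evalue are discarded.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        seq_id, family_id, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue
        if seq_id not in best or evalue < best[seq_id][1]:
            best[seq_id] = (family_id, evalue)
    return best


def keep_match_states(aligned_seq: str) -> str:
    """Drop insert-state characters (lowercase residues and '.') from one
    aligned row, keeping uppercase match residues and '-' deletions."""
    return "".join(c for c in aligned_seq if not (c.islower() or c == "."))
```

The trimmed rows could then be written out as an alignment and passed to a tree builder such as FastTree; that step is left out here because it is just an external command call.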

It looks like we should be using the tripal core phylotree module to load the data and/or the perl scripts you have provided, and perhaps adapting this module for the js visualization?

Thanks

Bradford

abretaud commented 6 years ago

Hi, Sorry for not keeping you all updated (lack of time as usual..). I'm actively working on this these days, and I have many things still on my hdd that I will put online very soon.

I'm nearly done with loading the output from some orthofinder runs.

To load data, I think it could be possible to use the core phylotree module, but I don't do it this way because there is no way to enable both the core module and this LIS module. That's a problem for automating things.

As I didn't want to add yet another dependency to my system, I chose to port the perl scripts to python and integrate them in https://github.com/galaxy-genome-annotation/python-chado/pull/3. It's still WIP, but should be ok in a few days. (Let me know if there's any political concern about this port of the perl code to python-chado; I made it just to ease my life, with no intention to piss off anyone!)

I had to make some changes to the tripal_phylotree (lis flavour) code, I'll put it online as soon as it's stable enough. I mainly:

removed/replaced some text concerning legume
rewrote the materialized view and the corresponding drupal view displaying the number of genes per species, i.e. there were some hardcoded species names, and now it's dynamic (but gene numbers are no longer aggregated by genus)
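
For readers wondering what "dynamic" means here, the following is only an illustrative sketch, not the actual materialized view from the module. It uses an in-memory SQLite database with a heavily simplified subset of the Chado tables `organism`, `feature` and `phylonode`, and counts genes per species per tree with a `GROUP BY` instead of one hardcoded column per species. The sample rows and the helper names are made up.

```python
import sqlite3

# Simplified stand-ins for the Chado tables (only the columns used below).
SCHEMA = """
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, organism_id INTEGER, name TEXT);
CREATE TABLE phylonode (phylonode_id INTEGER PRIMARY KEY, phylotree_id INTEGER, feature_id INTEGER);
"""

# One row per (tree, species): no species name appears in the query itself.
COUNT_QUERY = """
SELECT n.phylotree_id, o.genus, o.species, COUNT(*) AS gene_count
FROM phylonode n
JOIN feature f ON f.feature_id = n.feature_id
JOIN organism o ON o.organism_id = f.organism_id
GROUP BY n.phylotree_id, o.organism_id, o.genus, o.species
ORDER BY n.phylotree_id, o.genus, o.species
"""


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO organism VALUES (?, ?, ?)",
        [(1, "Acyrthosiphon", "pisum"), (2, "Drosophila", "melanogaster")],
    )
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?)",
        [(10, 1, "g1"), (11, 1, "g2"), (12, 2, "g3")],
    )
    # Node 100 is an internal node (no feature); the inner join skips it.
    conn.executemany(
        "INSERT INTO phylonode VALUES (?, ?, ?)",
        [(100, 1, None), (101, 1, 10), (102, 1, 11), (103, 1, 12)],
    )
    return conn


def gene_counts(conn: sqlite3.Connection) -> list:
    """Return (phylotree_id, genus, species, gene_count) rows."""
    return conn.execute(COUNT_QUERY).fetchall()
```

Grouping by organism gives one row per species, which matches the note above that counts are no longer aggregated by genus; grouping by `o.genus` alone would restore that aggregation.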

I'm also working on using https://github.com/legumeinfo/lis_context_viewer/, I have added the loading code in python-chado, but I haven't finished testing it yet.

bradfordcondon commented 6 years ago

> To load data, I think it could be possible to use the core phylotree module but I don't do it this way because there is no way to enable both the core module and this LIS module

Interesting. The core module is legacy though, so if you are on a tripal 3 site it's a non-issue and you can use the v3 loader. Unfortunately the current loader can only handle 1 tree at a time, so I'm considering contributing a bulk load option for it. But it might be easier to just use the pre-existing perl loader if it works OK.

> removed/replaced some text concerning legume
>
> rewrote the materialized view and the corresponding drupal view displaying the number of genes per species, i.e. there were some hardcoded species names, and now it's dynamic (but gene numbers are no longer aggregated by genus)

Awesome, I would be interested in using this; please share when you get a chance!

Thanks and keep me posted :)

abretaud commented 6 years ago

> Interesting. The core module is legacy though, so if you are on a tripal 3 site it's a non-issue and you can use the v3 loader. Unfortunately the current loader can only handle 1 tree at a time, so I'm considering contributing a bulk load option for it. But it might be easier to just use the pre-existing perl loader if it works OK.

Not using tripal 3 yet. I've started looking into it, but some API stuff is missing for me at the moment. It's on my todo list though.

For bulk loading, that's one of the small changes I made in python-chado compared to the perl scripts: it can now take a single newick file, or a directory of newick files.
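
For anyone wanting the same behaviour in their own scripts, the "single file or directory" handling can be sketched roughly as below. This is not the python-chado code; the function name and the list of file extensions are assumptions made for the example.

```python
from pathlib import Path
from typing import List, Union

# Assumed extensions for Newick files; adjust to whatever your pipeline writes.
NEWICK_SUFFIXES = {".nwk", ".newick", ".tree", ".tre"}


def collect_newick_files(path: Union[str, Path]) -> List[Path]:
    """Return the Newick files to load.

    A file path is returned as-is (whatever its extension); a directory is
    scanned (non-recursively) for files with a known Newick extension.
    """
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        return sorted(
            f for f in p.iterdir()
            if f.is_file() and f.suffix.lower() in NEWICK_SUFFIXES
        )
    raise FileNotFoundError(f"No such file or directory: {p}")
```

Each returned path would then be handed to whatever loads a single tree, so the single-tree loader stays unchanged and bulk loading is just a loop around it.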

adf-ncgr commented 6 years ago

Hi guys- great to see this conversation and @abretaud we have absolutely no concern about the fate of those perl scripts; the fellow who originally wrote them will be quite delighted to learn that they've been reincarnated in python! We are also still not on tripal 3 but we have rough plans to make that move as well and would be happy to try to support an effort to port the module over. I'd also be interested in seeing the non-hardcoded version of the materialized view. @bradfordcondon I still need to respond to your email to the chado list regarding gene families, but seeing your interest in this module suggests that you may already be moving in the direction I was thinking of (although there are a few additional wrinkles in the way we do things that are somewhat outside the scope of this module that may also be relevant to you).

bradfordcondon commented 6 years ago

It's actually a multifaceted problem, the feature grouping. Loading in trees dbxref'd to phytozome families looks like a good solution for your workflow, and I would be incredibly grateful for any further input on storage, or on methodology not documented elsewhere.

That said, I think that feature grouping will remain an issue, since we may be interested in creating our own gene families, or creating orthologous groups which will inherently not link out to anything and therefore not be suitable for dbxref. This would be the output of OrthoFinder, which @abretaud is also running (and apparently loading in the same manner as this module, with the output groups as trees linked to the features via the nodes?)

> few additional wrinkles in the way we do things that are somewhat outside the scope of this module that may also be relevant to you

please please and thank you fill me in :)

bradfordcondon commented 6 years ago

> I had to make some changes to the tripal_phylotree (lis flavour) code, I'll put it online as soon as it's stable enough. I mainly:

hi @abretaud - friendly reminder i'm still interested in this if you have it available :)

abretaud commented 6 years ago

hi @bradfordcondon, I opened #28 if you want to have a look

bradfordcondon commented 6 years ago

Thank you very much, I appreciate it :)

legumeinfo / tripal_phylotree

Doc about loading data? #23