legumeinfo / tripal_phylotree

LIS project: Tripal module for Chado phylogeny and gene families
GNU General Public License v2.0

Sample data #33

Open spficklin opened 6 years ago

spficklin commented 6 years ago

In an effort to write unit tests for the Newick file importer that comes with Tripal, do you have a file that could be shared? We would need the tree file in Newick format, a FASTA file containing all of the gene/protein sequences, and the organism to which those FASTA sequences belong.

Thanks much!

adf-ncgr commented 6 years ago

Hi @spficklin - all the data for our gene families trees is available here: https://legumeinfo.org/data/public/Gene_families/legume.genefam.fam1.M65K/

If you grab the tarball of trees (legume.genefam.fam1.M65K.trees_ML_rooted.tar.gz) and the corresponding tarball of per-family FASTAs (legume.genefam.fam1.M65K.family_fasta.tar.gz), I think that will give you what you wanted.

Note that these sequences are the unaligned versions, but their IDs should correspond to the leaf node labels in the trees. If they don't, let me know; it's possible the tarball hasn't been updated to reflect some fixes in that regard.
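A quick way to check that the FASTA IDs really do match the leaf labels is sketched below. It is a minimal sketch, not part of the Tripal importer: the function names are made up for illustration, the sample IDs are hypothetical, and the regex-based leaf extraction assumes unquoted labels without spaces and a tree with at least one pair of parentheses.

```python
import re


def newick_leaf_labels(newick: str) -> set[str]:
    """Return leaf labels from a Newick string.

    A leaf label directly follows "(" or ","; internal node labels and
    support values follow ")" and are therefore skipped.
    Assumption: labels are unquoted and contain no whitespace.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def fasta_ids(fasta_text: str) -> set[str]:
    """Return sequence IDs (first token after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def compare_ids(newick: str, fasta_text: str) -> tuple[list[str], list[str]]:
    """Return (labels only in the tree, IDs only in the FASTA), sorted."""
    leaves = newick_leaf_labels(newick)
    seqs = fasta_ids(fasta_text)
    return sorted(leaves - seqs), sorted(seqs - leaves)


if __name__ == "__main__":
    # Hypothetical tree and sequences, not taken from the real tarballs.
    tree = "((glyma.G1:0.1,medtr.M1:0.2)0.9:0.05,phavu.P1:0.3);"
    fasta = ">glyma.G1 some description\nACGT\n>medtr.M1\nAC\n>lotja.L1\nGG\n"
    only_in_tree, only_in_fasta = compare_ids(tree, fasta)
    print("only in tree:", only_in_tree)
    print("only in FASTA:", only_in_fasta)
```

Running this per family (one tree file and its matching FASTA file) would flag any family where the two sets of IDs have drifted apart.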

Regarding the organisms, I'm not sure exactly what you'll need, but we use "gensp." prefixing to denote the species of origin (i.e. "glyma" => Glycine max, "medtr" => Medicago truncatula, etc.). I can give you a more detailed list if I know how you plan to handle this (in our case, the loader expects that the annotations have already been loaded and just does a lookup for them).

spficklin commented 6 years ago

Thanks @adf-ncgr. I've gotten back to this. Do you have a lookup table that maps your organism "gensp" prefix to the taxonomic name? I want to import a FASTA file from the tarball you mentioned above, but I need to know the species that each sequence belongs to.

adf-ncgr commented 6 years ago

Hi @spficklin- there may be a few quirks in the following extraction from our organism table, in particular with some of the non-legume species, but hopefully it will be close enough to give you the relevant info (e.g. you'll probably see easily that Arabidopsis thaliana would be arath in "gensp" representation instead of A. thaliana). Let me know if there's anything in the fasta you grabbed that you can't glean from this, or if you have other questions- thanks for moving it along...

      abbreviation      |    genus     |         species
------------------------+--------------+--------------------------
 glyma                  | Glycine      | max
 lupal                  | Lupinus      | albus
 O. sativa              | Oryza        | sativa
 A. thaliana            | Arabidopsis  | thaliana
 phaco                  | Phaseolus    | coccineus
 vicfa                  | Vicia        | faba
 P. persica             | Prunus       | persica
 S. lycopersicum        | Solanum      | lycopersicum
 V. vinifera            | Vitis        | vinifera
 Z. mays                | Zea          | mays
 A. trichopoda          | Amborella    | trichopoda
 araip                  | Arachis      | ipaensis
 consensus              | consensus    | consensus
 lencu                  | Lens         | culinaris
 cajca                  | Cajanus      | cajan
 cicar.ICC4958          | Cicer        | arietinum_ICC4958
 trire                  | Trifolium    | repens
 cicar.CDCFrontier      | Cicer        | arietinum_CDCFrontier
 medtr                  | Medicago     | truncatula
 vigra                  | Vigna        | radiata
 lotja                  | Lotus        | japonicus
 lupan                  | Lupinus      | angustifolius
 tripr                  | Trifolium    | pratense
 medsa                  | Medicago     | sativa
 vigun                  | Vigna        | unguiculata
 apiam                  | Apios        | americana
 cucsa                  | Cucumis      | sativus
 chafa                  | Chamaecrista | fasciculata
 prupe.Lovell.gnm2.ann1 | Prunus       | persica.Lovell.gnm2.ann1
 vigan                  | Vigna        | angularis
 pea                    | Pisum        | sativum
 arahy                  | Arachis      | hypogaea
 aradu                  | Arachis      | duranensis
 phavu                  | Phaseolus    | vulgaris
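For assigning an organism to each FASTA record, the table above could be used roughly as follows. This is only a sketch under assumptions: it includes a subset of the rows, it assumes sequence IDs have the form "<abbreviation>.<rest>", the sample IDs are hypothetical, and the helper names are invented here. Because some abbreviations contain dots themselves (e.g. cicar.ICC4958), the lookup tries the longest abbreviation first. The `gensp` helper reflects the apparent convention (three letters of the genus plus two of the species), which does not hold for every row (e.g. "pea").

```python
# Subset of the abbreviation table from the thread; extend as needed.
ORGANISMS = {
    "glyma": ("Glycine", "max"),
    "medtr": ("Medicago", "truncatula"),
    "phavu": ("Phaseolus", "vulgaris"),
    "lotja": ("Lotus", "japonicus"),
    "cicar.ICC4958": ("Cicer", "arietinum_ICC4958"),
    "cicar.CDCFrontier": ("Cicer", "arietinum_CDCFrontier"),
    "pea": ("Pisum", "sativum"),
}


def gensp(genus: str, species: str) -> str:
    """Derive a 'gensp' code: 3 letters of genus + 2 of species, lowercased.

    Assumption: this matches most rows in the table, but not all of them.
    """
    return (genus[:3] + species[:2]).lower()


def organism_for(seq_id: str, table: dict = ORGANISMS):
    """Return (genus, species) for a sequence ID, or None if unknown.

    Longest abbreviations are tried first so that dotted abbreviations
    such as 'cicar.ICC4958' are matched in full.
    """
    for abbr in sorted(table, key=len, reverse=True):
        if seq_id == abbr or seq_id.startswith(abbr + "."):
            return table[abbr]
    return None


if __name__ == "__main__":
    # Hypothetical sequence IDs for illustration only.
    for seq_id in ["glyma.Glyma.01G000100.1", "cicar.ICC4958.Ca_00001", "xxxxx.1"]:
        print(seq_id, "->", organism_for(seq_id))
    print(gensp("Arabidopsis", "thaliana"))
```

Rows such as "A. thaliana" would need their abbreviation rewritten (here, to the value `gensp` returns) before being added to the dictionary.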

spficklin commented 6 years ago

This is great. Thanks. I'll let you know how it goes.