Open lwinfree opened 7 years ago
Copied from an email, from Kent: "We don't currently support a standard Dipper format, but we aim to support common formats such as GAF, VCF, GFF3. I would be curious what their thoughts are on using Panther for orthology as we already support this, or if they have additional or improved models for orthology.
As always any phenotype or data on disease models would be a top priority, if they have that."
Noting #329. Over on that side, we want to have a fairly standardized ingest method using SPARQL, etc.
Taken from email from Xenbase, about the info they have compiled so far:
genes: This information is mostly included in <redacted - ask Lilly> in columns 1, 2 and 5. What is missing is a Sequence Ontology reference, although I believe all of ours will be SO:0000704; I'm not sure whether we have any pseudogenes (SO:0000336) on our gene pages.
sequence alterations (includes SNPs/del/ins/indel and large chromosomal rearrangements): We do not capture this sort of genetic variation information. The closest we come is in mutant lines, of which we only have one.
transgenic constructs: We don't have a file on the FTP site, but some of the relevant information is in the transgene OBO file we use in Phenote, <redacted - ask Lilly>. We do not have links between constructs and normal gene pages except in a few cases; in fact, under the current system we only have such links for specific promoter-driven constructs. It might be desirable to add this functionality to other transgene elements such as ORFs.
morpholinos, talens, crisprs as expression-affecting reagents: We have this information in our morpholino data tables but not in an export file. Currently we have nothing for TALENs and CRISPRs, although we certainly could capture CRISPR guide RNAs in a system similar to our current one for morpholinos. Our only information on guide RNAs at the moment is in free-text notes in the experimental manipulation fields of Phenote annotations.
genotypes, and their components: We do not decompose our genotypes in the same way as ZFIN. The closest we have is a theoretical linkage between Strains, Transgenes and Lines, which doesn't exist in most cases. Our system allows Lines to be assigned to a Background Strain and to have a Transgene linked to them through the 'Transgene' tab, but in most cases the linkages are absent.
fish (as comprised of intrinsic and extrinsic genotypes): Currently this data is only contained in our Phenote data, as the combination of ‘Background’ and particular types of Experimental Modification (specifically Morpholinos and mRNAs). We do not yet have a clearly defined delineation of exactly which elements should be incorporated in the FROG name (chemicals, perhaps) and which should be left out (general environmental manipulations such as heat shock, for example).
publications (and their mapping to PMIDs, if available): The relevant mapping is mostly contained in Literature Matched Genes By Paper.txt, although it doesn't have as much detail as the equivalent ZFIN file. Additional information such as authors and titles is in the database but not in the export file. The current file lacks mappings for papers which do not have referenced genes curated, although such papers might have phenotype data.
genotype-to-phenotype associations (including environments and stages at which they are assayed): Currently this data is only contained in our Phenote data.
environmental components: Currently this data is only contained in our Phenote data. These form the converse set to those manipulations included in the ‘fish (as comprised of intrinsic and extrinsic genotypes)’ category. In ZFIN’s case they also include chemical treatments in this category.
orthology to human genes: This is in the Xenbase Gene Human Ortholog Mapping.txt file.
genetic positional information for genes and sequence alterations: We have gene position information in GFF files, but we do not capture sequence alteration data in this way. We have no equivalent to the data in the ZFIN ‘mappings.txt’ file, which has information on genetic loci in rather old-fashioned centiMorgan and centiRay units. Xenopus has not historically been a model genetic system, so it lacks similar linkage map analyses; some exist for tropicalis, but we do not have them.
fish-to-disease model associations: Our equivalent file would be Xenbase Omim Data.txt, but we use OMIM rather than the Disease Ontology (DO), which ZFIN links to. We do have the option of curating DO ID associations in Phenote, and the Phenote file would also capture the relevant PMID reference and evidence code.
All of those are our best approximate equivalents to the ZFIN files pulled by Dipper. They may not be the best sources for the specific data required by Monarch.
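On the missing Sequence Ontology reference for genes noted above: a minimal sketch of how an ingest step could fill it in, assuming the gene export is tab-delimited and that column 1 holds a gene identifier. Neither assumption is confirmed in this thread. The function name `gene_records`, the keys `col1`/`col2`/`col5`, and the sample rows are hypothetical placeholders, not part of Dipper or of any Xenbase file.

```python
import csv
import io

SO_GENE = "SO:0000704"        # Sequence Ontology term for "gene"
SO_PSEUDOGENE = "SO:0000336"  # Sequence Ontology term for "pseudogene"


def gene_records(handle, pseudogene_ids=frozenset()):
    """Yield one dict per row, keeping columns 1, 2 and 5 (1-based).

    Every row defaults to SO:0000704 unless its first column is listed
    in pseudogene_ids (a hypothetical, separately curated set).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or short rows
        col1, col2, col5 = row[0], row[1], row[4]
        yield {
            "col1": col1,  # assumed to be the gene identifier
            "col2": col2,
            "col5": col5,
            "so_type": SO_PSEUDOGENE if col1 in pseudogene_ids else SO_GENE,
        }


# Made-up sample rows for illustration only, not real Xenbase data.
sample = io.StringIO("ID-1\tsym1\tx\ty\tval1\nID-2\tsym2\tx\ty\tval2\n")
records = list(gene_records(sample, pseudogene_ids={"ID-2"}))
```

Defaulting to the gene term and overriding from a small curated set keeps the export file itself unchanged, which may be easier than adding a new column on the Xenbase side.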
Thanks Malcolm, all of the information from y'all is very helpful right now! :)
This ticket is for starting documentation to allow external collaborators to ingest their data sources. This ticket was inspired by Xenbase contacting us requesting help with documentation for the core data necessary for a source to be ingested.
Goals: