intermine / intermine

A powerful open source data warehouse system
http://intermine.org

HumanMine build #291

Closed boboppie closed 10 years ago

boboppie commented 11 years ago

/home/fh293/git/intermine-clone-1/imbuild/integrate.xml:54: The following error occurred while executing this line:
/home/fh293/git/intermine-clone-1/imbuild/source.xml:330: java.lang.RuntimeException: Exception while dataloading - to allow multiple errors, set the property "dataLoader.allowMultipleErrors" to true
Problem while loading item identifier 0_46 because
Conflicting values for field Gene.sequenceOntologyTerm between protein-atlas (value "SOTerm [description="A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.", id="1000017", identifier="SO:0000704", name="gene", namespace="sequence", obsolete="false", ontology=1000000]" in database with ID 10008506) and ensembl-human (value "SOTerm [description="null", id="6000081", identifier="SO:0000010", name="protein_coding", namespace="sequence", obsolete="false", ontology=1000000]" being stored). This field needs configuring in the genomic_priorities.properties file
at org.intermine.dataloader.ObjectStoreDataLoader.process(ObjectStoreDataLoader.java:165)

The gene ID is ENSG00000198888. In protein-atlas we set "gene" as the default SOTerm, but ensembl-human uses a different one ("protein_coding"), as the ensembl-human items XML shows:

      <reference name="sequenceOntologyTerm" ref_id="0_47" />
      <attribute name="primaryIdentifier" value="ENSG00000198888" />
      <collection name="dataSets">
         <reference ref_id="0_3" />
      </collection>
      <reference name="chromosome" ref_id="0_5" />
      <reference name="organism" ref_id="0_2" />
   </item>
   <item id="0_48" class="" implements="Location">
      <collection name="dataSets">
         <reference ref_id="0_3" />
      </collection>
      <attribute name="end" value="4262" />
      <reference name="feature" ref_id="0_46" />
      <reference name="locatedOn" ref_id="0_5" />
      <attribute name="strand" value="1" />
      <attribute name="start" value="3307" />
   </item>
   <item id="0_47" class="" implements="SOTerm">
      <attribute name="name" value="protein_coding" />
      <collection name="dataSets">
         <reference ref_id="0_3" />
      </collection>
      <reference name="ontology" ref_id="0_4" />
   </item>

Fix: give ensembl-human a higher priority than protein-atlas for Gene.sequenceOntologyTerm in genomic_priorities.properties.
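
A minimal sketch of what the priority setting means for this field. It is a simplified model and not InterMine's actual data loader. It assumes the priorities file uses the `Class.field = source, source` form with the highest-priority source listed first; the helper names `parse_priorities` and `resolve` are made up for illustration.

```python
# Simplified model of per-field source priorities; NOT InterMine's real code.
# Assumed entry for genomic_priorities.properties (first source wins).
PRIORITIES_TEXT = """
Gene.sequenceOntologyTerm = ensembl-human, protein-atlas
"""


def parse_priorities(text):
    """Map 'Class.field' to an ordered list of source names."""
    priorities = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        priorities[key.strip()] = [s.strip() for s in value.split(",") if s.strip()]
    return priorities


def resolve(field, candidates, priorities):
    """Return the value from the highest-priority source that supplied one.

    `candidates` maps source name -> value. An unconfigured field with
    conflicting values is an error, mirroring the failure in this build.
    """
    order = priorities.get(field)
    if order is None:
        raise ValueError(f"Conflicting values for field {field}: needs configuring")
    for source in order:
        if source in candidates:
            return candidates[source]
    raise ValueError(f"No configured source supplied a value for {field}")


priorities = parse_priorities(PRIORITIES_TEXT)
# The two conflicting SOTerm names reported in the error for item 0_46.
candidates = {"protein-atlas": "gene", "ensembl-human": "protein_coding"}
winner = resolve("Gene.sequenceOntologyTerm", candidates, priorities)
print(winner)
```

With ensembl-human listed first, the stored term is the ensembl-human one; with no entry for the field, the sketch raises, which corresponds to the build error.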

boboppie commented 11 years ago

TODO: set up a URL for the mine, human.intermine.org.

boboppie commented 10 years ago

Installed bioseg.

boboppie commented 10 years ago

TODO: we need a human identifier resolver that covers NCBI, Ensembl and HGNC and resolves any ID to an NCBI ID.
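
A rough sketch of the resolver idea, assuming the cross-references are available as rows of (NCBI ID, Ensembl ID, HGNC ID). The row layout, the function names and the NCBI ID value are placeholders for illustration; only the Ensembl and HGNC IDs are taken from links in this thread.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical cross-reference rows: (ncbi_id, ensembl_id, hgnc_id).
# "NCBI_PLACEHOLDER_1" is NOT a real Entrez ID; it stands in for one.
XREFS = [
    ("NCBI_PLACEHOLDER_1", "ENSG00000004059", "HGNC:658"),
]


def build_index(xrefs: Iterable[Tuple[str, Optional[str], Optional[str]]]) -> Dict[str, str]:
    """Index every known identifier (including the NCBI ID itself) by NCBI ID."""
    index: Dict[str, str] = {}
    for ncbi_id, ensembl_id, hgnc_id in xrefs:
        for identifier in (ncbi_id, ensembl_id, hgnc_id):
            if identifier:
                index[identifier] = ncbi_id
    return index


def resolve_to_ncbi(identifier: str, index: Dict[str, str]) -> Optional[str]:
    """Return the NCBI ID for any known identifier, or None if unresolvable."""
    return index.get(identifier.strip())


index = build_index(XREFS)
print(resolve_to_ncbi("ENSG00000004059", index))
print(resolve_to_ncbi("HGNC:658", index))
print(resolve_to_ncbi("ENSG00000198888", index))  # not in the toy table
```

Returning `None` instead of raising keeps unresolved identifiers visible to the caller, so they can be counted or logged rather than silently dropped.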

boboppie commented 10 years ago

The human ID resolver will miss some (possibly many) transcripts (mRNA, etc.) from Ensembl, since there are no equivalent entities in NCBI/HGNC (the gene models differ). For example, take ENST00000000233 (http://www.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000004059;r=7:127228399-127231759;t=ENST00000000233): its gene (ENSG00000004059) has another 5 transcript products, but only ENST00000000233 has a CCDS ID (CCDS34745). The same holds in HGNC (http://www.genenames.org/data/hgnc_data.php?hgnc_id=658): only this transcript can be resolved by CCDS ID (not a 100% match), so the other 5 Ensembl entities will be lost.
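
To make the loss concrete, here is a small sketch that splits a gene's transcripts into those resolvable by CCDS ID and those that would be dropped. The five `ENST_PLACEHOLDER_*` names are invented stand-ins for the gene's other transcripts, and `split_by_ccds` is a hypothetical helper; only ENST00000000233 and CCDS34745 come from the example above.

```python
from typing import Dict, List, Optional, Tuple

# Transcript ID -> CCDS ID (None when the transcript has no CCDS ID).
# Placeholder names stand in for the 5 other transcripts of ENSG00000004059.
TRANSCRIPTS: Dict[str, Optional[str]] = {
    "ENST00000000233": "CCDS34745",
    "ENST_PLACEHOLDER_1": None,
    "ENST_PLACEHOLDER_2": None,
    "ENST_PLACEHOLDER_3": None,
    "ENST_PLACEHOLDER_4": None,
    "ENST_PLACEHOLDER_5": None,
}


def split_by_ccds(transcripts: Dict[str, Optional[str]]) -> Tuple[Dict[str, str], List[str]]:
    """Return (resolvable transcripts with their CCDS ID, sorted list of lost ones)."""
    resolved = {t: ccds for t, ccds in transcripts.items() if ccds}
    lost = sorted(t for t, ccds in transcripts.items() if not ccds)
    return resolved, lost


resolved, lost = split_by_ccds(TRANSCRIPTS)
print(f"resolved: {len(resolved)}, lost: {len(lost)}")
```

Reporting the lost list per gene would show how much of the Ensembl gene model disappears before deciding whether to resolve transcripts at all.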

boboppie commented 10 years ago

Exons/CDSs don't have IDs or names in GenBank, but Ensembl has internal IDs for them. How should these be resolved? Discard them?

boboppie commented 10 years ago

ncbi-summary resolves Entrez IDs to HGNC symbols, which causes the loss of 134 genes (mostly microRNAs), e.g. http://www.ncbi.nlm.nih.gov/gene/100526648