Data Sources - Githubissues

sammyjava commented 6 years ago

It's HIGH TIME to annotate all the data with the data source it came from. START IMPLEMENTING THIS.

sammyjava commented 6 years ago

Basic page is in place, more can be done on this topic so leaving it open.

vivekkrish commented 6 years ago

Hi @sammyjava (and @adf-ncgr) -

Both ThaleMine and MedicMine use InterMine's built-in provenance classes, "DataSource" and "DataSet", to annotate the data being loaded into the mines.

For this purpose, we maintain a tab-delimited file called YOURMINE/integrate/datasets.txt that contains all relevant information (i.e. source/set name, description, URL, publication, version, release date).

This file is processed by a Perl script, bio/scripts/make-datasets-xml.pl, which is invoked like so:

    perl bio/scripts/make-datasets-xml.pl \
        YOURMINE/integrate/datasets.txt bio/core/core.xml \
        > YOURMINE/integrate/datasets.xml

The output is an XMLised version of the provenance info (based on the intermine-items-xml format), which is then loaded as a "source" (configured in project.xml).

Here is what happens when this script is executed:

If the input DataSource and DataSet information is as follows (columns are tab-delimited):

DataSource  Panther            Orthologue and paralogue relationships based on the inferred speciation and gene duplication events in the phylogenetic tree   https://www.pantherdb.org   Publication:23193289
DataSet     Panther data set   Panther orthologues from Yeast, Roundworm, Fruit Fly, Zebrafish, Human, Mouse and Rat and paralogues from Arabidopsis          https://www.pantherdb.org   DataSource:Panther    8.1

then the XML produced by this script will look like this:

<item id="0_1" class="Publication" implements="">
   <attribute name="pubMedId" value="23193289" />
</item>
<item id="0_2" class="DataSource" implements="">
   <attribute name="name" value="Panther" />
   <attribute name="description" value="Orthologue and paralogue relationships based on the inferred speciation and gene duplication events in the phylogenetic tree" />
   <attribute name="url" value="https://www.pantherdb.org" />
   <collection name="publications">
      <reference ref_id="0_1" />
   </collection>
</item>
<item id="0_3" class="DataSet" implements="">
   <attribute name="version" value="8.1" />
   <reference name="dataSource" ref_id="0_2" />
   <attribute name="name" value="Panther data set" />
   <attribute name="description" value="Panther orthologues from Yeast, Roundworm, Fruit Fly, Zebrafish, Human, Mouse and Rat and paralogues from Arabidopsis" />
   <attribute name="url" value="https://www.pantherdb.org" />
</item>
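To make the mapping from rows to items concrete, here is a minimal Python sketch of the same transformation. It is not the actual make-datasets-xml.pl script; it only mimics the example above. The column layout (type, name, description, URL, then optional "Publication:<id>", "DataSource:<name>" and a bare version field) is an assumption inferred from the two sample rows, and the function name rows_to_items is hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_to_items(text):
    """Turn tab-delimited DataSource/DataSet rows into a list of <item> elements.

    Assumed layout (inferred from the sample rows, not from the real script):
    type, name, description, url, then optional extras such as
    "Publication:<pubMedId>", "DataSource:<name>" or a bare version string.
    """
    items = []
    ids = {}  # (class name, key) -> item id, used to resolve references

    def new_item(cls):
        item = ET.Element(
            "item", {"id": f"0_{len(items) + 1}", "class": cls, "implements": ""}
        )
        items.append(item)
        return item

    def add_attr(item, name, value):
        ET.SubElement(item, "attribute", {"name": name, "value": value})

    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) < 4:
            raise ValueError(f"expected at least 4 columns, got: {line!r}")
        kind, name, description, url, *extras = cols

        pub_refs, source_ref, version = [], None, None
        for extra in extras:
            if extra.startswith("Publication:"):
                pmid = extra.split(":", 1)[1]
                if ("Publication", pmid) not in ids:
                    pub = new_item("Publication")
                    add_attr(pub, "pubMedId", pmid)
                    ids[("Publication", pmid)] = pub.get("id")
                pub_refs.append(ids[("Publication", pmid)])
            elif extra.startswith("DataSource:"):
                # The DataSource row must appear before the DataSet that cites it.
                source_ref = ids[("DataSource", extra.split(":", 1)[1])]
            elif extra:
                version = extra

        item = new_item(kind)
        ids[(kind, name)] = item.get("id")
        if version:
            add_attr(item, "version", version)
        if source_ref:
            ET.SubElement(item, "reference", {"name": "dataSource", "ref_id": source_ref})
        add_attr(item, "name", name)
        add_attr(item, "description", description)
        add_attr(item, "url", url)
        if pub_refs:
            coll = ET.SubElement(item, "collection", {"name": "publications"})
            for ref in pub_refs:
                ET.SubElement(coll, "reference", {"ref_id": ref})
    return items


if __name__ == "__main__":
    sample = (
        "DataSource\tPanther\tOrthologue and paralogue relationships\t"
        "https://www.pantherdb.org\tPublication:23193289\n"
        "DataSet\tPanther data set\tPanther orthologues\t"
        "https://www.pantherdb.org\tDataSource:Panther\t8.1\n"
    )
    for item in rows_to_items(sample):
        print(ET.tostring(item, encoding="unicode"))
```

The sketch creates the Publication item before the DataSource that cites it, so the ids come out in the same order as in the example (0_1, 0_2, 0_3). The printed XML is not indented, and the element order inside each item may differ slightly from what the real script emits.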

Once this information is populated into the mine, it is exposed via the "Data Sources" page. Example: http://medicmine.jcvi.org/medicmine/dataCategories.do (code: webapp/dataCategories.jsp)

I'm assuming that in your case, you can amend your chado-to-intermine loaders to also load the corresponding provenance information, and then construct a dynamic data sources page.

adf-ncgr commented 6 years ago

Thanks @vivekkrish! I probably don't grasp all the details, but offhand it seems like what you're describing could fit in naturally with @sammyjava's plan to build the mines from the files in the DataStore instead of building them secondhand from chado, provided that we can establish the relevant conventions for encoding DataSource/DataSet metadata in the appropriate places (aka READMEs) in the DataStore. Worth some further discussion with the DataStore curators?

sammyjava commented 6 years ago

@adf-ncgr as it stands right now, the stock GFF and FASTA loaders do a decent job of setting the DataSource and DataSet for the imported data. Take a look at the shokin-webapps BeanMine: it's got the bare bones there, and those can be filled out using Vivek's technique or by manually adding some fields to project.xml. As mentioned at the top, I've totally ignored this with the chado and other file loaders.

sammyjava commented 4 years ago

The new LIS datastore-based loaders ALL populate both DataSet and DataSource. So this issue is "solved" in the new loaders, which will eventually replace the chado and custom-file loaders.

legumeinfo / legumemine

Data Sources #15