sammyjava closed this issue 4 years ago
Basic page is in place, more can be done on this topic so leaving it open.
Hi @sammyjava (and @adf-ncgr) -
Both ThaleMine and MedicMine make use of InterMine's built-in provenance classes, "DataSource" and "DataSet", to annotate the data being loaded into the mines.
For this purpose, we maintain a tab-delimited file called YOURMINE/integrate/datasets.txt that contains all relevant information (i.e. source/set name, description, URL, publication, version, release date).
This file is processed by a Perl script, bio/scripts/make-datasets-xml.pl, which is invoked like so:
perl bio/scripts/make-datasets-xml.pl \
YOURMINE/integrate/datasets.txt bio/core/core.xml \
> YOURMINE/integrate/datasets.xml
The output is an XMLised version of the provenance info (based on the intermine-items-xml format), which is loaded as a "source" (via configuration in the project.xml).
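Here is what happens when this script is executed: the two tab-delimited input lines below (one DataSource, one DataSet) are converted into the XML items that follow them.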
DataSource Panther Orthologue and paralogue relationships based on the inferred speciation and gene duplication events in the phylogenetic tree https://www.pantherdb.org Publication:23193289
DataSet Panther data set Panther orthologues from Yeast, Roundworm, Fruit Fly, Zebrafish, Human, Mouse and Rat and paralogues from Arabidopsis https://www.pantherdb.org DataSource:Panther 8.1
<item id="0_1" class="Publication" implements="">
<attribute name="pubMedId" value="23193289" />
</item>
<item id="0_2" class="DataSource" implements="">
<attribute name="name" value="Panther" />
<attribute name="description" value="Orthologue and paralogue relationships based on the inferred speciation and gene duplication events in the phylogenetic tree" />
<attribute name="url" value="https://www.pantherdb.org" />
<collection name="publications">
<reference ref_id="0_1" />
</collection>
</item>
<item id="0_3" class="DataSet" implements="">
<attribute name="version" value="8.1" />
<reference name="dataSource" ref_id="0_2" />
<attribute name="name" value="Panther data set" />
<attribute name="description" value="Panther orthologues from Yeast, Roundworm, Fruit Fly, Zebrafish, Human, Mouse and Rat and paralogues from Arabidopsis" />
<attribute name="url" value="https://www.pantherdb.org" />
</item>
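To make the mapping from input rows to items concrete, here is a minimal Python sketch of the same kind of conversion. It is not the Perl script itself and may differ from it in detail. The column layout is an assumption read off the example above (type, name, description, URL, then a `Publication:` or `DataSource:` cross-reference, then an optional version), and the `<items>` wrapper element and the `rows_to_items` name are likewise assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout (inferred from the example, not confirmed):
#   DataSource <tab> name <tab> description <tab> url <tab> Publication:<pubMedId>
#   DataSet    <tab> name <tab> description <tab> url <tab> DataSource:<name> <tab> version
SAMPLE = (
    "DataSource\tPanther\tOrthologue and paralogue relationships\t"
    "https://www.pantherdb.org\tPublication:23193289\n"
    "DataSet\tPanther data set\tPanther orthologues and paralogues\t"
    "https://www.pantherdb.org\tDataSource:Panther\t8.1\n"
)


def rows_to_items(text):
    """Convert tab-delimited provenance rows into an <items> element."""
    root = ET.Element("items")  # wrapper element is an assumption
    counter = 0
    source_ids = {}  # DataSource name -> item id, used to resolve DataSet rows

    def new_item(class_name):
        nonlocal counter
        counter += 1
        attrib = {"id": f"0_{counter}", "class": class_name, "implements": ""}
        return ET.SubElement(root, "item", attrib)

    def add_attr(item, name, value):
        ET.SubElement(item, "attribute", {"name": name, "value": value})

    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or incomplete lines
        kind, name, description, url = row[:4]
        extra = row[4:]

        if kind == "DataSource":
            pub_id = None
            if extra and extra[0].startswith("Publication:"):
                pub = new_item("Publication")
                add_attr(pub, "pubMedId", extra[0].split(":", 1)[1])
                pub_id = pub.get("id")
            item = new_item("DataSource")
            add_attr(item, "name", name)
            add_attr(item, "description", description)
            add_attr(item, "url", url)
            if pub_id is not None:
                coll = ET.SubElement(item, "collection", {"name": "publications"})
                ET.SubElement(coll, "reference", {"ref_id": pub_id})
            source_ids[name] = item.get("id")

        elif kind == "DataSet":
            item = new_item("DataSet")
            add_attr(item, "name", name)
            add_attr(item, "description", description)
            add_attr(item, "url", url)
            if extra and extra[0].startswith("DataSource:"):
                source_name = extra[0].split(":", 1)[1]
                # The DataSource row must appear before any DataSet that cites it.
                ET.SubElement(
                    item,
                    "reference",
                    {"name": "dataSource", "ref_id": source_ids[source_name]},
                )
            if len(extra) > 1:
                add_attr(item, "version", extra[1])

    return root


if __name__ == "__main__":
    print(ET.tostring(rows_to_items(SAMPLE), encoding="unicode"))
```

The sketch assigns item ids in the order items are created, so the Publication, DataSource and DataSet items come out as 0_1, 0_2 and 0_3, matching the example. A DataSet row that names an unknown DataSource raises a `KeyError` rather than emitting a dangling reference.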
Once this information is populated into the mine, it is exposed via the "Data Sources" page. Example: http://medicmine.jcvi.org/medicmine/dataCategories.do (code: webapp/dataCategories.jsp)
I'm assuming that in your case, you can amend your chado-to-intermine loaders to also load the corresponding provenance information, and then construct a dynamic data sources page.
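As a rough illustration of the provenance links such a page would display, the sketch below reads items XML like the example above and resolves each DataSet to its DataSource and publications. The real page is rendered by the webapp from the mine's database, so this only shows the shape of the data. The `<items>` wrapper, the shortened sample, and the `summarise` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the items shown earlier, wrapped in an assumed <items> root.
ITEMS_XML = """<items>
<item id="0_1" class="Publication" implements="">
<attribute name="pubMedId" value="23193289" />
</item>
<item id="0_2" class="DataSource" implements="">
<attribute name="name" value="Panther" />
<attribute name="url" value="https://www.pantherdb.org" />
<collection name="publications">
<reference ref_id="0_1" />
</collection>
</item>
<item id="0_3" class="DataSet" implements="">
<attribute name="version" value="8.1" />
<reference name="dataSource" ref_id="0_2" />
<attribute name="name" value="Panther data set" />
<attribute name="url" value="https://www.pantherdb.org" />
</item>
</items>"""


def attrs(item):
    """Return an item's <attribute> children as a name -> value dict."""
    return {a.get("name"): a.get("value") for a in item.findall("attribute")}


def summarise(xml_text):
    """List each DataSet with its DataSource name and PubMed ids."""
    root = ET.fromstring(xml_text)
    by_id = {item.get("id"): item for item in root.findall("item")}
    rows = []
    for item in root.findall("item"):
        if item.get("class") != "DataSet":
            continue
        data_set = attrs(item)
        # Follow the dataSource reference, if present, to the DataSource item.
        ref = item.find("reference[@name='dataSource']")
        source_item = by_id.get(ref.get("ref_id")) if ref is not None else None
        source = attrs(source_item) if source_item is not None else {}
        # Follow the DataSource's publications collection to Publication items.
        pubmed_ids = []
        if source_item is not None:
            path = "collection[@name='publications']/reference"
            for pub_ref in source_item.findall(path):
                pub = by_id.get(pub_ref.get("ref_id"))
                if pub is not None:
                    pubmed_ids.append(attrs(pub).get("pubMedId"))
        rows.append(
            {
                "data_set": data_set.get("name"),
                "version": data_set.get("version"),
                "data_source": source.get("name"),
                "url": data_set.get("url"),
                "pubmed_ids": pubmed_ids,
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarise(ITEMS_XML):
        print(row)
```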
Thanks @vivekkrish! I probably don't grasp all the details, but offhand it seems like what you're describing could fit in naturally with the plan @sammyjava has to build the mines from the files in the DataStore instead of building them secondhand from chado, provided that we can establish the relevant conventions for encoding DataSource/DataSet metadata in the appropriate places (aka READMEs) in the DataStore. Worth some further discussion with the DataStore curators?
@adf-ncgr as it stands right now, the stock GFF and FASTA loaders do a decent job of setting the DataSource and DataSet for the imported data. Take a look at the shokin-webapps BeanMine: it's got the bare bones there, and those can be filled out using Vivek's technique or by just manually adding some fields to project.xml. As mentioned at the top, I've totally ignored this with the chado and other file loaders.
The new LIS datastore-based loaders ALL populate both DataSet and DataSource. So this issue is "solved" in the new loaders, which will eventually replace the chado and custom-file loaders.
It's HIGH TIME to annotate all the data with the data source it came from. START IMPLEMENTING THIS.