legumeinfo / legumemine

An InterMine which contains multiple legumes
GNU Lesser General Public License v3.0

Data Sources #15

Closed: sammyjava closed this issue 4 years ago

sammyjava commented 6 years ago

It's HIGH TIME to annotate all the data with the data source it came from. START IMPLEMENTING THIS.

sammyjava commented 6 years ago

A basic page is in place. More can be done on this topic, so I'm leaving it open.

vivekkrish commented 6 years ago

Hi @sammyjava (and @adf-ncgr) -

Both ThaleMine and MedicMine use InterMine's built-in provenance classes, "DataSource" and "DataSet", to annotate the data being loaded into the mines.

For this purpose, we maintain a tab-delimited file called YOURMINE/integrate/datasets.txt that contains all relevant information (i.e. source/set name, description, URL, publication, version, release date).
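As a rough illustration of what such a file might hold, here is a minimal Python sketch that reads tab-delimited provenance rows. The column order, the `#` comment convention, and the names `COLUMNS`, `DataSetRecord` and `read_datasets` are all assumptions made for this example; the actual layout expected by the script may differ.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical column order; the real datasets.txt layout may differ.
COLUMNS = ["source", "dataset", "description", "url",
           "publication", "version", "release_date"]


@dataclass
class DataSetRecord:
    source: str
    dataset: str
    description: str
    url: str
    publication: str
    version: str
    release_date: str


def read_datasets(handle):
    """Parse tab-delimited provenance rows, skipping blank lines and '#' comments."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Pad short rows so missing trailing fields become empty strings.
        padded = (row + [""] * len(COLUMNS))[:len(COLUMNS)]
        records.append(DataSetRecord(*padded))
    return records


# Made-up sample content, for illustration only.
SAMPLE = (
    "# source\tdataset\tdescription\turl\tpublication\tversion\trelease_date\n"
    "ExampleSource\tExample genome annotation\tGene models\t"
    "http://example.org\t12345678\t1.0\t2018-01-01\n"
)

records = read_datasets(io.StringIO(SAMPLE))
print(records[0].source, "/", records[0].dataset)
```

Keeping one row per data set, with the source name repeated on each row, makes it straightforward to group several data sets under a single data source later.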

This file is processed by a Perl script, bio/scripts/make-datasets-xml.pl, which is invoked like so:

    perl bio/scripts/make-datasets-xml.pl \
        YOURMINE/integrate/datasets.txt bio/core/core.xml \
        > YOURMINE/integrate/datasets.xml

The output is an XMLised version of the provenance info (based on the intermine-items-xml format), which is loaded as a "source" (via configuration in the project.xml).
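To give a feel for the shape of that output, the following Python sketch builds a small items-style XML document with one DataSource item and one DataSet item that references it. This is not the Perl script's logic: the element and attribute layout (`item`, `attribute`, `reference`, `ref_id`), the item identifiers, and the helper names `add_item` and `build_items` are assumptions for illustration, and the sample values are made up.

```python
import xml.etree.ElementTree as ET


def add_item(root, item_id, class_name, attributes, references=None):
    """Append one <item> with its attribute and reference children."""
    item = ET.SubElement(
        root, "item", {"id": item_id, "class": class_name, "implements": ""}
    )
    for name, value in attributes.items():
        ET.SubElement(item, "attribute", {"name": name, "value": value})
    for name, ref_id in (references or {}).items():
        ET.SubElement(item, "reference", {"name": name, "ref_id": ref_id})
    return item


def build_items(source_name, dataset_name, description, url, version):
    """Build an <items> tree: one DataSource plus one DataSet pointing at it."""
    root = ET.Element("items")
    # Item ids are arbitrary labels here; the DataSet refers back to "0_1".
    add_item(root, "0_1", "DataSource", {"name": source_name, "url": url})
    add_item(
        root,
        "0_2",
        "DataSet",
        {"name": dataset_name, "description": description, "version": version},
        {"dataSource": "0_1"},
    )
    return root


root = build_items(
    "ExampleSource", "Example genome annotation",
    "Gene models", "http://example.org", "1.0",
)
print(ET.tostring(root, encoding="unicode"))
```

The point of the sketch is the relationship: each DataSet carries a reference to its DataSource, so the mine can trace loaded data back to where it came from.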

Here is what happens when this script is executed:

Once this information is populated into the mine, it is exposed via the "Data Sources" page. Example: http://medicmine.jcvi.org/medicmine/dataCategories.do (code: webapp/dataCategories.jsp)

I'm assuming that in your case, you can amend your chado-to-intermine loaders to also load the corresponding provenance information, and then construct a dynamic data sources page.

adf-ncgr commented 6 years ago

Thanks @vivekkrish! I probably don't grasp all the details, but offhand it seems like what you're describing could fit in naturally with the plan @sammyjava has to build the mines from the files in the DataStore instead of building them secondhand from chado. That would require us to establish the relevant conventions for encoding DataSource/DataSet metadata into the appropriate places (aka READMEs) in the DataStore. Worth some further discussion with DataStore curators?

sammyjava commented 6 years ago

@adf-ncgr as it stands right now, the stock GFF and FASTA loaders do a decent job of setting the DataSource and DataSet for the imported data. Take a look at the shokin-webapps BeanMine: it's got the bare bones there, and those can be filled out as per Vivek's technique or by manually adding some fields to project.xml. As mentioned at the top, I've totally ignored this with the chado and other file loaders.
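One way to see which sources still lack provenance fields is to scan project.xml for them. The Python sketch below assumes the relevant properties end in `.dataSourceName` and `.dataSetTitle` (for example `gff3.dataSourceName` or `fasta.dataSetTitle`); those names, the sample project.xml, and the function `sources_missing_provenance` are assumptions for illustration and should be checked against the loaders actually in use.

```python
import xml.etree.ElementTree as ET

# Assumed property suffixes that carry provenance; verify against your loaders.
PROVENANCE_SUFFIXES = ("dataSourceName", "dataSetTitle")


def sources_missing_provenance(project_xml_text):
    """Map each source name to the provenance suffixes it does not define."""
    root = ET.fromstring(project_xml_text)
    missing = {}
    for source in root.iter("source"):
        names = [prop.get("name", "") for prop in source.findall("property")]
        absent = [
            suffix for suffix in PROVENANCE_SUFFIXES
            if not any(name.endswith("." + suffix) for name in names)
        ]
        if absent:
            missing[source.get("name")] = absent
    return missing


# Made-up project.xml fragment, for illustration only.
SAMPLE = """<project type="bio">
  <sources>
    <source name="example-gff" type="gff">
      <property name="gff3.dataSourceName" value="Example Source"/>
      <property name="gff3.dataSetTitle" value="Example gene models"/>
    </source>
    <source name="example-fasta" type="fasta">
      <property name="fasta.dataSourceName" value="Example Source"/>
    </source>
  </sources>
</project>"""

print(sources_missing_provenance(SAMPLE))
```

In this sample, only the FASTA source is reported, because it sets a data source name but no data set title.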

sammyjava commented 4 years ago

The new LIS datastore-based loaders ALL populate both DataSet and DataSource. So this issue is "solved" in the new loaders, which will eventually replace the chado and custom-file loaders.