reactome / add-links

This is the AddLinks component of the Release system
0 stars 1 forks source link
data-release-pipeline mysql-release release

AddLinks

This is the AddLinks component of the Release system

TOC

Overview

AddLinks is a program whose purpose is to create links to external resources from Reactome.

High-level

At a high-level, AddLinks runs in four phases:

  1. Data file retrieval
  2. Data file processing
  3. Reference creation
  4. Link-checking

Data file retrieval

The phase will retrieve data that will be used to create links to external resources.

Typically, the data files are mapping files that map an identifier from one external resource to a different identifier in a different external resource.

Some file retrievers download a text file from a URL. Some will download from an FTP server. Some will make calls to a webservice and then store the results in a file on disc.

You can see a list of file retrievers here.

Some data retrievers have slightly different configurations and are grouped together in separate files. Ensembl data retrievers are here and uniprot file retrievers are configerd here.

Data file processing

Many of the files that are downloaded need some processing to prepare the data in such a way that it can easily be used by AddLinks. File procssing may be simple, such as extracting two columns from a TSV file. Other file processors are more complex and may use XSL transformations to turn an XML document into a format that is more usable.

You can see a list of file processors here.

Reference Creation

The reference creators will use the data that has been prepared in Data file processing to create new references in the Reactome database.

The list of reference creators can be seen here.

Link-checking

After the references have been created, AddLinks will check links to see if they are OK.

The list of reference databases to check can be seen here. Look for the Spring bean whose name is "referenceDatabasesToLinkCheck".

A little more detail...

The more detailed process of how AddLinks runs looks like this:

  1. Populate caches.

    AddLinks will build a cache of data from the main database that it is connected to. This is done to make data lookups faster.

  2. Create reference databases.

    Some of the reference databases are only populated by the AddLinks process, so the relevant ReferenceDatabase objects need to be created before anything else can be done. A list of the reference databases that AddLinks will create can be found here. Please be aware that for some databases, species-specific ReferenceDatabase objects will be created. For example, this applies to KEGG and ENSEMBL. See: src/main/java/org/reactome/addlinks/kegg/KEGGReferenceDatabaseGenerator.java and src/main/java/org/reactome/addlinks/ensembl/EnsemblReferenceDatabaseGenerator.java.

  3. Run "before" report

    A query will be executed to produce a report that shows how many external resources of which types exist for which databases.

  4. Execute data retrievers.

    The data retrievers will be executed. The list of data retrievers to execute comes from the filter list in the application-contect.xml file, and must be specified in the bean named "fileRetrieverFilter".

In this example below, only the file retrievers named "HGNC", "OrthologsFromZinc", "OrthologsCSVFromZinc" will be executed. No other retrievers will be executed. It is in this way that you can control AddLinks to run as many or as few of the defined file retrievers.

<util:list id="fileRetrieverFilter" value-type="java.lang.String">
        <value>HGNC</value>
        <value>OrthologsFromZinc</value>
        <value>OrthologsCSVFromZinc</value>
</util:list>
  1. Execute file processors This step can also be filtered. The filtering looks like this:
    <util:list id="fileProcessorFilter" value-type="java.lang.String">
        <value>HGNCProcessor</value>
        <value>zincOrthologFileProcessor</value>
        <value>FlyBaseFileProcessor</value>
    </util:list>
  1. Execute Reference creators. Again, like for file retrievers and file processors, a list can be specified in the application-context.xml to define which reference creators will execute. It should be noted that the Data Retrieval phase and the Data Processing phase can be executed completely independently. You could run AddLinks and only run the file retrievers (empty lists for everything else) and then run only the file processors (empty lists for everything else). You cannot run the Reference Creators indepently - they operate on the in-memory output of the file processors.

  2. After references are created, a report is run to show the current counts of external resources in the different databases. Additionally, a difference report will also be generated and saved to a file (named with a datestring, like this: "diffReport.txt").

  3. Purging unused ReferenceDatabases

    ReferenceDatabase objects are created at the very begining of AddLinks. At that point in time, it is not yet known if all of those ReferenceDatabase objects will have references that make use of them. After all of the references have been created, any ReferenceDatabase objects that are unused (not external resources reference them) are removed from the database.

  4. Link-checking

    AddLinks will attempt to check the links from the references it created to ensure that they are all OK. The list of ReferenceDatabases to perform link-checking on can be configured in the application-context.xml file. Some databases should not have link-checking performed as they will not provide the correct response. Some websites seem to return a 403 error code if they are not accessed with a web browser. Some load their content via JavaScript so it is impossible to verify the links without actually executing the JavaScript.

Configuration

AddLinks gets configuration values from Spring XML files. There are also some properties files that contain configuration settings.

XML configuration

By default, the XML files that contain configuration are:

Building & Running

Building AddLinks

Building AddLinks should be very simple. You should be able to perform these steps:

$ git clone https://github.com/reactome/AddLinks.git
$ cd AddLinks
$ mvn clean package -DskipTest=true

If you want to build and execute tests, make sure you have a Reactome database that AddLinks can connect to, and configure src/test/resources/db.properties with the correct values.

Running AddLinks

Running AddLinks can be done like this:

java -cp "$(pwd)/resources" \
    -Dconfig.location=$(pwd)/resources/addlinks.properties \
    -Dlog4j.configurationFile=$(pwd)/resources/log4j2.xml \
    -jar AddLinks.jar file://$(pwd)/resources/application-context.xml

You will need to execute this command from a directory which has a subdirectory named "resources", such that "resources" contains all of the necessary configuration files described above, as well as the application properties file (defined with the -Dconfig.location VM setting) and the logging configuration files (defined with the -Dlog4j.configuration VM setting). The AddLinks application itself takes one argument: the path to the Spring context file (application-context.xml).

Other notes

Some notes on other parts of the AddLinks system. Most of these notes describe exceptions to how the rest of the system is built.

Different data retrieval

Most of the data retrievers are designed to download a single file, or submit queries to a webservice to get data. Some of them work a little bit differently:

ENSEMBL

Getting cross-references from ENSEMBL requires first getting doing a batch mapping from ENSP to ENST, then batch mapping ENST to ENSG. Then, individual cross-reference lookups on ENSG to get other databases. The batch lookups require a specific species, as an input. So, getting data for ENSEMBL takes a few steps.

KEGG

To get data from KEGG, AddLinks must first get the UniProt-to-KEGG mappings. KEGG is then queried using these mapped values to get detailes for each of the KEGG identifiers. KEGG queries are species-specific.

UniProt

This is a pretty simple web-service call. The difference here with UniProt is that there will be a .tab file with the mappings from UniProt to some other database, and a .not file containing all of the identifiers that the UniProt web service could not map.

File processing

Reference creation

Some of the code that creates references behaves differently than most of the rest.

Zinc Orthologs

The code that creates ZINC orthologs first performs a query to the ZINC website to see if the identifier has an content for a specific type (biogenic, fda approved, etc...).

OMIM

The OMIM Reference Creator is a UniProt-mapped reference creator, but it uses its own OMIM file processor to filter the UniProt list with a list from OMIM before creating references.

CTD, Monarch, BioGPS, dbSNP

The Reference Creators for these databases are all NCBI-based. When the ENSEMBL or UniProt reference creators create references with NCBI identifiers, these other reference creators are also automatically executed. Additionally, for CTD there is a separate file processor and reference creator which filter based on a CTD file.