This is the AddLinks component of the Release system.
AddLinks is a program whose purpose is to create links to external resources from Reactome.
At a high level, AddLinks runs in four phases:
The first phase retrieves the data that will be used to create links to external resources.
Typically, the data files are mapping files that map an identifier from one external resource to a different identifier in a different external resource.
File retrievers obtain their data in different ways: some download a text file from a URL, some download from an FTP server, and some make calls to a web service and then store the results in a file on disk.
You can see a list of file retrievers here.
Some data retrievers have slightly different configurations and are grouped together in separate files. Ensembl data retrievers are here and UniProt file retrievers are configured here.
Many of the files that are downloaded need some processing to prepare the data so that it can easily be used by AddLinks. File processing may be simple, such as extracting two columns from a TSV file. Other file processors are more complex and may use XSL transformations to turn an XML document into a format that is more usable.
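As an illustration of the simple case mentioned above (extracting two columns from a TSV file), here is a minimal sketch. It is not the AddLinks implementation: the class name, the column indexes, and the sample identifiers are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TsvColumnExtractor {
    /**
     * Returns the values of two columns (zero-based indexes) from each TSV line.
     * Blank lines, lines starting with '#', and lines with too few columns are skipped.
     */
    public static List<String[]> extract(List<String> lines, int keyCol, int valueCol) {
        List<String[]> pairs = new ArrayList<>();
        int needed = Math.max(keyCol, valueCol) + 1;
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("#")) {
                continue;
            }
            // The -1 limit keeps trailing empty fields so column positions stay stable.
            String[] fields = line.split("\t", -1);
            if (fields.length < needed) {
                continue;
            }
            pairs.add(new String[] { fields[keyCol].trim(), fields[valueCol].trim() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample data; real mapping files will have their own layout.
        List<String> lines = List.of(
            "# source_id\tname\ttarget_id",
            "P12345\tExample protein\tX-0001",
            "malformed line");
        for (String[] pair : extract(lines, 0, 2)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Only the well-formed data line yields a pair; the comment line and the malformed line are ignored.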
You can see a list of file processors here.
The reference creators will use the data that has been prepared in Data file processing to create new references in the Reactome database.
The list of reference creators can be seen here.
After the references have been created, AddLinks will check links to see if they are OK.
The list of reference databases to check can be seen here. Look for the Spring bean whose name is "referenceDatabasesToLinkCheck".
The more detailed process of how AddLinks runs looks like this:
Populate caches.
AddLinks will build a cache of data from the main database that it is connected to. This is done to make data lookups faster.
Create reference databases.
Some of the reference databases are only populated by the AddLinks process, so the relevant ReferenceDatabase objects need to be created before anything else can be done. A list of the reference databases that AddLinks will create can be found here. Please be aware that for some databases, species-specific ReferenceDatabase objects will be created. For example, this applies to KEGG and ENSEMBL. See: src/main/java/org/reactome/addlinks/kegg/KEGGReferenceDatabaseGenerator.java and src/main/java/org/reactome/addlinks/ensembl/EnsemblReferenceDatabaseGenerator.java.
Run "before" report
A query will be executed to produce a report that shows how many external resources of which types exist for which databases.
Execute data retrievers.
The data retrievers will be executed. The list of data retrievers to execute comes from the filter list in the application-context.xml file, and must be specified in the bean named "fileRetrieverFilter".
In the example below, only the file retrievers named "HGNC", "OrthologsFromZinc", and "OrthologsCSVFromZinc" will be executed. No other retrievers will be executed. In this way, you can make AddLinks run as many or as few of the defined file retrievers as you need.
<util:list id="fileRetrieverFilter" value-type="java.lang.String">
    <value>HGNC</value>
    <value>OrthologsFromZinc</value>
    <value>OrthologsCSVFromZinc</value>
</util:list>
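Execute file processors.
As with the file retrievers, the file processors to execute are listed in a filter list in application-context.xml, in the bean named "fileProcessorFilter".
<util:list id="fileProcessorFilter" value-type="java.lang.String">
    <value>HGNCProcessor</value>
    <value>zincOrthologFileProcessor</value>
    <value>FlyBaseFileProcessor</value>
</util:list>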
Execute Reference creators. As with file retrievers and file processors, a list can be specified in application-context.xml to define which reference creators will execute. Note that the Data Retrieval phase and the Data Processing phase can be executed completely independently. You could run AddLinks with only the file retrievers (empty lists for everything else) and then run it again with only the file processors (empty lists for everything else). You cannot run the Reference Creators independently - they operate on the in-memory output of the file processors.
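The filter lists described above can be thought of as a simple name-based selection over everything that is configured. The sketch below illustrates that idea only; it is not AddLinks's actual code, and the class name, the select method, and the "SomeOtherRetriever" entry are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterSketch {
    /** Returns the configured names that also appear in the filter list, in configuration order. */
    public static List<String> select(Collection<String> configuredNames, List<String> filter) {
        List<String> selected = new ArrayList<>();
        for (String name : configuredNames) {
            if (filter.contains(name)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Stand-ins for configured file retrievers; each Runnable represents one retriever.
        Map<String, Runnable> retrievers = new LinkedHashMap<>();
        retrievers.put("HGNC", () -> System.out.println("running HGNC"));
        retrievers.put("OrthologsFromZinc", () -> System.out.println("running OrthologsFromZinc"));
        retrievers.put("SomeOtherRetriever", () -> System.out.println("running SomeOtherRetriever")); // hypothetical

        // Equivalent in spirit to the values of the "fileRetrieverFilter" bean.
        List<String> filter = List.of("HGNC", "OrthologsFromZinc");

        for (String name : select(retrievers.keySet(), filter)) {
            retrievers.get(name).run();
        }
    }
}
```

With an empty filter list nothing is selected, which matches the idea of running only one phase by leaving the other lists empty.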
After references are created, a report is run to show the current counts of external resources in the different databases. Additionally, a difference report will also be generated and saved to a file (named with a datestring, like this: "diffReport
Purging unused ReferenceDatabases
ReferenceDatabase objects are created at the very beginning of AddLinks. At that point, it is not yet known whether all of those ReferenceDatabase objects will have references that make use of them. After all of the references have been created, any ReferenceDatabase objects that are unused (no external resources reference them) are removed from the database.
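Conceptually, the purge is a set difference: everything that was created, minus everything that ended up being referenced. The following is a minimal sketch of that logic using plain strings in place of ReferenceDatabase objects. The class and method names are assumptions, as is the "UnusedDB" entry, and the real purge works against the Reactome database rather than in-memory sets.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PurgeSketch {
    /** Returns the names of reference databases that no created reference points to. */
    public static Set<String> findUnused(Set<String> referenceDatabases,
                                         Collection<String> databasesUsedByReferences) {
        Set<String> unused = new TreeSet<>(referenceDatabases);
        unused.removeAll(databasesUsedByReferences);
        return unused;
    }

    public static void main(String[] args) {
        // "UnusedDB" is a made-up name for a database that received no references.
        Set<String> created = Set.of("KEGG", "ENSEMBL", "UnusedDB");
        List<String> used = List.of("KEGG", "ENSEMBL", "KEGG");
        System.out.println("Would be purged: " + findUnused(created, used));
    }
}
```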
Link-checking
AddLinks will attempt to check the links from the references it created to ensure that they are all OK. The list of ReferenceDatabases to perform link-checking on can be configured in the application-context.xml file. Some databases should not have link-checking performed as they will not provide the correct response. Some websites seem to return a 403 error code if they are not accessed with a web browser. Some load their content via JavaScript so it is impossible to verify the links without actually executing the JavaScript.
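To make the caveats above concrete, here is a hedged sketch of a status-code-based link check. It is not the AddLinks link checker: the class, the three-way result, and the choice to treat 403 as inconclusive are assumptions for illustration. It uses java.net.http.HttpClient, which requires Java 11 or later.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class LinkCheckSketch {
    public enum Result { OK, INCONCLUSIVE, BROKEN }

    /** Classifies an HTTP status code. 403 is treated as inconclusive because some sites reject non-browser clients. */
    public static Result classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Result.OK;
        }
        if (statusCode == 403) {
            return Result.INCONCLUSIVE;
        }
        return Result.BROKEN;
    }

    /** Fetches the URI (following redirects) and classifies the final status code. */
    public static Result check(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
            .timeout(Duration.ofSeconds(30))
            .GET()
            .build();
        // The body is discarded: this check looks only at the status code.
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        return classify(response.statusCode());
    }

    public static void main(String[] args) {
        // No network access here: only the classification rule is demonstrated.
        for (int code : new int[] { 200, 403, 404 }) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

A check like this cannot validate pages whose content is loaded by JavaScript, which is one reason some databases are left out of the link-checking list.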
AddLinks gets configuration values from Spring XML files. There are also some properties files that contain configuration settings.
By default, the configuration files are:
application-context.xml - This file contains the main configuration. It contains
addlinks.properties - This file contains some property values, some of which will be used by the XML configuration.
db.properties - This file contains database connection configuration information.
database.user
.auth.properties - This file will contain usernames and passwords for other sites that you will connect to.
logging.properties - This file contains basic settings to configure logging. Log messages are written to separate logs: retrievers for data retrievers, file-processors for data processors, and refCreators for reference creators. More generic log messages may go to addlinks.log. Logs will also be archived in subdirectories named with the date.
log4j2.xml - This is the log4j2 configuration file. It is strongly recommended not to touch this file unless you have a good understanding of log4j2 configuration.
Building AddLinks should be very simple. You should be able to perform these steps:
$ git clone https://github.com/reactome/AddLinks.git
$ cd AddLinks
$ mvn clean package -DskipTests=true
If you want to build and execute tests, make sure you have a Reactome database that AddLinks can connect to, and configure src/test/resources/db.properties with the correct values.
Running AddLinks can be done like this:
java -cp "$(pwd)/resources" \
-Dconfig.location=$(pwd)/resources/addlinks.properties \
-Dlog4j.configurationFile=$(pwd)/resources/log4j2.xml \
-jar AddLinks.jar file://$(pwd)/resources/application-context.xml
You will need to execute this command from a directory that has a subdirectory named "resources". That subdirectory must contain all of the configuration files described above, including the application properties file (set with the -Dconfig.location VM option) and the logging configuration file (set with the -Dlog4j.configurationFile VM option). The AddLinks application itself takes one argument: the path to the Spring context file (application-context.xml).
Some notes on other parts of the AddLinks system. Most of these notes describe exceptions to how the rest of the system is built.
Most of the data retrievers are designed to download a single file, or submit queries to a webservice to get data. Some of them work a little bit differently:
Getting cross-references from ENSEMBL requires first doing a batch mapping from ENSP to ENST, then a batch mapping from ENST to ENSG, and finally individual cross-reference lookups on each ENSG identifier to get identifiers in other databases. The batch lookups require a specific species as an input, so getting data for ENSEMBL takes a few steps.
To get data from KEGG, AddLinks must first get the UniProt-to-KEGG mappings. KEGG is then queried using these mapped values to get details for each of the KEGG identifiers. KEGG queries are species-specific.
Retrieving data from UniProt is a fairly simple web-service call. The difference with UniProt is that there will be two files: a .tab file with the mappings from UniProt to some other database, and a .not file containing all of the identifiers that the UniProt web service could not map.
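The two ENSEMBL batch-mapping steps described above amount to composing two lookup tables (ENSP to ENST, then ENST to ENSG). Below is a minimal sketch of that composition. It assumes the batch results are already available as in-memory maps and uses made-up placeholder identifiers; the web-service calls themselves and the per-species handling are not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnsemblChainSketch {
    /**
     * Composes protein-to-transcript and transcript-to-gene mappings into a
     * protein-to-gene mapping. Proteins whose transcript has no gene mapping are dropped.
     */
    public static Map<String, String> chain(Map<String, String> proteinToTranscript,
                                            Map<String, String> transcriptToGene) {
        Map<String, String> proteinToGene = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : proteinToTranscript.entrySet()) {
            String gene = transcriptToGene.get(entry.getValue());
            if (gene != null) {
                proteinToGene.put(entry.getKey(), gene);
            }
        }
        return proteinToGene;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real ENSEMBL accessions.
        Map<String, String> enspToEnst = Map.of("ENSP_A", "ENST_A", "ENSP_B", "ENST_B");
        Map<String, String> enstToEnsg = Map.of("ENST_A", "ENSG_A");
        System.out.println(chain(enspToEnst, enstToEnsg));
    }
}
```

Each resulting ENSG identifier would then be the input to the individual cross-reference lookups.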
Most file processors operate on text files, usually tab or comma delimited. Some file processors operate on XML. In these cases, there is usually one or more XSL files that is used to transform the XML into a much simpler structure (usually a CSV or TSV). This is done for ENSEMBL, Orphanet, and HMDB.
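The XML-to-TSV approach described above can be sketched with the JDK's built-in XSLT support (javax.xml.transform). The stylesheet and the XML structure here are invented for the example and are not taken from the AddLinks XSL files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslToTsvSketch {
    // Hypothetical stylesheet: writes "id<TAB>xref<NEWLINE>" for every <entry> element.
    public static final String XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:for-each select=\"//entry\">"
      + "<xsl:value-of select=\"@id\"/><xsl:text>&#9;</xsl:text>"
      + "<xsl:value-of select=\"xref\"/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given XSL stylesheet to the XML document and returns the text output. */
    public static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Invented input document with two entries.
        String xml = "<entries>"
            + "<entry id=\"A1\"><xref>X1</xref></entry>"
            + "<entry id=\"A2\"><xref>X2</xref></entry>"
            + "</entries>";
        System.out.print(transform(xml, XSL));
    }
}
```

The output is one tab-separated line per entry, which a simple text-based file processor can then consume.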
Some file processors operate on file globs (file name patterns). These include file processors for ENSEMBL, KEGG, UniProt, and OMIM. Usually, this is done because there are multiple input files, often distinguished by the target database of the mapping, a species ID, or both.
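As a rough illustration of glob-based file selection, the sketch below filters a list of file names with java.nio's glob matcher. The file-naming scheme (a prefix, a species ID, and a target database) is an assumption chosen to mirror the description above, not the naming AddLinks actually uses.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;

public class GlobSketch {
    /** Returns the file names that match the given glob pattern, preserving input order. */
    public static List<String> matching(List<String> fileNames, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String name : fileNames) {
            if (matcher.matches(Path.of(name))) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical file names: <source>_<speciesId>_<targetDatabase>.tab
        List<String> files = List.of(
            "uniprot_9606_KEGG.tab",
            "uniprot_10090_KEGG.tab",
            "uniprot_9606_OMIM.tab",
            "uniprot_9606_KEGG.not");
        // Select the KEGG mapping files for every species.
        System.out.println(matching(files, "uniprot_*_KEGG.tab"));
    }
}
```

To match files on disk rather than a list of names, the same pattern could be passed to Files.newDirectoryStream(directory, glob).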
Some of the code that creates references behaves differently than most of the rest.
The code that creates ZINC orthologs first performs a query to the ZINC website to see if the identifier has any content for a specific type (biogenic, FDA approved, etc.).
The OMIM Reference Creator is a UniProt-mapped reference creator, but it uses its own OMIM file processor to filter the UniProt list with a list from OMIM before creating references.
The Reference Creators for these databases are all NCBI-based. When the ENSEMBL or UniProt reference creators create references with NCBI identifiers, these other reference creators are also automatically executed. Additionally, for CTD there is a separate file processor and reference creator which filter based on a CTD file.