This project streamlines the process of converting specific PAINT IBA release data into a JSON format that can be indexed into Elasticsearch. It comprises four main components: data conversion, data loading, API setup, and front-end site development. Below, you will find a high-level overview of each component, setup instructions, and links to detailed documentation for each part.
The Data Conversion component converts data related to a specific PAINT IBA release into JSON files suitable for Elasticsearch. This includes extracting and structuring annotation data from GAF files, managing the ontology with ROBOT tools, and preparing gene lists.
Key Operations:
- Generate human_iba_annotations.json from a GAF file
- Generate full_go_annotated.json with ontology term data
- Generate human_iba_gene_info.json containing distinct gene information

For more detailed steps and usage instructions, visit the Data Conversion README.
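The core of the conversion step is parsing tab-separated GAF lines into JSON records. The sketch below is illustrative only: the field names and the exact JSON shape are assumptions, not the converter's actual schema, and only the first nine GAF columns are kept.

```python
import json

# First nine column names for a tab-separated GAF line (per the GO
# Consortium GAF spec). The JSON keys here are illustrative assumptions.
GAF_COLUMNS = ["db", "gene", "gene_symbol", "qualifier", "term",
               "reference", "evidence_code", "with_from", "aspect"]

def gaf_line_to_annotation(line):
    """Convert one GAF line into a flat annotation dict (IBA rows only)."""
    if line.startswith("!"):          # skip comment/header lines
        return None
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GAF_COLUMNS, fields[:len(GAF_COLUMNS)]))
    if record.get("evidence_code") != "IBA":
        return None                   # keep only PAINT IBA annotations
    return record

sample = ("UniProtKB\tP04637\tTP53\tinvolved_in\tGO:0006915\t"
          "PMID:21873635\tIBA\tPANTHER:PTN000586235\tP")
print(json.dumps(gaf_line_to_annotation(sample)))
```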
The Loader component manages the loading of processed data into Elasticsearch, ensuring that the data is accurately indexed and efficiently stored. This involves fetching article metadata, preprocessing annotations, and handling the actual indexing into Elasticsearch.
Setup and Operation:
Initial Setup: Install necessary Python packages and set up environment variables to configure the system properly:
cp .env-example .env # Setup environment variables
Fetching Article Metadata: Retrieve unique PMIDs from human_iba_annotations.json
and use the NCBI eUtils API to fetch article metadata needed for processing:
python3 -m src.get_articles -a ./data/test_data/sample_human_iba_annotations.json -o /download/articles.json
Note: Due to NCBI API rate limits, the script includes delays between requests, following the NCBI eUtils usage guidelines.
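The fetch step amounts to batching PMIDs into eSummary requests and pausing between them. This is a minimal sketch, not the project's get_articles script: the batch size and delay are assumptions chosen to stay under NCBI's documented ~3 requests/second limit for unauthenticated clients.

```python
import time
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def build_esummary_url(pmids):
    """Build an eSummary request URL for one batch of PMIDs."""
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "id": ",".join(pmids),
        "retmode": "json",
    })
    return f"{EUTILS}?{query}"

def fetch_all(pmids, batch_size=100, delay=0.4):
    """Fetch article metadata in batches, sleeping between requests
    to respect NCBI rate limits (batch size/delay are assumptions)."""
    results = []
    for i in range(0, len(pmids), batch_size):
        url = build_esummary_url(pmids[i:i + batch_size])
        with urllib.request.urlopen(url) as resp:
            results.append(resp.read())
        time.sleep(delay)
    return results
```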
Data Preprocessing: Before indexing, preprocess the data to replace term IDs with their metadata, associate gene IDs with gene metadata, and include article metadata. Determine the nature of each annotation (direct or via homology).
python3 -m src.clean_annotations -a Annotations_Json -t Terms_Json -art Articles_Json -g Genes_Json -o Output_of_Clean_Annotation
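Conceptually, the preprocessing step is a join: each annotation's bare term, gene, and article IDs are replaced with full metadata records. The sketch below assumes simple dict lookups and a hypothetical homology rule (evidence pointing at a PANTHER tree node); the real clean_annotations script may use different keys and logic.

```python
def clean_annotation(annotation, terms, genes, articles):
    """Return a copy of one annotation with its bare IDs replaced by
    metadata records, plus a direct-vs-homology flag."""
    enriched = dict(annotation)
    enriched["term"] = terms.get(annotation["term"], {"id": annotation["term"]})
    enriched["gene"] = genes.get(annotation["gene"], {"id": annotation["gene"]})
    enriched["articles"] = [articles[p] for p in annotation.get("pmids", [])
                            if p in articles]
    # Hypothetical rule: evidence tracing back to a PANTHER tree node
    # (rather than a specific gene) is treated as inferred via homology.
    enriched["evidence_type"] = (
        "homology" if annotation.get("with_from", "").startswith("PANTHER:")
        else "direct")
    return enriched
```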
Creating the Index: Load the cleaned data into Elasticsearch using the following script:
python3 -m src.index_es -a $clean_annotations
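Under the hood, indexing means turning each cleaned annotation into a bulk-API action. A minimal sketch, assuming an index named "annotations" (the real index name and document IDs may differ); the generated actions are the shape accepted by elasticsearch.helpers.bulk.

```python
def to_bulk_actions(annotations, index_name="annotations"):
    """Yield one Elasticsearch bulk-API action per cleaned annotation.
    Pass the generator to elasticsearch.helpers.bulk(es_client, ...)."""
    for i, ann in enumerate(annotations):
        yield {"_index": index_name, "_id": i, "_source": ann}
```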
For more detailed setup and operational instructions, visit Loader README.
The API component is powered by FastAPI and GraphQL, providing a robust interface for handling complex data interactions with Elasticsearch. It offers a high-performance, flexible API setup ideal for both development and production environments.
Setup and Operation:
Environment Setup: Begin by installing the required Python packages and setting up the necessary environment variables:
cp .env-example .env # Configure environment variables
Running the Server: Use Uvicorn to run the API server. For development, you can enable live reloading; be sure to disable it in production:
python3 -m main # Run the server, ensure reload is disabled for production
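Clients talk to the running server by POSTing a GraphQL request body. The sketch below shows the request shape only: the endpoint path and the query's field names are illustrative assumptions, not taken from the actual schema.

```python
import json
import urllib.request

def graphql_payload(query, variables=None):
    """Serialize a GraphQL request body in the standard POST format."""
    return json.dumps({"query": query, "variables": variables or {}}).encode()

# Hypothetical query: field names are illustrative, not from the real schema.
QUERY = """
query {
  annotations(geneSymbol: "TP53") {
    term { id label }
    evidenceType
  }
}
"""

def post_query(url, query):
    """POST a GraphQL query (e.g. to http://localhost:8000/graphql)."""
    req = urllib.request.Request(
        url, data=graphql_payload(query),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```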
For further details on setting up and running the API, refer to the API README.
An Angular-based front-end provides a user-friendly interface to interact with the data through web requests to the API.
Development Setup:
Front-end development guidelines and setup instructions are detailed at Site README.
To begin using this pipeline, ensure all prerequisites are installed. Then, clone this repository and follow the setup instructions in each component's detailed documentation:
git clone https://github.com/pantherdb/pango
cd pango