mrtrkmn / orbi

This repository keeps the files for the IDP at the Dr. Theo Schöller Chair of Technology and Innovation Management up to date.
https://orbi.mrturkmen.com

Crawl Data

This is a simple crawler that currently collects company and patent related data from two websites: the Orbis database and sec.gov.

How to run

./orbi contains the main script which is used to run the crawler. It contains two different classes to crawl data from the websites.

The whole process is automated with Selenium and ChromeDriver.

On Local Dev Machine

On GitHub Actions the script reads its settings from environment variables, so the same variables must be set on your local machine. Passing all of them on the command line is tedious, so the script can load them from a config file instead. See the sample config file for the expected format.
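To illustrate the idea of loading environment variables from a config file, here is a minimal sketch. It assumes the config has already been parsed into a flat dictionary (for example from config.yaml); the helper name `apply_config` and the keys `orbis_user` and `data_dir` are hypothetical and not taken from the repository. Letting values already set in the shell take precedence is also an assumption, not confirmed behaviour of the script.

```python
import os
from typing import Mapping, MutableMapping


def apply_config(config: Mapping[str, object],
                 environ: MutableMapping[str, str] = os.environ) -> list:
    """Copy config entries into the environment without overriding existing values.

    Returns the names of the variables that were newly set.
    """
    newly_set = []
    for key, value in config.items():
        name = str(key).upper()
        if name not in environ:         # variables already set in the shell win
            environ[name] = str(value)  # environment values are always strings
            newly_set.append(name)
    return newly_set


# Hypothetical config content, as it might look after parsing the config file.
example_config = {"orbis_user": "jane@example.org", "data_dir": "./data"}
example_env = {"DATA_DIR": "/tmp/data"}  # pretend this was already exported

print(apply_config(example_config, example_env))
print(example_env)
```

Passing a plain dictionary as `environ` keeps the sketch free of side effects; calling it without that argument would write to the real process environment.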

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Orbi (batch search on orbis database)

This part explains how to run it on a local machine. To run it remotely, see the On Remote section.

$ LOCAL_DEV=True CONFIG_PATH=./config/config.yaml CHECK_ON_SEC=False python orbi/orbi.py

Make sure that CONFIG_PATH points to your config file.
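The command above passes `LOCAL_DEV`, `CONFIG_PATH` and `CHECK_ON_SEC` as environment variables. How orbi/orbi.py interprets them is not shown here, so the following is only a sketch of one common approach; the helper names `env_flag` and `resolve_config_path` are hypothetical. It highlights a frequent pitfall: the string "False" is truthy in Python, so flags need explicit parsing.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=os.environ):
    """Interpret an environment variable such as LOCAL_DEV=True as a boolean."""
    raw = environ.get(name)
    if raw is None:
        return default
    # bool("False") would be True, so compare against known truthy spellings.
    return raw.strip().lower() in _TRUE_VALUES


def resolve_config_path(environ=os.environ):
    """Return CONFIG_PATH when LOCAL_DEV is enabled, otherwise None."""
    if not env_flag("LOCAL_DEV", environ=environ):
        return None
    path = environ.get("CONFIG_PATH")
    if not path:
        raise ValueError("LOCAL_DEV is set but CONFIG_PATH is missing")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"CONFIG_PATH does not point to a file: {path}")
    return path


sample_env = {"LOCAL_DEV": "True", "CHECK_ON_SEC": "False"}
print(env_flag("LOCAL_DEV", environ=sample_env))
print(env_flag("CHECK_ON_SEC", environ=sample_env))
```

Failing early with a clear error when the path is wrong is a design choice of this sketch; it avoids starting a browser session only to discover the missing config later.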

Crawl (scraping data from sec.gov website)

$ python orbi/crawl.py 

  Example usage:

    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee  # searching over licensee information 
    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --no-licensee # searching over licensor information 

Example call for licensee field:

python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee 
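The flags in the example calls could be declared with `argparse` roughly as follows. This is a sketch and not the parser from orbi/crawl.py: the help texts, the `required` settings and the default of `--licensee` are assumptions. `argparse.BooleanOptionalAction` (Python 3.9+) creates the `--licensee` / `--no-licensee` pair from a single declaration.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags used in the example calls."""
    parser = argparse.ArgumentParser(
        prog="crawl.py",
        description="Scrape company facts from sec.gov (illustrative sketch).",
    )
    parser.add_argument("--source_file", required=True,
                        help="input spreadsheet, e.g. sample_data.xlsx")
    parser.add_argument("--output_file", required=True,
                        help="CSV file to write, e.g. company_facts.csv")
    # Generates both --licensee and --no-licensee; the default is an assumption.
    parser.add_argument("--licensee", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="search licensee (default) or licensor information")
    return parser


args = build_parser().parse_args([
    "--source_file", "sample_data.xlsx",
    "--output_file", "company_facts.csv",
    "--no-licensee",
])
print(args.source_file, args.output_file, args.licensee)
```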

On Remote

To run the crawler classes separately, check out the commented code in the ./orbi/crawl.py file.

Specifically, this line: ./orbi/orbi.py#494

Automation of Orbis database access and batch search on Orbis database

Files produced by the Orbi class
orbis_aggregated_data_{timestamp}.csv : example --> orbis_aggregated_data_13_01_2023.csv
orbis_aggregated_data_{timestamp}.xlsx : example --> orbis_aggregated_data_13_01_2023.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx : example --> orbis_aggregated_data_licensee_14_01_2023.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx : example --> orbis_aggregated_data_licensor_14_01_2023.xlsx
orbis_data_licensee_{timestamp}.csv : example --> orbis_data_licensee_14_01_2023.csv
orbis_data_licensee_{timestamp}.xlsx : example --> orbis_data_licensee_14_01_2023.xlsx
orbis_data_licensee_guo_{timestamp}.csv : example --> orbis_data_licensee_guo_14_01_2023.csv
orbis_data_licensee_guo_{timestamp}.xlsx : example --> orbis_data_licensee_guo_14_01_2023.xlsx
orbis_data_licensee_ish_{timestamp}.csv : example --> orbis_data_licensee_ish_14_01_2023.csv
orbis_data_licensee_ish_{timestamp}.xlsx : example --> orbis_data_licensee_ish_14_01_2023.xlsx
orbis_data_licensor_{timestamp}.csv  : example --> orbis_data_licensor_14_01_2023.csv
orbis_data_licensor_{timestamp}.xlsx : example --> orbis_data_licensor_14_01_2023.xlsx
orbis_data_licensor_guo_{timestamp}.csv : example --> orbis_data_licensor_guo_14_01_2023.csv
orbis_data_licensor_guo_{timestamp}.xlsx : example --> orbis_data_licensor_guo_14_01_2023.xlsx
orbis_data_licensor_ish_{timestamp}.csv : example --> orbis_data_licensor_ish_14_01_2023.csv
orbis_data_licensor_ish_{timestamp}.xlsx : example --> orbis_data_licensor_ish_14_01_2023.xlsx
- sample_data.xlsx
Files produced by the Crawler class
orbis_aggregated_data_{timestamp}.csv 
orbis_data_licensee_{timestamp}.csv
orbis_data_licensee_guo_{timestamp}.csv
orbis_data_licensee_ish_{timestamp}.csv
orbis_data_licensor_{timestamp}.csv
orbis_data_licensor_guo_{timestamp}.csv
orbis_data_licensor_ish_{timestamp}.csv
XLSX files produced by the Orbi class - END RESULT -
orbis_aggregated_data_{timestamp}.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx
orbis_data_licensee_{timestamp}.xlsx
orbis_data_licensee_guo_{timestamp}.xlsx
orbis_data_licensee_ish_{timestamp}.xlsx
orbis_data_licensor_{timestamp}.xlsx
orbis_data_licensor_guo_{timestamp}.xlsx
orbis_data_licensor_ish_{timestamp}.xlsx
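The naming scheme in these lists can be reproduced with a few lines of Python. The sketch below assumes that `{timestamp}` is the run date formatted as day_month_year, which matches examples such as `14_01_2023`; the function name `orbis_output_names` is hypothetical. It covers the seven-file pattern shared by the CSV and XLSX lists; the two additional per-party aggregated XLSX files are left out.

```python
from datetime import date
from itertools import product


def orbis_output_names(run_date, extension="csv"):
    """Return the seven per-run file names for the given date and extension."""
    stamp = run_date.strftime("%d_%m_%Y")  # e.g. 14_01_2023
    names = [f"orbis_aggregated_data_{stamp}.{extension}"]
    # licensee/licensor, each as a plain, _guo and _ish variant
    for party, suffix in product(("licensee", "licensor"), ("", "_guo", "_ish")):
        names.append(f"orbis_data_{party}{suffix}_{stamp}.{extension}")
    return names


for name in orbis_output_names(date(2023, 1, 14)):
    print(name)
```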

Slack Integration

Currently, the results of each GitHub Actions run are uploaded to AWS S3 and made accessible through a link sent to a private Slack channel. The files can be downloaded as described in that channel.
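How the repository posts the link to Slack is not documented here. As one possible illustration, the sketch below builds a request for a Slack incoming webhook, which accepts a JSON body with a `text` field. The webhook URL, the download URL, the message wording and the function name `build_slack_notification` are all placeholders, and the S3 upload itself is not covered.

```python
import json
import urllib.request


def build_slack_notification(webhook_url, download_url, run_label):
    """Prepare (but do not send) a Slack incoming-webhook request."""
    payload = {
        "text": f"Orbi run {run_label} finished. Download the results: {download_url}"
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_slack_notification(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder webhook URL
    "https://example.com/results.zip",                  # placeholder download link
    "14_01_2023",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
print(request.get_method(), json.loads(request.data)["text"])
```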

Run orbi from Slack

Orbi can be triggered on GitHub from Slack if you are a member of the tum-tim.slack.com workspace.

When any user types the following command in the Slack message field and presses 'Enter', Orbi starts the process on GitHub:

/run-orbis-crawler 

You will receive a result as shown below from Slack.


how-to-run-orbi-from-slack


Once the run is initialized, you will receive a message in the #idp-data-c channel on Slack similar to the following:


Initial Notification


When the run completes successfully, you will receive a new notification with a link to the data, similar to the following:

Success notification


If the process fails, you will receive a similar notification, as shown below:

Error notification

Main Workflow

Besides the main workflow shown below, this repository can be used in other ways.

The workflow is subject to change over time.


Batch Search Flow Chart

The following flow chart shows the process of batch search done by Orbi.

Flowchart of the batch search functionality of Orbi.