Introduction
After #22, elsametric should be used as a separate package, and the current repo should host only the files necessary for its development. To achieve this, the scope of the elsametric package must be determined, and to do that, its mission has to be defined first.
What is elsametric?
It began as a collection of loose ideas about how to break down raw academic publication data (such as data obtained from Scopus) into a database, which could then be queried to extract information.
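The idea of breaking a raw record down into queryable tables can be sketched as follows. This is only an illustration: it uses the standard-library sqlite3 module instead of the SQLAlchemy models the package actually uses, and the record layout, table names, and function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw record, loosely shaped like an API response; every
# field name here is an assumption made for illustration.
raw_paper = {
    "id": "paper-1",
    "title": "An example paper",
    "year": 2019,
    "authors": [
        {"id": "author-1", "name": "A. Author"},
        {"id": "author-2", "name": "B. Author"},
    ],
}


def create_schema(conn):
    """Create three small tables: papers, authors, and their link table."""
    conn.executescript(
        """
        CREATE TABLE paper (id TEXT PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE author (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE paper_author (
            paper_id TEXT REFERENCES paper(id),
            author_id TEXT REFERENCES author(id),
            PRIMARY KEY (paper_id, author_id)
        );
        """
    )


def store_paper(conn, raw):
    """Split one nested record into rows of the three tables."""
    conn.execute(
        "INSERT OR IGNORE INTO paper VALUES (?, ?, ?)",
        (raw["id"], raw["title"], raw["year"]),
    )
    for author in raw["authors"]:
        conn.execute(
            "INSERT OR IGNORE INTO author VALUES (?, ?)",
            (author["id"], author["name"]),
        )
        conn.execute(
            "INSERT OR IGNORE INTO paper_author VALUES (?, ?)",
            (raw["id"], author["id"]),
        )


def papers_per_author(conn):
    """Return (author name, paper count) pairs, sorted by name."""
    return conn.execute(
        """
        SELECT author.name, COUNT(*)
        FROM author JOIN paper_author ON author.id = paper_author.author_id
        GROUP BY author.id
        ORDER BY author.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
store_paper(conn, raw_paper)
print(papers_per_author(conn))
```

Once the nested record is split into rows, questions such as "how many papers does each author have" become a single query rather than a pass over the raw data.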
elsametric has two main parts:
a part which designs the database using SQLAlchemy (the models directory)
a part which provides processing functions that help populate the database with data gathered from various sources (the helpers directory, along with db_populate.py)
There are some other parts as well:
some (mostly visual) files on the current shape of the database (the db_design folder)
some Python scripts and Jupyter Notebooks which mostly contain old, experimental code, including:
custom.ipynb which retrieves data from the Scopus API
Tehran.py & Modares.py which attempt to crawl faculty data from the University of Tehran and Tarbiat Modares University, respectively
queries.py & queries.ipynb which were used to test different queries
These files should be moved to either the junkyard or the helper_scripts directory.
Additionally, the repo contains:
config.json, which holds the settings used to connect to the database, populate it from different sources, configure the API, ...
gsc_profile.py in the helper_scripts directory, which is used to get author metrics from Google Scholar
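As a rough illustration of how a file like config.json could drive the database connection, the sketch below turns a config section into a connection URL of the form SQLAlchemy accepts. The key names, the MySQL dialect and driver, and the build_db_url helper are all assumptions; the real config.json may be laid out differently.

```python
import json
from urllib.parse import quote_plus

# Hypothetical config contents; the real config.json may use different keys.
EXAMPLE_CONFIG = """
{
  "database": {
    "dialect": "mysql",
    "driver": "mysqlconnector",
    "user": "elsametric",
    "password": "secret",
    "host": "localhost",
    "port": 3306,
    "schema": "scopus"
  }
}
"""


def build_db_url(config):
    """Assemble a dialect+driver://user:password@host:port/schema URL."""
    db = config["database"]
    # Quote the password so characters such as '@' do not break the URL.
    password = quote_plus(db["password"])
    return (
        f"{db['dialect']}+{db['driver']}://"
        f"{db['user']}:{password}@{db['host']}:{db['port']}/{db['schema']}"
    )


config = json.loads(EXAMPLE_CONFIG)
print(build_db_url(config))
```

Keeping the connection details in one file means the population scripts and any other consumer can share them without hard-coding credentials.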
Two other files were added to create the API: main.py (the API routes) and api_queries.py (the processing functions the API relies on).
What should elsametric do (and what should it not)
elsametric is about designing and maintaining an efficient database to store academic publication data. As such, it should consist of:
the elsametric folder, which includes the SQLAlchemy model and some helper functions to process data
the db_design folder, which holds a graphical version of the SQLAlchemy model, created using MySQL Workbench
scripts to populate the database, which at the moment only includes db_populate.py
scripts to gather data from the web, including:
scripts for getting publication data from services such as Scopus & Web of Science (WOS)
crawlers for getting the profiles of faculty members
crawlers for getting author metrics (such as the h-index from Google Scholar)
Of the items mentioned above, only the elsametric directory will be installed using pip. Other scripts and files reside solely in the repo. Future releases might install them along with the elsametric folder.
Other functionality regarding the growth and maintenance of the database can be included in the future. For example, the CSV-processing functions in the shcopus repo, which can analyze CSV exports from Scopus, can be added to this repo in case of Scopus API limitations.
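A minimal sketch of what such CSV processing might look like is given below. It is not the code from the shcopus repo: the read_export function, the selected columns, and the sample rows (including the made-up EIDs) are assumptions for illustration, and a real export contains many more columns.

```python
import csv
import io

# Made-up sample in the general shape of a CSV export; the column names
# used here are assumptions and the rows are not real publications.
SAMPLE_EXPORT = """Authors,Title,Year,Cited by,EID
"Doe J., Roe R.",An example paper,2019,12,2-s2.0-0000000001
"Roe R.",Another example paper,2020,,2-s2.0-0000000002
"""


def read_export(file_obj):
    """Parse export rows into dictionaries with typed, cleaned fields."""
    papers = []
    for row in csv.DictReader(file_obj):
        papers.append(
            {
                "eid": row["EID"],
                "title": row["Title"],
                "year": int(row["Year"]),
                # An empty citation cell is treated as zero citations.
                "cited_by": int(row["Cited by"] or 0),
                "authors": [
                    name.strip()
                    for name in row["Authors"].split(",")
                    if name.strip()
                ],
            }
        )
    return papers


papers = read_export(io.StringIO(SAMPLE_EXPORT))
for paper in papers:
    print(paper["eid"], paper["year"], paper["authors"])
```

The dictionaries produced this way could be fed to the same population code that handles API responses, so a CSV export becomes a fallback data source rather than a separate pipeline.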
Yet other functionality might include ways of migrating the database, probably using Alembic, the migration tool for SQLAlchemy. Such tools would let the package avoid re-populating the entire database every time a change in the structure is needed.
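The point of a migration is that the schema changes while the stored rows stay in place. The sketch below shows that idea with the standard-library sqlite3 module rather than Alembic itself; the author table and the h_index column are hypothetical. Alembic organizes changes of this kind into versioned scripts with upgrade and downgrade functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Version 1 of a hypothetical table, already populated with data.
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO author (name) VALUES (?)",
    [("A. Author",), ("B. Author",)],
)


def upgrade(conn):
    """Add an h_index column in place; existing rows are kept."""
    conn.execute("ALTER TABLE author ADD COLUMN h_index INTEGER")


upgrade(conn)

# The rows inserted before the change are still there; the new column
# is empty (NULL) until it is filled in.
rows = conn.execute(
    "SELECT id, name, h_index FROM author ORDER BY id"
).fetchall()
print(rows)
```

Only the new column needs to be filled afterwards, which is much cheaper than rebuilding and re-populating every table.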
elsametric is not about creating and maintaining a webserver or an API. That should be the job of another repo. Hence, the files main.py and api_queries.py are to be moved out of this repository.
Any remaining script, whether Python or Jupyter Notebook, should be moved to the helper_scripts directory or, if it is not needed, to the junkyard directory. Eventually, the junkyard folder should be reviewed for any useful files and then deleted from the repo... one should travel light!