paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.

Implement initial support for our ETL #2

Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

Our README specifies two commands for running our ETL process.

Currently, they are:

./arxiv_as_a_newspaper -run-etl-process

and

./arxiv_as_a_newspaper -write-etl-results-to-file <destination.json>

This issue can be marked as completed as soon as some non-egregious and satisfactory functional implementation of both commands comes into existence.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/dfc87e83062b694efb1f82848c372437bf66d680

This patch adds in some infrastructure for the CLI's arg parsing and stubs for the following:

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/139fcb73ea60760f30f966eaa42b97cd832d249a

This patch includes stubs resulting in NotImplementedError instances being raised for the following two cases:

This includes changes to the CLI's help function, the arg parsing sanity checks, etc.
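The list of stubbed cases is not reproduced in this thread, so as a rough illustration only, here is a minimal sketch of what argparse-based stubs could look like for the two ETL flags from the issue description. The function names `build_parser`, `run_etl_process`, `write_etl_results_to_file`, and `main` are assumptions for this sketch, not necessarily what the patch uses.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two single-dash flags named in this issue."""
    parser = argparse.ArgumentParser(prog="arxiv_as_a_newspaper")
    parser.add_argument(
        "-run-etl-process",
        dest="run_etl_process",
        action="store_true",
        help="Scrape arXiv and load the results into the DB.",
    )
    parser.add_argument(
        "-write-etl-results-to-file",
        dest="write_etl_results_to_file",
        metavar="DESTINATION_JSON",
        default=None,
        help="Write the ETL results to the given JSON file.",
    )
    return parser


def run_etl_process() -> None:
    # Stub: the real work is not implemented at this stage.
    raise NotImplementedError("-run-etl-process is not implemented yet.")


def write_etl_results_to_file(destination: str) -> None:
    # Stub: the real work is not implemented at this stage.
    raise NotImplementedError("-write-etl-results-to-file is not implemented yet.")


def main(argv: Optional[List[str]] = None) -> None:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.run_etl_process:
        run_etl_process()
    elif args.write_etl_results_to_file is not None:
        write_etl_results_to_file(args.write_etl_results_to_file)
    else:
        # Sanity check: no recognized command was given.
        parser.print_help()


if __name__ == "__main__":
    main()
```

Raising NotImplementedError from the stubs keeps the CLI surface testable before any backend exists.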

paul-tqh-nguyen commented 5 years ago

Though the latest changes are slightly outside the purview of this task, they were relevant enough, and the cost of making them small enough, that we pushed them in without waiting until we start on the remaining processes we want to support.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/8f68f16114ac9e30256423c7f884d857a177bd9e

This patch includes the initial commit of the backend utilities necessary to scrape https://arxiv.org/ for information on recent research papers.

This is a progress patch. There is a lot of cleanup still to be done on the modules added in this patch.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/1d0ac48c2db52a540fb0f71c22fa01bc4a068396

arxiv_recent_page_title_and_page_link_string_iterator is a utility in etl_utilities.py that scrapes the front page of https://arxiv.org/ for links to the "Recent Papers" page relevant to all the research fields whose papers are archived by https://arxiv.org/.

This patch adds documentation to arxiv_recent_page_title_and_page_link_string_iterator to explain exactly what it does.
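For readers without the repository at hand, the idea behind this iterator can be sketched with only the standard library. This is not the code in etl_utilities.py; the class and function names are hypothetical, and we assume that the front page links to each field with an href ending in "/recent" (the sample HTML below is made up to match that assumption).

```python
from html.parser import HTMLParser
from typing import Iterator, List, Optional, Tuple
from urllib.parse import urljoin

ARXIV_URL = "https://arxiv.org/"


class _RecentLinkParser(HTMLParser):
    """Collects (title, absolute URL) pairs for anchors whose href ends in /recent."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._current_href: Optional[str] = None
        self._current_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith("/recent"):
                self._current_href = href
                self._current_text = []

    def handle_data(self, data):
        # Only accumulate text while inside a matching anchor.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            title = "".join(self._current_text).strip()
            self.links.append((title, urljoin(ARXIV_URL, self._current_href)))
            self._current_href = None


def recent_page_title_and_link_iterator(front_page_html: str) -> Iterator[Tuple[str, str]]:
    """Yield (research field title, "recent" page URL) pairs found in the HTML."""
    parser = _RecentLinkParser()
    parser.feed(front_page_html)
    yield from parser.links


# Made-up fragment shaped like the assumed front-page markup.
SAMPLE_HTML = (
    '<li><a href="/list/astro-ph.GA/recent">Astrophysics of Galaxies</a>; '
    '<a href="/help">Help</a></li>'
)

if __name__ == "__main__":
    for title, link in recent_page_title_and_link_iterator(SAMPLE_HTML):
        print(title, link)
```

Parsing a string rather than fetching inside the function keeps the scraping logic testable offline.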

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/4637f38f04cfaf24cd734650f2418ee62cb04eb8

This patch modifies etl_utilities.py so that the functions that are not useful outside of this module are no longer accessible outside of it.

Those functions were:

We consequently modified the callers of these functions as well.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/085f3464c8951eec7e5f5a63422da1c5d75fcc6b

A lot of the functionality in the "Recent" Pages scraping utilities isn't useful outside of the public interface to our ETL utilities, so we chose to make those not accessible outside of the modules in which they are defined.

The functionality changed includes:

Due to the renames / access privilege changes, we had to accordingly change callers as well, e.g. extract_info_from_recent_page_url_as_json

This patch also includes a documentation extension on extract_info_from_recent_page_url_as_json.

We also added missing imports.

paul-tqh-nguyen commented 5 years ago

I believe the next step would be to create another directory in the utilities directory called etl_utilities and move all the functionality we have in etl_utilities.py right now into a new module located in the to-be-created extract_and_transform_utilities.py.

etl_utilities.py will be repurposed to only contain the high-level interface used by arxiv_as_a_newspaper.py.

After this is all done, we should add a new module called load_utilities.py that will contain utilities to take what's given to us by the functionality in extract_and_transform_utilities.py and load it into our DB.

No functionality in extract_and_transform_utilities.py will rely on any in load_utilities.py and vice-versa. The connection between them will be the functionality in etl_utilities.py that glues the two together.

EDIT: On second thought, perhaps the new sub-directories are unnecessary; the modularity via new libraries I think is still a good idea.
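To make the intended dependency direction concrete, here is a single-file sketch in which each function stands in for one of the proposed modules. All names and the sample record are hypothetical; the point is only that the extract/transform side and the load side never reference each other, and a thin glue function connects them.

```python
from typing import Callable, Dict, List

Paper = Dict[str, str]


# Stand-in for extract_and_transform_utilities.py: knows nothing about the DB.
def extract_and_transform() -> List[Paper]:
    return [{"title": "Example Paper", "research_field": "Example Field"}]


# Stand-in for load_utilities.py: knows nothing about scraping.
def make_loader(sink: List[Paper]) -> Callable[[List[Paper]], int]:
    def load(papers: List[Paper]) -> int:
        sink.extend(papers)
        return len(papers)

    return load


# Stand-in for etl_utilities.py: the only place where the two sides meet.
def run_etl(
    extract: Callable[[], List[Paper]],
    load: Callable[[List[Paper]], int],
) -> int:
    return load(extract())


if __name__ == "__main__":
    fake_db: List[Paper] = []
    print(run_etl(extract_and_transform, make_loader(fake_db)))
```

Passing the two halves into the glue function as arguments also makes it easy to swap in a fake loader during testing.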

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/dbc4b22552736ea053dfb35018573bb36772b3e6

It was decided in #2 that we should separate out the extraction and transformation (i.e. the https://arxiv.org/ scraping) functionality from the loading functionality (i.e. the functionality that writes to our MongoDB).

This patch takes the first step in doing that by renaming the modules.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/95f4cc48aeefd8ae7e359b83a9b23fee51b64a21

In https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/dbc4b22552736ea053dfb35018573bb36772b3e6 we decided to rename utilities/etl_utilities.py -> utilities/extract_transform_utilities.py.

That commit included the shallow renaming, but not a pervasive renaming throughout the contents of utilities/extract_transform_utilities.py.

This patch does the latter mostly via doc string and comment updates.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/4726ff5f2f26cb3b11855349113ab793a1b3c0b3

This patch contains the initial commit of utilities/load_utilities.py, which contains mostly accessors to our external MongoDB set up on Atlas.

Currently, we have support for reading from the DB.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/b1c0d914277468c65b90d58503dd535f1ae7c841

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/d19d3ad1ccdd74da47040caf2a34033420ef2403

These patches make it so that we handle bad authentication credentials (when we're attempting to read from our MongoDB) gracefully: we ask the user for credentials again until something works, and quit after 5 failed attempts.
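The retry behavior described above can be sketched without a real database. In this sketch, `AuthenticationFailed`, `connect_with_retries`, and the injected `prompt_for_credentials` and `connect` callables are all hypothetical; the real code would catch whatever exception the MongoDB driver raises on bad credentials.

```python
from typing import Callable, Optional, Tuple, TypeVar

Client = TypeVar("Client")

MAX_ATTEMPTS = 5


class AuthenticationFailed(Exception):
    """Hypothetical stand-in for the driver's authentication error."""


def connect_with_retries(
    prompt_for_credentials: Callable[[], Tuple[str, str]],
    connect: Callable[[str, str], Client],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[Client]:
    """Ask for credentials until a connection succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        username, password = prompt_for_credentials()
        try:
            return connect(username, password)
        except AuthenticationFailed:
            print(f"Authentication failed (attempt {attempt} of {max_attempts}).")
    # Give up gracefully instead of crashing with a traceback.
    return None
```

In interactive use, `prompt_for_credentials` could wrap `input` and `getpass.getpass`, while tests can pass a function that returns canned values.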

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/93bde96d9d62741b89553bd04bd081b82ed510a1

This patch commits basic functionality to write to our DB.

No sanity checking or robustness is here yet.

No testing has taken place.

This merely has one case working.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/a76f1c9f1a5059f5d9931b33e6f2cd527e4c0ca0

We intend that our ETL process goes something like this:

  1. Scrape https://arxiv.org/ for recent paper info.
  2. Process it.
  3. Clear the DB.
  4. Write out the processed scrapings to the DB.

Our ETL process works this way since we intend that our DB only contain information about recent papers.

This patch commits step 3 above.
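Steps 3 and 4 can be illustrated with an in-memory stand-in for a collection. The `InMemoryCollection` class and `replace_all_papers` function are hypothetical; the stand-in only mimics the `delete_many` and `insert_many` method names that a pymongo collection exposes, so that the sketch runs without a database.

```python
from typing import Any, Dict, List

Paper = Dict[str, Any]


class InMemoryCollection:
    """Tiny fake exposing the two collection methods used below."""

    def __init__(self) -> None:
        self.documents: List[Paper] = []

    def delete_many(self, query_filter: Dict[str, Any]) -> None:
        # Only the empty filter (match everything) is supported in this fake.
        if query_filter != {}:
            raise NotImplementedError("This fake only supports an empty filter.")
        self.documents.clear()

    def insert_many(self, documents: List[Paper]) -> None:
        self.documents.extend(dict(document) for document in documents)


def replace_all_papers(collection: Any, papers: List[Paper]) -> int:
    """Clear the collection (step 3), then write the new papers (step 4)."""
    collection.delete_many({})
    if papers:
        # Guard against calling insert_many with an empty list.
        collection.insert_many(papers)
    return len(papers)
```

One trade-off of clearing before writing is that the DB is briefly empty; that seems acceptable here since the DB is only meant to mirror the current "recent" listings.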

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/79efe280eaf8f377be0174f658a4877d245bf066

As we've completed our load utilities, we've realized that there's no need for us to write functionality that converts our extracted and transformed data into JSON. We need to merely internally represent them as Python dictionaries and pymongo will handle the rest. Thus, we've updated the utilities in extract_transform_utilities.py to return Python dictionaries instead of JSON strings.
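Keeping the data as plain dictionaries still leaves JSON output cheap for the file-writing command, since the standard library can serialize them directly. The sketch below is one possible shape for that; the function name and the sample record fields are assumptions.

```python
import json
from typing import Any, Dict, List

Paper = Dict[str, Any]


def write_papers_to_json_file(papers: List[Paper], destination: str) -> None:
    """Serialize a list of plain-dictionary paper records to a JSON file."""
    with open(destination, "w", encoding="utf-8") as output_file:
        json.dump(papers, output_file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Made-up record for illustration only.
    sample = [{"title": "Example Paper", "research_field": "Example Field"}]
    write_papers_to_json_file(sample, "destination.json")
```

Note that pymongo adds an `_id` field to inserted dictionaries in place, and that value is not JSON-serializable by default, so dumping to a file is simplest before the load step or on a copy of the data.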

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/843656f4fd1c944079400c0d8af09edace826259

run_etl_process is the utility that will ultimately be the functionality behind this top-level CLI command:

./arxiv_as_a_newspaper -run-etl-process

See the README for more details.

This patch adds etl_processing_utilities.py, which utilizes load_utilities.py and extract_transform_utilities.py.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/a84e05702cedd29e0a54e271d65cf05ccd29b988

This patch makes it so that this command now does something.

./arxiv_as_a_newspaper.py -run-etl-process

This command is now functional.

It extracts info about recent papers from arXiv.

It processes it.

It clears the DB.

It loads the processed data into the DB.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/fd779edb7780b8c780389d91cd3a113a2a149ecc

We had some namespace and importing issues to resolve wrt package placement.

This patch addresses those and makes the following work more cleanly:

./arxiv_as_a_newspaper.py -run-etl-process

paul-tqh-nguyen commented 5 years ago

I think we can mark half of this task as done!


pnguyen@pnguyenmachine:~/code/arxiv_as_a_newspaper$ ./arxiv_as_a_newspaper.py -run-etl-process
Currently extracting information for research papers relevant to Astrophysics of Galaxies.
Number of research papers found for Astrophysics of Galaxies: 25

Currently extracting information for research papers relevant to Cosmology and Nongalactic Astrophysics.
Number of research papers found for Cosmology and Nongalactic Astrophysics: 25

Currently extracting information for research papers relevant to Earth and Planetary Astrophysics.
Number of research papers found for Earth and Planetary Astrophysics: 25

Currently extracting information for research papers relevant to High Energy Astrophysical Phenomena.
Number of research papers found for High Energy Astrophysical Phenomena: 25

Currently extracting information for research papers relevant to Instrumentation and Methods for Astrophysics.
Number of research papers found for Instrumentation and Methods for Astrophysics: 25

Currently extracting information for research papers relevant to Solar and Stellar Astrophysics.
Number of research papers found for Solar and Stellar Astrophysics: 25

Currently extracting information for research papers relevant to Disordered Systems and Neural Networks.
Number of research papers found for Disordered Systems and Neural Networks: 24

Currently extracting information for research papers relevant to Materials Science.
Number of research papers found for Materials Science: 25

Currently extracting information for research papers relevant to Mesoscale and Nanoscale Physics.
Number of research papers found for Mesoscale and Nanoscale Physics: 25

Currently extracting information for research papers relevant to Other Condensed Matter.
Number of research papers found for Other Condensed Matter: 13

Currently extracting information for research papers relevant to Quantum Gases.
Number of research papers found for Quantum Gases: 24

Currently extracting information for research papers relevant to Soft Condensed Matter.
Number of research papers found for Soft Condensed Matter: 25

Currently extracting information for research papers relevant to Statistical Mechanics.
Number of research papers found for Statistical Mechanics: 25

Currently extracting information for research papers relevant to Strongly Correlated Electrons.
Number of research papers found for Strongly Correlated Electrons: 25

Currently extracting information for research papers relevant to Superconductivity.
Number of research papers found for Superconductivity: 25

Currently extracting information for research papers relevant to Adaptation and Self-Organizing Systems.
Number of research papers found for Adaptation and Self-Organizing Systems: 7

Currently extracting information for research papers relevant to Cellular Automata and Lattice Gases.
Number of research papers found for Cellular Automata and Lattice Gases: 7

Currently extracting information for research papers relevant to Chaotic Dynamics.
Number of research papers found for Chaotic Dynamics: 16

Currently extracting information for research papers relevant to Exactly Solvable and Integrable Systems.
Number of research papers found for Exactly Solvable and Integrable Systems: 13

Currently extracting information for research papers relevant to Pattern Formation and Solitons.
Number of research papers found for Pattern Formation and Solitons: 12

Currently extracting information for research papers relevant to Accelerator Physics.
Number of research papers found for Accelerator Physics: 10

Currently extracting information for research papers relevant to Applied Physics.
Number of research papers found for Applied Physics: 25

Currently extracting information for research papers relevant to Atmospheric and Oceanic Physics.
Number of research papers found for Atmospheric and Oceanic Physics: 14

Currently extracting information for research papers relevant to Atomic and Molecular Clusters.
Number of research papers found for Atomic and Molecular Clusters: 9

Currently extracting information for research papers relevant to Atomic Physics.
Number of research papers found for Atomic Physics: 22

Currently extracting information for research papers relevant to Biological Physics.
Number of research papers found for Biological Physics: 22

Currently extracting information for research papers relevant to Chemical Physics.
Number of research papers found for Chemical Physics: 25

Currently extracting information for research papers relevant to Classical Physics.
Number of research papers found for Classical Physics: 16

Currently extracting information for research papers relevant to Computational Physics.
Number of research papers found for Computational Physics: 25

Currently extracting information for research papers relevant to Data Analysis, Statistics and Probability.
Number of research papers found for Data Analysis, Statistics and Probability: 17

Currently extracting information for research papers relevant to Fluid Dynamics.
Number of research papers found for Fluid Dynamics: 25

Currently extracting information for research papers relevant to General Physics.
Number of research papers found for General Physics: 14

Currently extracting information for research papers relevant to Geophysics.
Number of research papers found for Geophysics: 14

Currently extracting information for research papers relevant to History and Philosophy of Physics.
Number of research papers found for History and Philosophy of Physics: 6

Currently extracting information for research papers relevant to Instrumentation and Detectors.
Number of research papers found for Instrumentation and Detectors: 23

Currently extracting information for research papers relevant to Medical Physics.
Number of research papers found for Medical Physics: 12

Currently extracting information for research papers relevant to Optics.
Number of research papers found for Optics: 25

Currently extracting information for research papers relevant to Physics and Society.
Number of research papers found for Physics and Society: 25

Currently extracting information for research papers relevant to Physics Education.
Number of research papers found for Physics Education: 6

Currently extracting information for research papers relevant to Plasma Physics.
Number of research papers found for Plasma Physics: 24

Currently extracting information for research papers relevant to Popular Physics.
Number of research papers found for Popular Physics: 7

Currently extracting information for research papers relevant to Space Physics.
Number of research papers found for Space Physics: 10

Currently extracting information for research papers relevant to Algebraic Geometry.
Number of research papers found for Algebraic Geometry: 25

Currently extracting information for research papers relevant to Algebraic Topology.
Number of research papers found for Algebraic Topology: 20

Currently extracting information for research papers relevant to Analysis of PDEs.
Number of research papers found for Analysis of PDEs: 25

Currently extracting information for research papers relevant to Category Theory.
Number of research papers found for Category Theory: 15

Currently extracting information for research papers relevant to Classical Analysis and ODEs.
Number of research papers found for Classical Analysis and ODEs: 15

Currently extracting information for research papers relevant to Combinatorics.

Still processing, but manual testing says that this will go without a hitch. Let's see what happens.

:)

paul-tqh-nguyen commented 5 years ago

https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/issues/5 shows that we get throttled pretty quickly.

Let's complete https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/issues/6 before we attempt to make any more progress on this ticket.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/3090c99ff42e36026b38d56fccfbde3ca98fef6f

This patch makes it so that our ETL process will be robust against cases where we're given a URL that currently cannot be reached.

This is done through _safe_get_text_at_url (formerly named _get_text_at_url).

It will return the empty string after 5 attempts at reaching the page.
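A minimal sketch of this behavior, using only the standard library, is shown below. It is not the actual _safe_get_text_at_url; the injectable `fetch` and `sleep` parameters and the pause between attempts are assumptions added here for testability and politeness, and only the "empty string after 5 attempts" behavior comes from the patch description.

```python
import time
import urllib.request
from typing import Callable


def _fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and decode the body, defaulting to UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def safe_get_text_at_url(
    url: str,
    fetch: Callable[[str], str] = _fetch_text,
    max_attempts: int = 5,
    delay_seconds: float = 1.0,  # assumed pause; not specified in the patch notes
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the page text, or the empty string if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses.
            if attempt < max_attempts - 1:
                sleep(delay_seconds)
    return ""
```

Callers then need to treat the empty string as "page unavailable", for example by skipping that research field instead of aborting the whole run.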

paul-tqh-nguyen commented 5 years ago

I think we've gotten our ETL working enough that we can close this ticket.

Testing is still going forward.