paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.

Figure out a way to DWIM mirror links in the DB to ones that hit the main site #8

Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

In https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/3e811b42f4f3c75f0ad5e450b8e9e801f661eb0d, we decided to hit mirror sites instead of the main site.

As a result, the links we store in our external DB point at the mirrors.

We need to do some DWIMing somewhere (either before loading into the DB or as a post-ETL cleanup; I think the former is preferable) so that the front end displays links to the main site rather than to the mirrors.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/ae6ee027d2ca860a7ae6b8df9b9c51663d93bdcb

When the ETL process loads documents into the DB, it does so in batches. Each batch is currently defined extensionally as all the research papers relevant to one general research field, e.g. Mathematics (as opposed to a specific field, e.g. Combinatorics). The documents loaded into the DB now also record this general field.

This will help us iteratively clear small portions of the DB.
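As a rough illustration of why the tag helps, here is a minimal sketch that uses a plain list in place of the real DB. The names `tag_batch`, `clear_field`, and the `research_field` key are assumptions made for this example, not identifiers taken from the patch.

```python
from typing import Dict, List

Doc = Dict[str, str]


def tag_batch(docs: List[Doc], research_field: str) -> List[Doc]:
    """Return copies of the docs with the general research field recorded on each."""
    # "research_field" is a hypothetical key name chosen for this sketch.
    return [{**doc, "research_field": research_field} for doc in docs]


def clear_field(collection: List[Doc], research_field: str) -> List[Doc]:
    """Return the collection without the docs that belong to one general field."""
    return [doc for doc in collection if doc.get("research_field") != research_field]


if __name__ == "__main__":
    collection = tag_batch([{"title": "A"}], "Mathematics")
    collection += tag_batch([{"title": "B"}], "Physics")
    # Only the Mathematics slice is removed; the Physics docs are untouched.
    print(clear_field(collection, "Mathematics"))
```

With a real document store, the same idea would presumably become a delete filtered on that field.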

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/6db24d535a89cf060700ae9257484ebe665a1556

Our ETL used to spend a huge amount of time scraping and then update the DB all at once.

Working iteratively lets us partially update the DB much sooner.

This is a progress patch.

More testing is necessary.
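The iterative approach can be sketched roughly as follows. This is only an illustration under assumptions: `run_etl_iteratively`, the `scrape` callback, and the dict standing in for the DB are all hypothetical and are not taken from the patch.

```python
from typing import Callable, Dict, Iterable, List

Doc = Dict[str, str]


def run_etl_iteratively(
    fields: Iterable[str],
    scrape: Callable[[str], List[Doc]],
    store: Dict[str, List[Doc]],
) -> None:
    """Scrape one general research field at a time and update the store right away."""
    for field in fields:
        docs = scrape(field)  # slow step, done per field instead of for everything up front
        store[field] = docs   # replace only this field's slice of the stored data


if __name__ == "__main__":
    # The dict stands in for the DB; the lambda stands in for the real scraper.
    db: Dict[str, List[Doc]] = {"Mathematics": [{"title": "stale paper"}]}
    run_etl_iteratively(["Mathematics", "Physics"], lambda f: [{"title": f + " paper"}], db)
    print(db)
```

If scraping a later field fails, the fields processed before it have already been written, which is the partial-update behavior described above.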

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/dc55a523099f96324f49f51d67628b4e89c497e7

We hit mirror pages to avoid getting throttled or blocked.

When we store that info in the DB, we want the links to point to the main site, not the mirror sites.

This will make the front end look nicer.

This patch implements that.
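One way to do this kind of rewrite before loading is sketched below. It is not necessarily how the patch does it: the mirror host name is a placeholder, and `to_main_site` and `ASSUMED_MIRROR_HOSTS` are names invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit

MAIN_HOST = "arxiv.org"
# Placeholder host for illustration only; the real mirror list is an assumption here.
ASSUMED_MIRROR_HOSTS = frozenset({"mirror.example.org"})


def to_main_site(url: str, mirror_hosts: frozenset = ASSUMED_MIRROR_HOSTS) -> str:
    """Rewrite a link on a known mirror host so that it points at the main site."""
    parts = urlsplit(url)
    if parts.hostname in mirror_hosts:
        # Keep scheme, path, query, and fragment; swap only the host.
        return urlunsplit((parts.scheme, MAIN_HOST, parts.path, parts.query, parts.fragment))
    return url  # links that are not on a known mirror are left unchanged


if __name__ == "__main__":
    print(to_main_site("https://mirror.example.org/abs/1234.5678"))
```

Doing the rewrite before the load keeps the DB free of mirror links, so the front end needs no special handling.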