nimh-dsst / sharestats-leo-notfork

importing leo's code to get around LFS restriction on forks

Database Development #6

Open joshlawrimore opened 1 week ago

joshlawrimore commented 1 week ago

Supplied:

  - Code should be added to: https://github.com/nimh-dsst/sharestats-leo-notfork/tree/dev_2024
  - 100 PDFs will be provided
  - Metadata for populating provenance

  1. Create Documents, Works, and Provenance table schemas (SQLAlchemy models) from the tables outlined here: https://docs.google.com/spreadsheets/u/6/d/10RpZ0DqPvWx4Et_-K7OtQ2Guqoby6-od3F9sux6p5XY/edit (see the model sketch after this list).
  2. Write a simple, reusable script to create a Postgres database (local Docker container using Compose).
  3. Write a script that performs a SQL dump to back up the DB state; the dumps will eventually be stored on S3. The script should be able to target the DB that the backup is restored to (see the backup/restore sketch after this list).
  4. Write a script to upload the 100 PDFs to S3 and populate the appropriate tables: documents (one row per PDF), works (one row per distinct publication), and provenance (the history of each table record).
  5. Database creation, population, backup, and restoration should be reproducible on another computer.
  6. A pytest suite should be started and expanded to cover the utility scripts, database integrity, and the DB models (a starter test is sketched after this list).
  7. Deploy to an AWS Postgres instance and add GitHub Actions for preliminary management.
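
A minimal sketch of item 1, assuming SQLAlchemy's declarative style. Column names beyond the ids and foreign keys are placeholders; the authoritative schema is the linked spreadsheet. The relationships mirror the one-to-many links described later in this thread (one work or one provenance entry to many documents).

```python
# Sketch of the three models from item 1. Columns other than the ids
# are placeholders; the linked Google Sheet is authoritative.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, func
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Work(Base):
    """A distinct publication; one work can have many PDF versions."""
    __tablename__ = "works"
    id = Column(Integer, primary_key=True)  # auto-generated by the database
    documents = relationship("Document", back_populates="work")


class Provenance(Base):
    """A log entry for one ingest run; one entry can cover many documents."""
    __tablename__ = "provenance"
    id = Column(Integer, primary_key=True)
    created_at = Column(DateTime, server_default=func.now())  # placeholder column
    documents = relationship("Document", back_populates="provenance")


class Document(Base):
    """One PDF stored in S3."""
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)  # auto-generated by the database
    s3_uri = Column(String, nullable=False)  # placeholder column name
    work_id = Column(Integer, ForeignKey("works.id"))
    provenance_id = Column(Integer, ForeignKey("provenance.id"))
    work = relationship("Work", back_populates="documents")
    provenance = relationship("Provenance", back_populates="documents")
```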
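
For item 3, one way to keep backup and restore reproducible is to shell out to pg_dump/pg_restore. This is a sketch; the connection strings are placeholders for whatever the Compose file defines.

```python
# Sketch of item 3: dump the DB to a custom-format file and restore it
# into a (possibly different) target DB. DSNs below are placeholders.
import subprocess


def dump_db(dsn: str, outfile: str) -> None:
    """Back up the database at `dsn` to a custom-format dump file."""
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={outfile}", dsn],
        check=True,
    )


def restore_db(dsn: str, dumpfile: str) -> None:
    """Restore a dump into the target database given by `dsn`."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", f"--dbname={dsn}", dumpfile],
        check=True,
    )


if __name__ == "__main__":
    dump_db("postgresql://postgres:postgres@localhost:5432/sharestats", "backup.dump")
    # restore_db("postgresql://postgres:postgres@other-host:5432/sharestats", "backup.dump")
```

Because the dump is a plain file, pushing it to S3 later (as item 3 anticipates) only requires an extra upload step.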
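
For item 6, a starter test of the kind the suite could begin with, assuming the model sketch above and an in-memory SQLite engine for speed (Postgres-specific behavior would still need container-backed tests):

```python
# Sketch of a starter pytest for item 6: checks the one-to-many links
# between works/provenance and documents.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from models import Base, Document, Provenance, Work  # hypothetical module path


@pytest.fixture()
def session():
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    with Session(engine) as s:
        yield s


def test_one_provenance_many_documents(session):
    prov = Provenance()
    work = Work()
    docs = [
        Document(s3_uri=f"s3://bucket/doc{i}.pdf", work=work, provenance=prov)
        for i in range(3)
    ]
    session.add_all(docs)
    session.commit()
    assert all(d.provenance_id == prov.id for d in docs)
    assert all(d.work_id == work.id for d in docs)
```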
quang-ng commented 3 days ago

I'm working on this issue, at PR: https://github.com/nimh-dsst/sharestats-leo-notfork/pull/7

joshlawrimore commented 2 days ago

> I'm working on this issue, at PR: #7

Modified from @leej3's message on Slack to use the current column names generated by the code at https://github.com/quang-ng/sharestats-leo-notfork/tree/issue-6. There was an incorrect name in the works table in the Slack version. The tables should be filled in with the following information:


Documents Table

  - id: Auto-generated (by the database upon creation).


Works Table


Provenance Table


joshlawrimore commented 2 days ago

Please write the upload scripts so that they can do the following:

  1. Upload the PDFs to a given S3 bucket. Assume the PDFs will be supplied either as a single directory or as a single PDF. The function in the script should also be able to take in a list of PDF file paths (see the sketch after this list).
  2. Upon proof of a successful upload to S3, write the upload info to the documents table.
  3. Keep a record of the successful upload in the provenance table. A single provenance entry can refer to the upload of many PDFs, so provenance_id can have a one-to-many relationship to document_id. The provenance table is a log of each upload-script run, so it should be populated after the upload completes.
  4. There is a one-to-many relationship between works_id and document_id, since a publication can have different PDF versions. This script should be able to be run independently of the upload script.
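
A sketch of how items 1-3 could fit together, assuming boto3 and the model sketch from the first comment; the bucket name, DSN, and `models` module path are placeholders. Linking documents to works (item 4) is deliberately left to a separate, independently runnable script.

```python
# Sketch of the upload flow: push PDFs to S3, then record one provenance
# row for the whole run and one documents row per uploaded PDF.
from pathlib import Path

import boto3
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from models import Document, Provenance  # hypothetical module path


def collect_pdfs(target) -> list[Path]:
    """Accept a directory, a single PDF path, or a list of PDF paths."""
    if isinstance(target, (list, tuple)):
        return [Path(p) for p in target]
    target = Path(target)
    return sorted(target.glob("*.pdf")) if target.is_dir() else [target]


def upload_pdfs(target, bucket: str, dsn: str) -> None:
    s3 = boto3.client("s3")
    uploaded = []
    for pdf in collect_pdfs(target):
        s3.upload_file(str(pdf), bucket, pdf.name)  # raises on failure
        uploaded.append(f"s3://{bucket}/{pdf.name}")
    # Reached only if every upload succeeded: one provenance row logs the
    # run, and each document row points back to it (one-to-many).
    with Session(create_engine(dsn)) as session:
        prov = Provenance()
        session.add(prov)
        session.add_all(Document(s3_uri=uri, provenance=prov) for uri in uploaded)
        session.commit()
```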

@quang-ng, great work on getting the Alembic code working for the creation of the tables. I look forward to your next update!