nimh-dsst / sharestats-leo-notfork

importing leo's code to get around LFS restriction on forks

Database Development #6

Open joshlawrimore opened 1 week ago

joshlawrimore commented 1 week ago

Supplied:

  - Code should be added to: https://github.com/nimh-dsst/sharestats-leo-notfork/tree/dev_2024
  - 100 PDFs will be provided
  - Metadata for populating provenance

  1. Create Documents, Works, and Provenance table schemas (SQLAlchemy models) from the tables outlined here: https://docs.google.com/spreadsheets/u/6/d/10RpZ0DqPvWx4Et_-K7OtQ2Guqoby6-od3F9sux6p5XY/edit (see the model sketch after this list).
  2. Write a simple, reusable script to create a Postgres database (local Docker container using Compose).
  3. Write a script that performs a SQL dump to back up the DB state; the dumps will eventually be stored on S3. The script should be able to target the DB that the backup is restored to (see the backup/restore sketch after this list).
  4. Write a script to upload the 100 PDFs to S3 and populate the appropriate tables: documents (one row per PDF), works (one row per distinct publication), and provenance (the history of each table record).
  5. Database creation, population, backup, and restoration should be reproducible on another computer.
  6. A pytest suite should be started and expanded to cover the utility scripts, database integrity, and the DB models (a starter test is sketched after this list).
  7. Deploy to an AWS Postgres instance and add GitHub Actions for preliminary management.
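
A minimal sketch of item 1, assuming SQLAlchemy's declarative style. Column names beyond the ids and foreign keys are placeholders; the authoritative schema is the linked spreadsheet. The relationships mirror the one-to-many links described later in this thread (one work or one provenance entry to many documents).

```python
# Sketch of the three models from item 1. Columns other than the ids
# are placeholders; the linked Google Sheet is authoritative.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, func
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Work(Base):
    """A distinct publication; one work can have many PDF versions."""
    __tablename__ = "works"
    id = Column(Integer, primary_key=True)  # auto-generated by the database
    documents = relationship("Document", back_populates="work")


class Provenance(Base):
    """A log entry for one ingest run; one entry can cover many documents."""
    __tablename__ = "provenance"
    id = Column(Integer, primary_key=True)
    created_at = Column(DateTime, server_default=func.now())  # placeholder column
    documents = relationship("Document", back_populates="provenance")


class Document(Base):
    """One PDF stored in S3."""
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)  # auto-generated by the database
    s3_uri = Column(String, nullable=False)  # placeholder column name
    work_id = Column(Integer, ForeignKey("works.id"))
    provenance_id = Column(Integer, ForeignKey("provenance.id"))
    work = relationship("Work", back_populates="documents")
    provenance = relationship("Provenance", back_populates="documents")
```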
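
For item 3, one way to keep backup and restore reproducible is to shell out to pg_dump/pg_restore. This is a sketch; the connection strings are placeholders for whatever the Compose file defines.

```python
# Sketch of item 3: dump the DB to a custom-format file and restore it
# into a (possibly different) target DB. DSNs below are placeholders.
import subprocess


def dump_db(dsn: str, outfile: str) -> None:
    """Back up the database at `dsn` to a custom-format dump file."""
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={outfile}", dsn],
        check=True,
    )


def restore_db(dsn: str, dumpfile: str) -> None:
    """Restore a dump into the target database given by `dsn`."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", f"--dbname={dsn}", dumpfile],
        check=True,
    )


if __name__ == "__main__":
    dump_db("postgresql://postgres:postgres@localhost:5432/sharestats", "backup.dump")
    # restore_db("postgresql://postgres:postgres@other-host:5432/sharestats", "backup.dump")
```

Because the dump is a plain file, pushing it to S3 later (as item 3 anticipates) only requires an extra upload step.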
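
For item 6, a starter test of the kind the suite could begin with, assuming the model sketch above and an in-memory SQLite engine for speed (Postgres-specific behavior would still need container-backed tests):

```python
# Sketch of a starter pytest for item 6: checks the one-to-many links
# between works/provenance and documents.
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from models import Base, Document, Provenance, Work  # hypothetical module path


@pytest.fixture()
def session():
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    with Session(engine) as s:
        yield s


def test_one_provenance_many_documents(session):
    prov = Provenance()
    work = Work()
    docs = [
        Document(s3_uri=f"s3://bucket/doc{i}.pdf", work=work, provenance=prov)
        for i in range(3)
    ]
    session.add_all(docs)
    session.commit()
    assert all(d.provenance_id == prov.id for d in docs)
    assert all(d.work_id == work.id for d in docs)
```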
quang-ng commented 3 days ago

I'm working on this issue, at PR: https://github.com/nimh-dsst/sharestats-leo-notfork/pull/7

joshlawrimore commented 2 days ago

> I'm working on this issue, at PR: #7

Modified from @leej3's message on Slack to use the current column names generated by the code at https://github.com/quang-ng/sharestats-leo-notfork/tree/issue-6. There was an incorrect name in the works table in the Slack version. The tables should be filled in with the following information:


Documents Table

  - id: Auto-generated (by the database upon creation).


Works Table


Provenance Table


joshlawrimore commented 2 days ago

Please write the upload scripts so that they can do the following:

  1. Upload the PDFs to a given S3 bucket. Assume the PDFs will be supplied either as a single directory or as a single PDF. The function in the script should also be able to take in a list of PDF file paths (see the sketch after this list).
  2. Upon proof of a successful upload to S3, write the upload info to the documents table.
  3. Keep a record of the successful upload in the provenance table. A single provenance entry can refer to the upload of many PDFs, so provenance_id can have a one-to-many relationship to document_id. The provenance table is a log of each upload-script run, so it should be populated after the upload completes.
  4. There is a one-to-many relationship between works_id and document_id, since a publication can have different PDF versions. This script should be able to be run independently of the upload script.
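
A sketch of how items 1-3 could fit together, assuming boto3 and the model sketch from the first comment; the bucket name, DSN, and `models` module path are placeholders. Linking documents to works (item 4) is deliberately left to a separate, independently runnable script.

```python
# Sketch of the upload flow: push PDFs to S3, then record one provenance
# row for the whole run and one documents row per uploaded PDF.
from pathlib import Path

import boto3
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from models import Document, Provenance  # hypothetical module path


def collect_pdfs(target) -> list[Path]:
    """Accept a directory, a single PDF path, or a list of PDF paths."""
    if isinstance(target, (list, tuple)):
        return [Path(p) for p in target]
    target = Path(target)
    return sorted(target.glob("*.pdf")) if target.is_dir() else [target]


def upload_pdfs(target, bucket: str, dsn: str) -> None:
    s3 = boto3.client("s3")
    uploaded = []
    for pdf in collect_pdfs(target):
        s3.upload_file(str(pdf), bucket, pdf.name)  # raises on failure
        uploaded.append(f"s3://{bucket}/{pdf.name}")
    # Reached only if every upload succeeded: one provenance row logs the
    # run, and each document row points back to it (one-to-many).
    with Session(create_engine(dsn)) as session:
        prov = Provenance()
        session.add(prov)
        session.add_all(Document(s3_uri=uri, provenance=prov) for uri in uploaded)
        session.commit()
```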

@quang-ng, great work on getting the Alembic code working for the creation of the tables. I look forward to your next update!