informatics-isi-edu / pdb-ihm

Deriva Protein Database Project

archive pipeline #209

Open hongsudt opened 7 months ago

hongsudt commented 7 months ago

Requirements

Three holdings files to be generated:

<scratch>/pdb_ihm/holdings/current_file_holdings.json.gz
<scratch>/pdb_ihm/holdings/released_structures_last_modified_dates.json.gz
<scratch>/pdb_ihm/holdings/unreleased_entries.json.gz

For each released entry, the following files are to be transferred:

<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/structures/{entry_id}.cif.gz  
<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/validation_reports/{entry_id}_full_validation.pdf.gz
<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/validation_reports/{entry_id}_summary_validation.pdf.gz

# `{hash}` = 2nd and 3rd characters of the 4-character PDB accession code.
# `{entry_id}` = the 4-character PDB accession code (lower case)
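For example (hypothetical accession code `8abc`): `{hash}` = `ab`, so the structure file lands at `<scratch>/pdb_ihm/data/entries/ab/8abc/structures/8abc.cif.gz`.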

Model changes

Update System_Generated_File_Type

"mmCIF" -- "mmcif" "Validation: Full PDF" -- "validation_report" "Validation: Summary PDF" -- "validation_report"


### Pipeline
#### Generating entries directories to be archived 
- Create the following helper functions (see the sketch after this list): 
  - getArchiveDate(datetime). This should be the date of the Friday of the week.
    - reference_datetime = "xxx"    # pick one Thursday at 8:00 PM PT / 11:00 PM ET (in UTC this may fall on Friday)
    - convert reference_datetime to seconds since epoch
    - seconds_in_one_week = number of seconds in a week
    - to_submit_seconds = seconds remaining until the next cutoff, i.e. seconds_in_one_week - ((current datetime - reference_datetime) mod seconds_in_one_week)
    - submission date = current datetime + to_submit_seconds
    - Note: with this approach, the week starts right after the reference datetime (e.g. Thursday 8:00 PM PT). 
  - getHash(accession_id):
    - if accession_id is 4 characters, return its 2nd and 3rd characters. We will extend this function later to address the case when it matches `pdb_xxxx`. 
    - else return `000` (this is what we agreed upon for pdbdev some time ago)
  - getFileArchiveSubDirectory(entry_rid, archive_category):  /* archive_category is what is defined in "System_Generated_File_Type", which also contains the directory names */
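A minimal Python sketch of the first two helpers under the cutoff logic above; `REFERENCE_DATETIME` is a placeholder for the Thursday 8:00 PM PT reference that still has to be picked:

```python
from datetime import datetime, timedelta, timezone

# Placeholder reference cutoff: Thursday 2024-08-22 20:00 PT (UTC-7 during PDT).
REFERENCE_DATETIME = datetime(2024, 8, 22, 20, 0, tzinfo=timezone(timedelta(hours=-7)))
SECONDS_IN_ONE_WEEK = 7 * 24 * 60 * 60

def getArchiveDate(now):
    """Return the submission datetime: the first weekly cutoff at or after `now`."""
    elapsed = (now - REFERENCE_DATETIME).total_seconds()
    to_submit_seconds = (-elapsed) % SECONDS_IN_ONE_WEEK  # seconds until the next cutoff
    return now + timedelta(seconds=to_submit_seconds)

def getHash(accession_id):
    """2nd and 3rd characters of a 4-character accession code; '000' otherwise (pdbdev fallback)."""
    if len(accession_id) == 4:
        return accession_id[1:3].lower()
    return "000"
```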

Parameters or config:

scratch_dir =

Set up archive_category_dir_names at the beginning of the pipeline. Set once, use throughout.

archive_directory = "pdb_ihm"
holding_directory = "%s/holdings" % (archive_directory)
data_directory = "%s/data" % (archive_directory)

archive_category_dir_names is derived from /ermrest/catalog/1/attribute/Vocab:Archive_Category/Name,Directory_Name. The key is the Name and the value is the Directory_Name.

Note: entry_id should be the entry's PDB Accession Code. @brindakv Is this right? If so, we can change entry_id to entry_accession_code.

```python
def getFileArchiveSubDirectory(entry_id, archive_category):
    # entry_dir = {data_directory}/entries/{hash}/{accession_code}
    global archive_category_dir_names
    global data_directory

    entry_dir = "%s/entries/%s/%s/" % (data_directory, getHash(entry_id), entry_id)
    return entry_dir + archive_category_dir_names[archive_category]
```
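For example, assuming the Archive_Category vocabulary maps Name "mmcif" to Directory_Name "structures" (consistent with the required paths above) and a hypothetical accession code:

```python
getFileArchiveSubDirectory("8abc", "mmcif")
# -> "pdb_ihm/data/entries/ab/8abc/structures"
```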


- Create a dictionary entry2archived: read from the Entry_Latest_Archive table to get the latest archived entries (see the sketch below). The dict looks like: 

/ermrest/catalog/2/entity/E:=PDB:Entry_Latest_Archive/

entry2archived: { "entry1": {"RID": "xx", "Entry": "xx", "mmCIF_URL": "xx", "Submitted_Files": "", ..., "Submission_Time": "xx"} }
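A minimal sketch of building entry2archived, assuming anonymous read access and keying the rows by their Entry column (host and catalog id are placeholders; credentials are omitted):

```python
import requests

def fetch_entry2archived(host="dev-aws.pdb-dev.org", catalog=2):
    """Read the latest archived rows and key them by Entry."""
    url = "https://%s/ermrest/catalog/%s/entity/E:=PDB:Entry_Latest_Archive/" % (host, catalog)
    rows = requests.get(url, headers={"Accept": "application/json"}).json()
    return {row["Entry"]: row for row in rows}
```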

- Generate a new dict entry2upload: the entries to be archived, i.e. newly released entries or those with updated mmCIF files. The structure should be the same as entry2archived. The query to use is below:

entry2upload: a list of entry ids

/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=REL/ F:=(structure_id)=(PDB:System_Generated_File:structure_id)/File_Type=mmCIF/ A:=(structure_id)=left(PDB:Entry_Latest_Archive)/mmCIF_URL::null::;mmCIF_URL::neq::F:File_URL/$F/E:structure_id,F:File_URL,...

entry2upload_files: a list of files to be archived. Note: we will have to address the URL-length limitation if there are many RIDs.

Question: @brindakv will we ever go from HOLD to REL?

/ermrest/catalog/2/entity/V:=Vocab:System_Generated_File_Type/!Archive_Category::null::/F:=(Name)=(PDB:System_Generated_Files)/Entry=Any(",".join(entry2upload.keys()))

Note: the Submission_Time of these entries should be set to getArchiveDate(Now())


- Prepare the data folder according to the list above (a sketch of this step follows the list): 
  - Download files from hatrac
  - GZip all files
  - Put them in the proper directories
  - Update entry2upload.Uploaded_Files appropriately
- Note: we will update the Entry_Latest_Archive table at the end, once all files have been generated properly. This is to make sure that if something goes wrong, we can regenerate. 
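A sketch of the download-gzip-place step for a single file, assuming plain HTTPS access to hatrac (credentials omitted) and the helpers above; `stage_file` is a hypothetical name:

```python
import gzip
import os
import shutil

import requests

def stage_file(file_url, entry_id, archive_category, file_name, scratch_dir):
    """Download one file from hatrac, gzip it, and place it in its archive subdirectory."""
    target_dir = os.path.join(scratch_dir, getFileArchiveSubDirectory(entry_id, archive_category))
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, file_name + ".gz")
    with requests.get(file_url, stream=True) as resp:
        resp.raise_for_status()
        with gzip.open(target, "wb") as out:
            shutil.copyfileobj(resp.raw, out)
    return target
```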

#### Generating manifest files
- @brindakv please check all the filenames (e.g. capitalization, plurals, etc.)
- Update the entry2archived dict with the entry2upload dict, e.g. `entry2archived.update(entry2upload)` --> the old entries in the original entry2archived are updated with new metadata, and the new ids get added to the dict.

- Prepare **Released_Structures_LMD** (released_structures_last_modified_dates.json.gz). Generate the file based on the entry2archived dict using the Submission_Time column 

- Generate **Current_file_holdings** (current_file_holdings.json.gz)  from entry2archived dict

When generating the path, assume that the data path is at the root, e.g.

```python
for entry in entry2archived.values():
    for category, files in entry["Submitted_Files"].items():
        for file in files:
            # getFileArchiveSubDirectory already appends the category directory name
            path = "/%s/%s" % (getFileArchiveSubDirectory(entry["Accession_Code"], category), file)
```


- Generate **Unreleased_Entries** (unreleased_entries.json.gz): query the Entry table with `Status=HOLD`

ERMrest query to get the list of entries on hold:

/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=HOLD/

- Prepare the folder of manifest files (see the sketch below)
  - Gzip all files (or make sure the files are gzipped)
  - Put them in the proper folder structure
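A small sketch for the manifest step: serialize a dict straight to a gzipped JSON file (the holdings layout itself is whatever the loops above produce):

```python
import gzip
import json

def write_gzipped_json(obj, path):
    """Write `obj` as gzip-compressed JSON, e.g. current_file_holdings.json.gz."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)
```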

#### Update bookkeeping tables
- Create a new entry in `PDB.PDB_Archive` with `Submission_Time`, `Number_Of_Entries`, and the proper file URLs
- Update PDB.Entry_Latest_Archive (insert_if_exist_update). At the end, the Entry_Latest_Archive table will have new entries inserted, and entries whose mmCIF file md5 changed will be updated (a sketch follows). 
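A sketch of that final update, assuming the standard ERMrest entity API (POST to insert, PUT to update) stands in for insert_if_exist_update; credentials and exact column alignment are glossed over:

```python
import requests

def update_entry_latest_archive(entry2upload, entry2archived, host="dev-aws.pdb-dev.org", catalog=2):
    """Insert rows for newly archived entries; update rows that already exist."""
    base = "https://%s/ermrest/catalog/%s/entity/PDB:Entry_Latest_Archive" % (host, catalog)
    new_rows = [row for key, row in entry2upload.items() if key not in entry2archived]
    mod_rows = [row for key, row in entry2upload.items() if key in entry2archived]
    if new_rows:
        requests.post(base, json=new_rows).raise_for_status()
    if mod_rows:
        requests.put(base, json=mod_rows).raise_for_status()
```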

#### Directory conventions

File 1: released_structures_last_modified_dates.json

Error handling

ACL needs to be updated for the new tables

brindakv commented 3 months ago

The following issues were found:

svoinea commented 3 months ago

@brindakv What do you mean by *.gz holdings files cannot be unzipped? I have copied them to my Mac, double-clicked them, and the *.json files are there.

For the filenames, I am using them as they are in the Entry_Generated_File table and selecting only those for which filename == <Accession_Code>.

For the xx, the issue specifies to use getHash, which returns 000 for the accession_code_mode == PDBDEV case.

Are your comments referring only to the accession_code_mode == PDB case?

At this point I am confused about the corrections I need to make for the accession_code_mode == PDBDEV case before switching to the accession_code_mode == PDB case.

brindakv commented 3 months ago

Update annotations to make the new tables visible to users via the menu.

aozalevsky commented 3 months ago

@svoinea current IHMValidation version is:

https://github.com/salilab/IHMValidation/tree/v1.2 or https://github.com/salilab/IHMValidation/commit/48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3

svoinea commented 3 months ago

This seems to be a misunderstanding. If v1.2 is a tag, then:

git log -1
commit 86c5267ce30904cc62907e195ea34f33858127cc (HEAD, tag: v1.2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date:   Mon Jul 22 20:10:39 2024 -0700

If taking the commit number:

git log -1
commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date:   Mon Jul 29 10:59:23 2024 -0700

So, production now has commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2), as it is the more recent one.

aozalevsky commented 3 months ago

@svoinea no confusion. I moved the release tag to the most recent commit in the release branch. If you were to deploy from scratch today, v1.2 would pull 48c841. Anyway, 48c841 is the correct commit.

brindakv commented 3 months ago

Update filename current_holdings.json.gz to current_file_holdings.json.gz.

Update the contents of this file to include validation_report after mmcif.

hongsudt commented 2 months ago

Code changes to support idempotency

Assuming that T1 is this week's cutoff time (e.g. used for setting Submission_Time) and T0 is the previous week's cutoff. If we use the following ERMrest queries, this should allow us to run the script multiple times during the submission week. Note: if we want to change the submission datetime policy, do not run any script of the new week before changing the setting.

Notes: if the reference time needs to be changed, it should be moved to more than 7 days after the latest submission time in the system. Otherwise, we need to modify the RCT of re-released entries that were submitted at the previous T0 to be the current T0, then change the RCT back after this cycle to reflect the actual event.
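Under that convention, T0 and T1 for the current run can be derived from the getArchiveDate helper sketched earlier:

```python
from datetime import datetime, timedelta, timezone

# T1: this week's cutoff (the Submission_Time to stamp); T0: the previous cutoff.
T1 = getArchiveDate(datetime.now(timezone.utc))
T0 = T1 - timedelta(weeks=1)
```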

# get a list of entries that were previously archived for entry2archived (same as above):
/ermrest/catalog/2/entity/E:=PDB:Entry_Latest_Archive/

# entry2upload: a list of entry ids to upload
The earlier query doesn't work since ERMrest doesn't support substitution in the comparison statement. 
We will need to do 2 queries, assuming that there is now a "Structure_Id" column in Entry_Latest_Archive:

# 1) Check for new released entries, e.g. new REL entries (released after T0). Note: the new entries released this cycle will be returned by the null constraint if the script runs for the first time (e.g. they haven't been picked up before). Subsequent runs of the script will get the new releases from the Entry_Latest_Archive entries that were generated within this cycle (e.g. RCT > T0 and Submission_Time = T1) 

https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
E:=PDB:entry/Workflow_Status=REL/
F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/
A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::<T0>&A:Submission_Time=<T1>)/
$E/E:RID,E:id,...

# Concrete example; this works:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/E:=PDB:entry/Workflow_Status=REL/F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::2024-08-22%2020%3A00%3A00-07%3A00&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$E/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,A:Entry,A:Submission_Time,A:mmCIF_URL

# 2) Get a list of re-released entries, e.g. entries whose mmCIF contents have changed since T0

https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
A:=PDB:Entry_Latest_Archive/A:RCT::leq::<T0>/
E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/
F:=left(A:mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=<T1>)/
$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time

# Concrete example.
# -- waiting to test during the 08/29 cycle:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/A:RCT::leq::2024-08-22T20%3A00%3A00-07%3A00/E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$A/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,F:File_Name,A:Entry,A:Submission_Time,A:mmCIF_URL

# -- this URL is no longer relevant (it only supports the cycle of 08/22)
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/
E:=(Entry)=(PDB:entry:RID)/Workflow_Status=REL/$A/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:File_URL::null::;(A:Submission_Time=2024-08-22%2015%3A00%3A00-07%3A00&A:RCT::leq::2024-08-15%2015%3A00%3A00-07%3A00)/$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time,A:RCT,A:RMT

Model changes

@brindakv asked to add two new columns to the PDB_Archive table to help break down the submitted entries:

brindakv commented 2 months ago
hongsudt commented 2 months ago

Add the following configuration parameter: cutoff_time_pacific: "Thursday 20:00"

hongsudt commented 2 months ago

@svoinea We will need to drop the btree indexes for "Submitted_Files" and "Submission_History" using psql. We should also turn off faceting and sorting in the annotations for these columns, since the btree indexes are now gone.