informatics-isi-edu / pdb-ihm

Deriva Protein Database Project

archive pipeline #209

Open hongsudt opened 7 months ago

hongsudt commented 7 months ago

Requirements

Three holdings files to be generated:

<scratch>/pdb_ihm/holdings/current_file_holdings.json.gz
<scratch>/pdb_ihm/holdings/released_structures_last_modified_dates.json.gz
<scratch>/pdb_ihm/holdings/unreleased_entries.json.gz

For each released entry, the following files are to be transferred:

<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/structures/{entry_id}.cif.gz  
<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/validation_reports/{entry_id}_full_validation.pdf.gz
<scratch>/pdb_ihm/data/entries/{hash}/{entry_id}/validation_reports/{entry_id}_summary_validation.pdf.gz

# `{hash}` = 2nd and 3rd characters of the 4-character PDB accession code.
# `{entry_id}` = the 4-character PDB accession code (lower case)
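For example (hypothetical accession code `8abc`): `{hash}` = `ab`, so the structure file lands at `<scratch>/pdb_ihm/data/entries/ab/8abc/structures/8abc.cif.gz`.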

Model changes

Update System_Generated_File_Type

"mmCIF" -- "mmcif" "Validation: Full PDF" -- "validation_report" "Validation: Summary PDF" -- "validation_report"


### Pipeline
#### Generating entries directories to be archived 
- Create the following helper functions (see the sketch after this list): 
  - getArchiveDate(datetime). This should be the date of the Friday of the week.
    - reference_datetime = "xxx"    # pick one Thursday at 8:00 PM PT / 11:00 PM ET (in UTC this may fall on Friday)
    - convert reference_datetime to seconds since epoch
    - seconds_in_one_week = number of seconds in a week
    - to_submit_seconds = seconds remaining until the next cutoff, i.e. seconds_in_one_week - ((current datetime - reference_datetime) mod seconds_in_one_week)
    - submission date = current datetime + to_submit_seconds
    - Note: with this approach, the week starts right after the reference datetime (e.g. Thursday 8:00 PM PT). 
  - getHash(accession_id):
    - if accession_id is 4 characters, return its 2nd and 3rd characters. We will extend this function later to address the case when it matches `pdb_xxxx`. 
    - else return `000` (this is what we agreed upon for pdbdev some time ago)
  - getFileArchiveSubDirectory(entry_rid, archive_category):  /* archive_category is what is defined in "System_Generated_File_Type", which also contains the directory names */
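A minimal Python sketch of the first two helpers under the cutoff logic above; `REFERENCE_DATETIME` is a placeholder for the Thursday 8:00 PM PT reference that still has to be picked:

```python
from datetime import datetime, timedelta, timezone

# Placeholder reference cutoff: Thursday 2024-08-22 20:00 PT (UTC-7 during PDT).
REFERENCE_DATETIME = datetime(2024, 8, 22, 20, 0, tzinfo=timezone(timedelta(hours=-7)))
SECONDS_IN_ONE_WEEK = 7 * 24 * 60 * 60

def getArchiveDate(now):
    """Return the submission datetime: the first weekly cutoff at or after `now`."""
    elapsed = (now - REFERENCE_DATETIME).total_seconds()
    to_submit_seconds = (-elapsed) % SECONDS_IN_ONE_WEEK  # seconds until the next cutoff
    return now + timedelta(seconds=to_submit_seconds)

def getHash(accession_id):
    """2nd and 3rd characters of a 4-character accession code; '000' otherwise (pdbdev fallback)."""
    if len(accession_id) == 4:
        return accession_id[1:3].lower()
    return "000"
```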

Parameters or config:

scratch_dir =

Set up archive_category_dir_names at the beginning of the pipeline. Set once, use throughout.

archive_directory = "pdb_ihm"
holding_directory = "%s/holdings" % (archive_directory)
data_directory = "%s/data" % (archive_directory)

archive_category_dir_names is derived from /ermrest/catalog/1/attribute/Vocab:Archive_Category/Name,Directory_Name. The key is the Name and the value is the Directory_Name.

Note: entry_id should be the entry's PDB Accession Code. @brindakv Is this right? If so, we can change entry_id to entry_accession_code.

```python
def getFileArchiveSubDirectory(entry_id, archive_category):
    # entry_dir = {data_directory}/entries/{hash}/{accession_code}
    global archive_category_dir_names
    global data_directory

    entry_dir = "%s/entries/%s/%s/" % (data_directory, getHash(entry_id), entry_id)
    return entry_dir + archive_category_dir_names[archive_category]
```
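For example, assuming the Archive_Category vocabulary maps Name "mmcif" to Directory_Name "structures" (consistent with the required paths above) and a hypothetical accession code:

```python
getFileArchiveSubDirectory("8abc", "mmcif")
# -> "pdb_ihm/data/entries/ab/8abc/structures"
```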


- Create a dictionary entry2archived: read from the Entry_Latest_Archive table to get the latest archived entries (see the sketch below). The dict looks like: 

/ermrest/catalog/2/entity/E:=PDB:Entry_Latest_Archive/

entry2archived: { "entry1": {"RID": "xx", "Entry": "xx", "mmCIF_URL": "xx", "Submitted_Files": "", ..., "Submission_Time": "xx"} }
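A minimal sketch of building entry2archived, assuming anonymous read access and keying the rows by their Entry column (host and catalog id are placeholders; credentials are omitted):

```python
import requests

def fetch_entry2archived(host="dev-aws.pdb-dev.org", catalog=2):
    """Read the latest archived rows and key them by Entry."""
    url = "https://%s/ermrest/catalog/%s/entity/E:=PDB:Entry_Latest_Archive/" % (host, catalog)
    rows = requests.get(url, headers={"Accept": "application/json"}).json()
    return {row["Entry"]: row for row in rows}
```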

- Generate a new dict entry2upload: the entries to be archived, i.e. newly released entries or those with updated mmCIF files. The structure should be the same as entry2archived. The query to use is below:

entry2upload: a list of entry ids

/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=REL/ F:=(structure_id)=(PDB:System_Generated_File:structure_id)/File_Type=mmCIF/ A:=(structure_id)=left(PDB:Entry_Latest_Archive)/mmCIF_URL::null::;mmCIF_URL::neq::F:File_URL/$F/E:structure_id,F:File_URL,...

entry2upload_files: a list of files to be archived. Note: we will have to address the URL-length limitation if there are many RIDs.

Question: @brindakv will we ever go from HOLD to REL?

/ermrest/catalog/2/entity/V:=Vocab:System_Generated_File_Type/!Archive_Category::null::/F:=(Name)=(PDB:System_Generated_Files)/Entry=Any(",".join(entry2upload.keys()))

Note: the Submission_Time of these entries should be set to getArchiveDate(Now())


- Prepare the data folder according to the list above (a sketch of this step follows the list): 
  - Download files from hatrac
  - GZip all files
  - Put them in the proper directories
  - Update entry2upload.Uploaded_Files appropriately
- Note: we will update the Entry_Latest_Archive table at the end, once all files have been generated properly. This is to make sure that if something goes wrong, we can regenerate. 
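A sketch of the download-gzip-place step for a single file, assuming plain HTTPS access to hatrac (credentials omitted) and the helpers above; `stage_file` is a hypothetical name:

```python
import gzip
import os
import shutil

import requests

def stage_file(file_url, entry_id, archive_category, file_name, scratch_dir):
    """Download one file from hatrac, gzip it, and place it in its archive subdirectory."""
    target_dir = os.path.join(scratch_dir, getFileArchiveSubDirectory(entry_id, archive_category))
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, file_name + ".gz")
    with requests.get(file_url, stream=True) as resp:
        resp.raise_for_status()
        with gzip.open(target, "wb") as out:
            shutil.copyfileobj(resp.raw, out)
    return target
```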

#### Generating manifest files
- @brindakv please check all the filenames (e.g. capitalization, plurals, etc.)
- Update the entry2archived dict with the entry2upload dict, e.g. `entry2archived.update(entry2upload)` --> the old entries in the original entry2archived are updated with new metadata, and the new ids get added to the dict.

- Prepare **Released_Structures_LMD** (released_structures_last_modified_dates.json.gz). Generate the file based on the entry2archived dict using the Submission_Time column 

- Generate **Current_file_holdings** (current_file_holdings.json.gz)  from entry2archived dict

When generating the path, assume that the data path is at the root, e.g.

```python
for entry in entry2archived.values():
    for category, files in entry["Submitted_Files"].items():
        for file in files:
            # getFileArchiveSubDirectory already appends the category directory name
            path = "/%s/%s" % (getFileArchiveSubDirectory(entry["Accession_Code"], category), file)
```


- Generate **Unreleased_Entries** (unreleased_entries.json.gz): query the Entry table with `Status=HOLD`

ERMrest query to get the list of entries on hold:

/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=HOLD/

- Prepare the folder of manifest files (see the sketch below)
  - Gzip all files (or make sure the files are gzipped)
  - Put them in the proper folder structure
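A small sketch for the manifest step: serialize a dict straight to a gzipped JSON file (the holdings layout itself is whatever the loops above produce):

```python
import gzip
import json

def write_gzipped_json(obj, path):
    """Write `obj` as gzip-compressed JSON, e.g. current_file_holdings.json.gz."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)
```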

#### Update bookkeeping tables
- Create a new entry in `PDB.PDB_Archive` with `Submission_Time`, `Number_Of_Entries`, and the proper file URLs
- Update PDB.Entry_Latest_Archive (insert_if_exist_update). At the end, the Entry_Latest_Archive table will have new entries inserted, and entries whose mmCIF file md5 changed will be updated (a sketch follows). 
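A sketch of that final update, assuming the standard ERMrest entity API (POST to insert, PUT to update) stands in for insert_if_exist_update; credentials and exact column alignment are glossed over:

```python
import requests

def update_entry_latest_archive(entry2upload, entry2archived, host="dev-aws.pdb-dev.org", catalog=2):
    """Insert rows for newly archived entries; update rows that already exist."""
    base = "https://%s/ermrest/catalog/%s/entity/PDB:Entry_Latest_Archive" % (host, catalog)
    new_rows = [row for key, row in entry2upload.items() if key not in entry2archived]
    mod_rows = [row for key, row in entry2upload.items() if key in entry2archived]
    if new_rows:
        requests.post(base, json=new_rows).raise_for_status()
    if mod_rows:
        requests.put(base, json=mod_rows).raise_for_status()
```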

#### Directory conventions

File 1: released_structures_last_modified_dates.json

Error handling

ACL needs to be updated for the new tables

brindakv commented 3 months ago

The following issues were found:

svoinea commented 3 months ago

@brindakv What do you mean by *.gz holdings files cannot be unzipped? I have copied them to my Mac, double-clicked them, and the *.json files are there.

For the filenames, I am using them as they are in the Entry_Generated_File table and selecting only those for which filename == <Accession_Code>.

For the xx, the issue specifies to use getHash, which returns 000 for the accession_code_mode == PDBDEV case.

Are your comments referring only to the accession_code_mode == PDB case?

At this point I am confused about the corrections I need to make for the accession_code_mode == PDBDEV case before switching to the accession_code_mode == PDB case.

brindakv commented 3 months ago

Update annotations to make the new tables visible to users via the menu.

aozalevsky commented 3 months ago

@svoinea current IHMValidation version is:

https://github.com/salilab/IHMValidation/tree/v1.2 or https://github.com/salilab/IHMValidation/commit/48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3

svoinea commented 3 months ago

This seems to be a misunderstanding. If v1.2 is a tag, then:

git log -1
commit 86c5267ce30904cc62907e195ea34f33858127cc (HEAD, tag: v1.2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date:   Mon Jul 22 20:10:39 2024 -0700

If taking the commit number:

git log -1
commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date:   Mon Jul 29 10:59:23 2024 -0700

So, production now has commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2), as it is the more recent one.

aozalevsky commented 3 months ago

@svoinea no confusion. I moved the release tag to the most recent commit in the release branch. If you were to deploy from scratch today, v1.2 would pull 48c841. Anyway, 48c841 is the correct commit.

brindakv commented 3 months ago

Update filename current_holdings.json.gz to current_file_holdings.json.gz.

Update the contents of this file to include validation_report after mmcif.

hongsudt commented 2 months ago

Code changes to support idempotency

Assuming that T1 is this week's cutoff time (e.g. used for setting Submission_Time) and T0 is the previous week's cutoff. If we use the following ERMrest queries, this should allow us to run the script multiple times during the submission week. Note: if we want to change the submission datetime policy, do not run any script of the new week before changing the setting.

Notes: if the reference time needs to be changed, it should be moved to more than 7 days after the latest submission time in the system. Otherwise, we need to modify the RCT of re-released entries that were submitted at the previous T0 to be the current T0, then change the RCT back after this cycle to reflect the actual event.
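Under that convention, T0 and T1 for the current run can be derived from the getArchiveDate helper sketched earlier:

```python
from datetime import datetime, timedelta, timezone

# T1: this week's cutoff (the Submission_Time to stamp); T0: the previous cutoff.
T1 = getArchiveDate(datetime.now(timezone.utc))
T0 = T1 - timedelta(weeks=1)
```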

# get a list of entries that were previously archived for entry2archived (same as above):
/ermrest/catalog/2/entity/E:=PDB:Entry_Latest_Archive/

# entry2upload: a list of entry ids to upload
The earlier query doesn't work since ERMrest doesn't support substitution in the comparison statement. 
We will need to do 2 queries, assuming that there is now a "Structure_Id" column in Entry_Latest_Archive:

# 1) Check for new released entries, e.g. new REL entries (released after T0). Note: the new entries released this cycle will be returned by the null constraint if the script runs for the first time (e.g. they haven't been picked up before). Subsequent runs of the script will get the new releases from the Entry_Latest_Archive entries that were generated within this cycle (e.g. RCT > T0 and Submission_Time = T1) 

https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
E:=PDB:entry/Workflow_Status=REL/
F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/
A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::<T0>&A:Submission_Time=<T1>)/
$E/E:RID,E:id,...

# Concrete example; this works:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/E:=PDB:entry/Workflow_Status=REL/F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::2024-08-22%2020%3A00%3A00-07%3A00&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$E/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,A:Entry,A:Submission_Time,A:mmCIF_URL

# 2) Get a list of re-released entries, e.g. entries whose mmCIF contents have changed since T0

https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
A:=PDB:Entry_Latest_Archive/A:RCT::leq::<T0>/
E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/
F:=left(A:mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=<T1>)/
$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time

# Concrete example.
# -- waiting to test during the 08/29 cycle:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/A:RCT::leq::2024-08-22T20%3A00%3A00-07%3A00/E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$A/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,F:File_Name,A:Entry,A:Submission_Time,A:mmCIF_URL

# -- this URL is no longer relevant (it only supports the cycle of 08/22)
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/
E:=(Entry)=(PDB:entry:RID)/Workflow_Status=REL/$A/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:File_URL::null::;(A:Submission_Time=2024-08-22%2015%3A00%3A00-07%3A00&A:RCT::leq::2024-08-15%2015%3A00%3A00-07%3A00)/$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time,A:RCT,A:RMT

Model changes

@brindakv asked to add two new columns to the PDB_Archive table to help break down the submitted entries:

brindakv commented 2 months ago
hongsudt commented 2 months ago

Add the following configuration parameter: cutoff_time_pacific: "Thursday 20:00"

hongsudt commented 2 months ago

@svoinea We will need to drop the btree indexes for "Submitted_Files" and "Submission_History" using psql. We should also turn off faceting and sorting in the annotations for these columns, since the btree indexes are now gone.