Open hongsudt opened 7 months ago
The following issues are found:
- `*.gz` holdings files cannot be unzipped.
- `data/entries` sub-folder:
  - `/pdb_ihm/data/entries/{xx}/{PDB_ID.lower()}/structures/{PDB_ID.lower()}.cif.gz`
  - `/pdb_ihm/data/entries/{xx}/{PDB_ID.lower()}/validation_reports/{PDB_ID.lower()}_full_validation.pdf.gz`
  - `/pdb_ihm/data/entries/{xx}/{PDB_ID.lower()}/validation_reports/{PDB_ID.lower()}_summary_validation.pdf.gz`
@brindakv What do you mean by "`*.gz` holdings files cannot be unzipped"? I have copied them on my Mac, double-clicked them, and the `*.json` files are there.

For the filenames, I am using them as they are in the `Entry_Generated_File` table and selecting only those for which `filename == <Accession_Code>`.

For the `xx`, the issue specifies to use `getHash`, which returns `000` for `accession_code_mode == PDBDEV`.

Are your comments referring only to the `accession_code_mode == PDB` case? I am confused at this point about the corrections I need to make for the `accession_code_mode == PDBDEV` case before switching to the `accession_code_mode == PDB` case.
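As described in this thread, the hash sub-directory rule differs by mode. A minimal sketch (the function name `getHash` is from the issue; this snake_case signature and the two-argument form are my assumptions):

```python
def get_hash(accession_code: str, accession_code_mode: str) -> str:
    """Return the hash sub-directory component for an entry.

    Per this thread: PDBDEV-mode entries all hash to "000";
    otherwise the 2nd and 3rd characters of the accession code
    (lower case) are used.
    """
    if accession_code_mode == "PDBDEV":
        return "000"
    return accession_code[1:3].lower()
```

For example, `get_hash("9A1B", "PDB")` yields `"a1"` (the accession code here is hypothetical).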
Update annotations to make the new tables visible to users via the menu.
@svoinea current IHMValidation version is:
https://github.com/salilab/IHMValidation/tree/v1.2 or https://github.com/salilab/IHMValidation/commit/48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3
Seems to me a misunderstanding. If v1.2
is a tag, then:
git log -1
commit 86c5267ce30904cc62907e195ea34f33858127cc (HEAD, tag: v1.2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date: Mon Jul 22 20:10:39 2024 -0700
If taking the commit number:
git log -1
commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2)
Author: Arthur Zalevsky <aozalevsky@gmail.com>
Date: Mon Jul 29 10:59:23 2024 -0700
So production now has commit 48c8415ddcf19e6a5f607e96bf8a21b0e05bf5c3 (HEAD, origin/release-1-2), as it is the more recent one.
@svoinea No confusion. I moved the release tag to the most recent commit on the release branch. If you were to deploy from scratch today, `v1.2` would pull 48c841. In any case, 48c841 is the correct commit.
Update filename `current_holdings.json.gz` to `current_file_holdings.json.gz`. Update the contents of this file to include `validation_report` after `mmcif`.
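To illustrate the requested change, a sketch of what one entry in `current_file_holdings.json` might look like with `validation_report` listed after `mmcif`. The per-category list shape is taken from the `Submitted_Files` example later in this issue; the key `"PDB_ID"` and the file paths are hypothetical placeholders, not actual archive contents:

```python
import json

# Hypothetical per-entry holdings record: "mmcif" first, then
# "validation_report", as requested in the comment above.
holdings = {
    "PDB_ID": {
        "mmcif": [
            "/pdb_ihm/data/entries/000/pdb_id/structures/pdb_id.cif.gz",
        ],
        "validation_report": [
            "/pdb_ihm/data/entries/000/pdb_id/validation_reports/pdb_id_full_validation.pdf.gz",
            "/pdb_ihm/data/entries/000/pdb_id/validation_reports/pdb_id_summary_validation.pdf.gz",
        ],
    },
}
print(json.dumps(holdings, indent=2))
```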
Assuming that T1 is this week's cutoff time (e.g., used for setting Submission_Time), the following ERMrest queries should allow us to run the script multiple times during the submission week. Note: if we want to change the submission-datetime policy, do not run any script for the new week before changing the setting.
Note: if the reference time needs to be changed, it should be more than 7 days after the latest submission time in the system. Otherwise, we need to modify the RCT of re-released entries that were submitted at the previous T0 to be the current T0, then change the RCT back after this cycle to reflect the actual event.
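One way to derive T1 (this week's cutoff) and T0 (the previous cutoff) in code, assuming a Thursday 20:00 Pacific cutoff as configured elsewhere in this issue. The function name and the use of `zoneinfo` are my choices, not part of the pipeline:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
CUTOFF_WEEKDAY = 3   # Thursday (Monday == 0)
CUTOFF_HOUR = 20     # 20:00 Pacific

def latest_cutoff(now: datetime) -> datetime:
    """Return T1: the most recent cutoff instant at or before `now`."""
    candidate = now.astimezone(PACIFIC).replace(
        hour=CUTOFF_HOUR, minute=0, second=0, microsecond=0)
    # Step back to the cutoff weekday, then one more week if the
    # candidate is still in the future.
    candidate -= timedelta(days=(candidate.weekday() - CUTOFF_WEEKDAY) % 7)
    if candidate > now:
        candidate -= timedelta(days=7)
    return candidate

# T0 is simply one cycle (7 days) before T1.
t1 = latest_cutoff(datetime(2024, 8, 30, 12, 0, tzinfo=PACIFIC))
t0 = t1 - timedelta(days=7)
```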
# get a list of entries that were previously archived for entry2archived (same as above):
/ermrest/catalog/2/entity/E:=PDB:Entry_Latest_Archive/
# entry2upload: a list of entry ids to upload
The earlier query doesn't work since ermrest doesn't support substitution in the comparison statement.
We will need to do 2 queries assuming that there is now a "Structure_Id" column in Entry_Latest_Archive:
# 1) Check for newly released entries, i.e., new REL entries (released after T0). Note: the entries released this cycle will be returned by the null constraint if the script runs for the first time (i.e., they haven't been picked up before). Subsequent runs of the script will get the new releases from Entry_Latest_Archive rows that were generated within this cycle (i.e., RCT > T0 and Submission_Time = T1).
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
E:=PDB:entry/Workflow_Status=REL/
F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/
A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::<T0>&A:Submission_Time=<T1>)/
$E/E:RID,E:id,...
# concrete example; this works:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/E:=PDB:entry/Workflow_Status=REL/F:=(id)=(PDB:Entry_Generated_File:Structure_Id)/File_Type=mmCIF/A:=left(E:RID)=(PDB:Entry_Latest_Archive:Entry)/A:RID::null::;(A:RCT::gt::2024-08-22%2020%3A00%3A00-07%3A00&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$E/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,A:Entry,A:Submission_Time,A:mmCIF_URL
# 2) Get a list of re-released entries, i.e., entries whose mmCIF contents have changed since T0
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/
A:=PDB:Entry_Latest_Archive/A:RCT::leq::<T0>/
E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/
F:=left(A:mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=<T1>)/
$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time
# concrete example.
# -- waiting to test during 08/29 cycle:
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/A:RCT::leq::2024-08-22T20%3A00%3A00-07%3A00/E:=(A:Entry)=(PDB:entry:RID)/Workflow_Status=REL/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:RID::null::;(F:File_Type=mmCIF&A:Submission_Time=2024-08-29%2020%3A00%3A00-07%3A00)/$A/E:RID,E:id,E:Deposit_Date,E:Accession_Code,F:File_URL,F:File_Name,A:Entry,A:Submission_Time,A:mmCIF_URL
# -- this URL is no longer relevant (it only supports the cycle of 08/22)
https://dev-aws.pdb-dev.org/ermrest/catalog/50/attribute/A:=PDB:Entry_Latest_Archive/
E:=(Entry)=(PDB:entry:RID)/Workflow_Status=REL/$A/F:=left(mmCIF_URL)=(PDB:Entry_Generated_File:File_URL)/F:File_URL::null::;(A:Submission_Time=2024-08-22%2015%3A00%3A00-07%3A00&A:RCT::leq::2024-08-15%2015%3A00%3A00-07%3A00)/$A/A:Entry,Archived_mmCIF_URL:=A:mmCIF_URL,Archived_Time:=A:Submission_Time,A:RCT,A:RMT
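The concrete query URLs above percent-encode the T0/T1 timestamps. A sketch of building such a predicate fragment programmatically, so the script can substitute each cycle's timestamps (the timestamp strings here are the 08/22 and 08/29 examples from this comment):

```python
from urllib.parse import quote

t0 = "2024-08-22 20:00:00-07:00"
t1 = "2024-08-29 20:00:00-07:00"

# safe="" encodes every reserved character (':' -> %3A, ' ' -> %20),
# so the timestamps are safe inside an ERMrest predicate.
filter_fragment = "A:RID::null::;(A:RCT::gt::%s&A:Submission_Time=%s)" % (
    quote(t0, safe=""), quote(t1, safe=""))
print(filter_fragment)
```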
Appending (as `jsonb`) to the `Submission_History` column:
Assuming row is the existing Entry_Latest_Archive row before updating:
row["Submission_History"].update({
    row["Submission_Time"]: {
        "mmCIF_URL": row["mmCIF_URL"],
        "Submitted_Files": row["Submitted_Files"]
    }
})
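A self-contained version of that update, with a fabricated example row (the `/hatrac/pdb/.../` paths are placeholders, not real archive URLs):

```python
# Hypothetical Entry_Latest_Archive row before updating.
row = {
    "Submission_Time": "2024-08-29 20:00:00-07:00",
    "mmCIF_URL": "/hatrac/pdb/.../entry.cif",
    "Submitted_Files": {"mmcif": ["/hatrac/pdb/.../entry.cif"]},
    "Submission_History": {},
}

# Key the history on the submission time, so each weekly cycle adds
# (or overwrites) exactly one snapshot of what was archived.
row["Submission_History"].update({
    row["Submission_Time"]: {
        "mmCIF_URL": row["mmCIF_URL"],
        "Submitted_Files": row["Submitted_Files"],
    }
})
```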
@brindakv asked to add the two new columns to the `PDB_Archive` table to help break down the submitted entries:
Add the following configuration parameter: `cutoff_time_pacific: "Thursday 20:00"`
@svoinea We will need to drop the btree indexes for "Submitted_Files" and "Submission_History" using psql.
We should also turn off faceting and sorting for these columns in the annotations, since the btree indexes are now gone.
Requirements
Three holdings files to be generated.
For each released entry, the following files are to be transferred.
`xx`: the 2nd and 3rd characters of `PDB_Accession_Code` (lower case).
Model changes
`PDB.PDB_Archive` with the following columns:
- `Submission_Time`: timestamptz --> get from getArchiveDate(Now())
- `Submitted_Entries`: number of submitted entries for this archival submission event
- `Current_File_Holdings_URL` (the current listing of entries and files in the pdb-ihm archive)
- `Released_Structures_LMD_URL` (the last modified date for all released entries in the pdb-ihm archive)
- `Unreleased_Entries_URL` (metadata about the list of unreleased entries, i.e., the entries on HOLD with accession-code issues)

Holdings files go under `/hatrac/pdb/generated/archive/pdb_ihm/holdings/<year>/<Submission_Date>`.
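A sketch of building that holdings location from a submission timestamp. The `<Submission_Date>` format (YYYY-MM-DD) is my assumption; the path prefix is from the requirement above:

```python
from datetime import datetime

def holdings_path(submission_time: datetime) -> str:
    """Build the hatrac holdings directory for one archival event.

    Assumes <Submission_Date> is rendered as ISO YYYY-MM-DD.
    """
    return "/hatrac/pdb/generated/archive/pdb_ihm/holdings/%d/%s" % (
        submission_time.year, submission_time.date().isoformat())
```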
`PDB.Entry_Latest_Archive` with the following columns:
- `Entry`: fkey to Entry.RID
- `mmCIF_URL`: text --> versioned mmCIF last archived (Note: this is better than MD5)
- `Submitted_Files`, e.g. `{"mmcif": ["/hatrac/pdb/..../entry_1234.mmcif"], "validation_report": ["/hatrac/pdb/.../report1.pdf", "/hatrac/pdb/.../report1_summary.pdf"]}`
`Vocab.Archive_Category`
Add `Archive_Category` to `Vocab.System_Generated_File_Type` to identify the category that the doc is supposed to be filed into when archived to PDB. When it is unset, it means DO NOT ARCHIVE.
Update `System_Generated_File_Type`:
- "mmCIF" -- "mmcif"
- "Validation: Full PDF" -- "validation_report"
- "Validation: Summary PDF" -- "validation_report"
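The mapping above, expressed as a lookup with the DO-NOT-ARCHIVE rule made explicit (in the pipeline this would come from the Vocab table via ERMrest; the hardcoded dict here is only illustrative):

```python
# Archive_Category per System_Generated_File_Type, from the table above.
# A missing/unset category means DO NOT ARCHIVE.
ARCHIVE_CATEGORY = {
    "mmCIF": "mmcif",
    "Validation: Full PDF": "validation_report",
    "Validation: Summary PDF": "validation_report",
}

def should_archive(file_type: str) -> bool:
    """True only when the file type has an Archive_Category set."""
    return ARCHIVE_CATEGORY.get(file_type) is not None
```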
parameters or config
- scratch_dir =
- archive_directory = "pdb_ihm"
- holding_directory = "%s/holdings" % (archive_directory)
- data_directory = "%s/data" % (archive_directory)
- Set up archive_category_dir_names at the beginning of the pipeline: set once, use throughout. It is derived from `ermrest/catalog/1/attribute/Vocab.Archive_Category/Name,Directory_Name`; the key is the Name and the value is Directory_Name.

Note: entry_id should be the entry's PDB Accession Code. @brindakv Is this right? If so, we can change entry_id to entry_accession_code.
def getFileArchiveSubDirectory(entry_id, archive_category):
    global archive_category_dir_names
    global data_directory
    # entry_dir = {data_directory}/entries/{hash}/{accession_code}
    entry_dir = "%s/entries/%s/%s/" % (data_directory, getHash(entry_id), entry_id)
    return entry_dir + archive_category_dir_names[archive_category]

/ermrest/catalog/2/entity/E:=PDB:Entry_Archive/
entry2archived: { "entry1": {"RID": "xx", "Entry":"xx", "mmCIF_URL":"xx", "Submitted_Files":"",..., "Submission_Time":"xx"} }
entry2upload: a list of entry ids
/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=REL/ F:=(structure_id)=(PDB:System_Generated_File:structure_id)/File_Type=mmCIF/ A:=(structure_id)=left(PDB:Entry_Latest_Archive)/mmCIF_URL::null::;mmCIF_URL::neq::F:File_URL/$F/E:structure_id,F:File_URL,...
entry2upload_files: a list of files to be archived. Note: will have to address the URL-length limitation if there are many RIDs.
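One way to handle that URL-length limitation is to batch the RID list and issue one query per batch. A sketch (the batch size is arbitrary, not a measured ERMrest limit):

```python
from typing import Iterator, List

def chunk_rids(rids: List[str], max_per_query: int = 100) -> Iterator[List[str]]:
    """Yield RID batches so each ERMrest URL stays below the length limit.

    max_per_query is a placeholder; tune it to the server's actual limit.
    """
    for i in range(0, len(rids), max_per_query):
        yield rids[i:i + max_per_query]
```

Each batch would then be joined (e.g. with `",".join(batch)`) into one `Any(...)` predicate, and the per-batch results concatenated.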
Question: @brindakv will we ever go from HOLD to REL?
/ermrest/catalog/2/entity/V:=Vocab:System_Generated_File_Type/!Archive_Category::null::/F:=(Name)=(PDB:System_Generated_Files)/Entry=Any(",".join(entry2upload.keys()))
Note: the Submission_Time of these entries should be set to getArchiveDate(Now()).
When generating the path, assume that the data path is at root, e.g.:

for entry in entry2archived:
    for category, files in entry2archived[entry]["Submitted_Files"].items():
        for file in files:
            path = "/%s/%s/%s" % (getFileArchiveSubDirectory(entry2archived[entry]["Accession_Code"], category), category, file)
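The sub-directory helper and path loop above can be combined into a runnable sketch. The `archive_category_dir_names` values ("structures", "validation_reports") are inferred from the holdings paths reported at the top of this issue, and the accession code is hypothetical; only the PDB-mode hash branch is shown:

```python
data_directory = "pdb_ihm/data"

# Assumed Name -> Directory_Name mapping from Vocab.Archive_Category.
archive_category_dir_names = {
    "mmcif": "structures",
    "validation_report": "validation_reports",
}

def get_hash(accession_code):
    # 2nd and 3rd characters, lower case (PDBDEV mode would return "000").
    return accession_code[1:3].lower()

def get_file_archive_sub_directory(accession_code, archive_category):
    """Return data/entries/<hash>/<accession>/<category_dir> for one file."""
    entry_dir = "%s/entries/%s/%s" % (
        data_directory, get_hash(accession_code), accession_code.lower())
    return "%s/%s" % (entry_dir, archive_category_dir_names[archive_category])
```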
ermrest query to get a list of holding entries
/ermrest/catalog/2/attribute/E:=PDB:Entry/Status=HOLD/
File 1: released_structures_last_modified_dates.json
File 2: current_file_holdings.json
File 3: unreleased_entries.json
cron job
- (production only)
- (production only)

Error handling
If `entry.Accession_Code` is missing, or if system-generated files are missing, exclude the entry from REL/HOLD for the week and send an email with the error message so that curators can fix it for next week's release.
1-PQRS: HOLD; 2-PQRS: REL
ACL needs to be updated for the new tables