Closed pgbarletta closed 12 months ago
I can point to a few efforts for some binary formats for large scale analyses:
Also nothing stops you from using MMTF to convert PDB files into a binary format. Even though the tools around it are not being actively developed, I would assume that most of them still work fine.
Thanks, hopefully one of these ends up being offered alongside PDBs, PDBxs.
BinaryCIF is offered alongside PDB and mmCIF (see e.g. here)
awesome, I had missed that! Thanks again for taking the time.
There is also mmJSON (the content of mmCIF in JSON) available from PDBj.
One additional fresh data point is that as of the last PDB update this week, there's a structure 8ckb that broke the 4-character author chain id assumption in the mmtf spec. Note that the RCSB PDB still provides up-to-date mmtf files for the whole PDB archive, but will be discontinuing that feature in the near future (announcements will be made).
Perhaps it could be remediated? The mmCIF format is badly under-specified, so naming the chains mychain001 is formally correct, but it will cause problems in many programs.
Just to confirm - MMTF has no future? I have a library which has MMTF as one of the four filetypes it parses and which is still actively developed - would it make sense to drop this and focus on the other three? (BinaryCIF is one of the other three.)
Just to confirm - MMTF has no future?
Correct. We recommend that the community moves to BinaryCIF as an efficient (whilst metadata-complete) file format. BinaryCIF is backed by the whole wwPDB.
MMTF will continue to exist since the tools are still available (though only minimally maintained), but MMTF files for the PDB archive will soon not be produced anymore by RCSB PDB.
@samirelanduk IIUC you wrote both BinaryCIF and MMTF parsers. How do they compare in terms of complexity and speed?
One really nice property of MMTF is that it explicitly stores all bonds, and does not depend on external dictionaries like the Chemical Component Dictionary. Can BinaryCIF also provide that?
Yes, one has to include the chem_comp_bond categories in the BinaryCIF file. There appears to be some effort to do so in the PDB member provided files, see https://www.wwpdb.org/news/news?year=2023#649f0801d78e004e766a9680.
Thanks Alex. That sounds like the "Updated mmCIF file" downloads provided by the PDBe. This helps a lot. It's still more complex than MMTF when you consider ALT locations, and you have to get the linkage right.
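To make the ALT location point concrete, here is a minimal sketch of how intra-residue bonds could be derived from chem_comp_bond rows. The `Atom` tuple and the `intra_residue_bonds` helper are hypothetical names made up for illustration, not part of any library. The sketch assumes the rows have already been read into plain tuples, treats an empty alt ID as "no alternate location", and does not cover linkage between residues at all.

```python
from collections import defaultdict
from typing import NamedTuple


class Atom(NamedTuple):
    # Hypothetical minimal atom record; field names are illustrative only.
    chain: str
    seq_id: int
    comp_id: str
    atom_id: str
    alt_id: str  # "" when the atom has no alternate location


def intra_residue_bonds(atoms, comp_bonds):
    """Return (i, j, order) bonds between atom indices within each residue.

    comp_bonds is an iterable of (comp_id, atom_id_1, atom_id_2, value_order)
    tuples, i.e. the columns of a chem_comp_bond category.
    """
    # Group the bond templates by component.
    template = defaultdict(list)
    for comp_id, a1, a2, order in comp_bonds:
        template[comp_id].append((a1, a2, order))

    # Group atom indices by residue, then by atom name. One name can map to
    # several indices when alternate locations are present.
    residues = defaultdict(lambda: defaultdict(list))
    for index, atom in enumerate(atoms):
        residues[(atom.chain, atom.seq_id, atom.comp_id)][atom.atom_id].append(index)

    bonds = []
    for (_chain, _seq_id, comp_id), by_name in residues.items():
        for a1, a2, order in template.get(comp_id, ()):
            for i in by_name.get(a1, ()):
                for j in by_name.get(a2, ()):
                    alt_i, alt_j = atoms[i].alt_id, atoms[j].alt_id
                    # Bond only atoms whose alt IDs are compatible:
                    # at least one is blank, or both are the same.
                    if not alt_i or not alt_j or alt_i == alt_j:
                        bonds.append((i, j, order))
    return bonds


atoms = [
    Atom("A", 1, "ALA", "N", ""),
    Atom("A", 1, "ALA", "CA", "A"),
    Atom("A", 1, "ALA", "CA", "B"),
    Atom("A", 1, "ALA", "CB", "A"),
    Atom("A", 1, "ALA", "CB", "B"),
]
comp_bonds = [("ALA", "N", "CA", "SING"), ("ALA", "CA", "CB", "SING")]
bonds = intra_residue_bonds(atoms, comp_bonds)
```

In this toy example N is bonded to both CA conformers, while each CB is bonded only to the CA with the same alt ID. Bonds between residues (peptide bonds, struct_conn records) need separate handling, which is the part MMTF did for you.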
One other difference between BinaryCIF and MMTF is that MMTF contains the transformation matrices to create the biological assemblies. In CIF this is not straightforward since the transformations are encoded for complex assemblies, such as viruses. For example, for PDB 1M4X the transformation matrices must be created from the _pdbx_struct_assembly_gen.oper_expression values:

loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 '(1-60)(61-88)' A,B,C
2 '(61-88)' A,B,C
3 '(1-5)(61-88)' A,B,C
4 '(1,2,6,10,23,24)(61-88)' A,B,C
5 '(1-5)(63-68)' A,B,C
6 '(1,10,23)(61,62,69-88)' A,B,C
7 '(P)(61-88)' A,B,C
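As a rough illustration of the extra work involved, the sketch below expands such an expression into the combinations of operator IDs it stands for. The function names are hypothetical, and it assumes that a hyphen always denotes a range of integer IDs. Each resulting tuple names operators from _pdbx_struct_oper_list whose matrices then have to be multiplied together; the multiplication order should be checked against the mmCIF dictionary and is not handled here.

```python
import re
from itertools import product


def parse_group(group):
    """Expand one group such as '1-5', '1,2,6' or 'P' into a list of IDs."""
    ids = []
    for token in group.split(","):
        token = token.strip()
        if "-" in token:
            # Assumption: a hyphen always means an inclusive integer range.
            start, end = token.split("-")
            ids.extend(str(i) for i in range(int(start), int(end) + 1))
        else:
            ids.append(token)
    return ids


def expand_oper_expression(expr):
    """Return every combination of operator IDs encoded by the expression."""
    expr = expr.strip()
    if "(" in expr:
        groups = re.findall(r"\(([^()]*)\)", expr)
    else:
        # Plain form without parentheses, e.g. '1,2,3'.
        groups = [expr]
    # Cartesian product: one operator ID taken from each group.
    return list(product(*(parse_group(g) for g in groups)))


combos = expand_oper_expression("(1-60)(61-88)")
print(len(combos))   # number of operator pairs for assembly 1
print(combos[0])     # first pair of operator IDs
```

For assembly 1 of 1M4X this yields 60 x 28 = 1680 operator pairs, each of which needs its own combined matrix, whereas MMTF stored the final matrices directly.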
Another difference is how secondary structure is handled. CIF contains secondary structure that has been assigned by different algorithms over the past 5 decades. In contrast, MMTF contains consistently calculated DSSP secondary structure for all structures in the PDB.
@wojdyr I've found BinaryCIF is much less complex to parse: once you've performed the decompression you basically have the mmCIF ready to go, whereas with MMTF you have to decompress and then perform the additional step of constructing the mmCIF tables you want from the MMTF dictionary. I haven't done a parsing speed comparison, but I'd be surprised if there was a big difference.
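For readers wondering what "performing the decompression" involves, here is a minimal sketch of undoing two of the integer encodings BinaryCIF uses, run-length and delta. It assumes the column has already been unpacked from MessagePack and its bytes turned into a list of integers, and that each encoding is described by a dict with a `kind` key (plus `origin` for delta), which is my reading of the BinaryCIF specification. The function names are hypothetical and the other encodings are not covered.

```python
def decode_run_length(data):
    """Expand [value, count, value, count, ...] into the repeated values."""
    out = []
    for i in range(0, len(data), 2):
        out.extend([data[i]] * data[i + 1])
    return out


def decode_delta(data, origin=0):
    """Rebuild values from successive differences, starting at origin."""
    out = []
    running = origin
    for diff in data:
        running += diff
        out.append(running)
    return out


def decode_column(data, encodings):
    """Undo a chain of encodings, last applied first."""
    for enc in reversed(encodings):
        kind = enc["kind"]
        if kind == "RunLength":
            data = decode_run_length(data)
        elif kind == "Delta":
            data = decode_delta(data, enc.get("origin", 0))
        else:
            # ByteArray, IntegerPacking, FixedPoint, etc. are left out here.
            raise NotImplementedError(kind)
    return data


# Atom serial numbers 1..6 stored as delta, then run-length encoded.
encodings = [{"kind": "Delta", "origin": 1}, {"kind": "RunLength"}]
serials = decode_column([0, 1, 1, 5], encodings)
```

Once every column of a category has been decoded this way, the result is already the mmCIF table, which is why no further restructuring step is needed.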
An update: the MMTF-format files produced weekly for the entire PDB archive and served at mmtf.rcsb.org will be deprecated on the 2nd of July 2024. See the announcement
By the way, that structure that broke the four character chain name assumption, 8ckb, has been remediated in PDBe and I was told that the chain names will be limited to four characters now.
What's the rationale for replacing MMTF with BinaryCIF?
8ckb, has been remediated in PDBe and I was told that the chain names will be limited to four characters now.
Oh nice, I hadn't seen that. Thanks. The corresponding MMTF file should be out next week.
What's the rationale for replacing MMTF with BinaryCIF?
The main reason is that BinaryCIF is metadata complete and fully compatible with the mmCIF dictionary. As such it is more future proof and offers more flexibility. Plus it is a much more appropriate format for transmitting the archival data. Much of what we learned from MMTF went into designing BinaryCIF, so it is an evolution of the format.
In any case, I am sure we will continue seeing many more developments in this topic, given the data deluge coming from AI-produced models. Some of the new developments were pointed out by @gtauriello above.
I'd appreciate it if anyone would spare 5 mins. I'm training models on the PDB and I'd love a binary format alternative for these kinds of tasks