Unable to process assembly cif files that have 2 and 3 character auth_asym_ids

danny305 commented 1 year ago

The mmCIF file format allows auth_asym_ids to be up to 3 characters long while the PDB format limits them to 1 char. This fundamental difference seems to require refactoring all the chain logic from chat to char *.

I haven't written C/C++ since I last helped you implement the CIF capabilities (almost 3 years ago). I have been trying to implement these changes and I am in over my head and do not feel comfortable trying to implement this change in your C code, especially in all of your logic for processing the PDB files.

For me the problem is with line 612 on cif.cc where it explicitly select the first char of the chain and freesasa_node_atom_chain returns a char.

Here are some pdb codes where this fails when you use the assembly file from the RCSB:

7nzc
7cma
6ihe (this one fails for the wrong atom name after I process it with ChimeraX to add hydrogens--separate issue)

URL to wget the assembly files:

wget -O 7cma-assembly.cif https://files.rcsb.org/download/7cma-assembly1.cif

Tell me how you want to proceed. I can refactor the C++/cif logic from char to char * but dont think I can refactor the PDB logic, thats all in C and you are doing some intense memory management.

mittinatten commented 1 year ago

Hi, I have barely written any C myself since then either :)

I had a quick glance at the code, and I think changing the representation from char to char * in structure.c should be relatively straight-forward, although a bit tedious (propagating to a bunch of files across the code base). I don't remember the reason the other strings on the atom struct are dynamically allocated, and not fixed length arrays, but that might have made the code a bit simpler. As mentioned in a comment at the top of the file, this is the messiest part of the whole code base, and if this were still my day job I would have put som effort into ironing it out a little. I'm not surprised you struggled changing it.

The main challenge I see with changing the representation however, is that the chain selection logic relies on chains being labelled by single characters (i.e. in the -g command line option and probably also --select). In addition, at least the fixed width RSA output format is not compatible. So, that would require some though to make the changes as unintrusive as possible, and will most likely involve breaking changes.

Taken together, it's a bit more than I have capacity for in my spare time at the moment, I'm afraid. But I'll consider it, to see if any simplifications come to mind, that would make it feasible.

mittinatten commented 1 year ago

I changed the code to internally store the chain-label in a char[4] instead of a char (see 7a6d0fedcafb57128f7713962452d7176f9fe6dc). To not break any interfaces I read out chain_label[0] where these labels are accessed from the atom struct in structure.c, and there is still no way of adding 3-letter labels.

This was the easy part, still not sure what to do about the rest, but the following steps seem like a minimalistic approach to assess if we can add this change without too much repercussions:

Convert the internal function structure_add_atom_wopt to take a char * instead of char for chain_label,
Have freesasa_structure_add_atom and freesasa_structure_add_cif_atom modify its char accordingly (that way we keep the external interface intact).
Add a new function called freesasa_structure_add_cif_atom2 or something like that accepts a char * instead.
Have the CIF input code use the function in 3. and see what works and what not in the analysis and output parts of the code.

If we can get through 4. without crashing the app and without very confusing output, one could possibly use the feature as is, and publish a caveat that chain-based analysis might break for 2 and 3-letter auth_asym_ids. E.g. chains AAA and AAB will be merged, which is probably not an issue for a lot of use cases. We could even detect those and log a warning about limitations.

If this works out, we should make the new function in step 3 the default one in the next major release, to avoid the confusion of having two almost identical functions in the long run. But I think we can wait with that a bit.

danny305 commented 1 year ago

I was getting ready to respond to you and then I saw your latest response. I was deep into refactoring the code to move to char * last week but had to stop half way for a deadline I have next week. I think I already did 1 and 2 and 3. I will need to go back and look at it what I actually did.

I haven't committed anything yet but have made lots of changes. Im nervous I am just breaking everything lol. I'll follow up the following week when I am at NeurIPS if I have a chance to dive back into it and review your latest commit with char[4].

danny305 commented 1 year ago

Thanks for finding some time to look into how we would do this refactoring. I dont understand why AAA and AAB would be merged with char[4]?

mittinatten commented 1 year ago

Im nervous I am just breaking everything lol.

I hear you! The test suite is quite comprehensive though, so as long as that passes you can be relatively confident that nothing important is broken. The only exception is testing automatically that the clean up code doesn't miss anything (perhaps there are testing libraries/techniques for that too that I'm not aware of). So one has to be a bit careful not to introduce memory leaks (not so important for the CLI, but quite so for the C and Python APIs).

When I wrote most of the code back in 2015, I used valgrind to manually verify memory management, but that stopped working on Mac OS a few years back, and I haven't found another solution (other than spinning up a Linux environment).

The important cleanup code is in functions such as the ..._dealloc functions in structure.c, and they shouldn't be necessary to change for this project. The cleanup code that makes sure everything is clean if malloc and its siblings fail is probably over-engineering on my part, if malloc fails, you probably have more serious problems than cleaning up a few arrays, and a crash is probably inevitable anyway. That level of error handling makes sense for systems code, not so much for scientific tools (I have considered simply removing it).

I dont understand why AAA and AAB would be merged with char[4]?

As long as functions such as freesasa_structure_residue_chain() return a char and freesasa_structure_chain_atoms() accept a char for the chain_label, the output will only use the first letter of the chain name. I.e. the difference will be stored internally but not reflected in the output.

Once we change those functions, we have a breaking change, which will need to be coordinated across all output options and also eventually the Python bindings. Most changes will be more or less trivial, and the compiler will hopefully tell us when we miss something. I think it's only freesasa_structure_get_chains() that will cause complicated changes that require designing something new, so perhaps a viable trajectory would be to change all the others and add a caveat to freesasa_structure_get_chains() (and all the functionality that relies on it, i.e. at least the --chain-groups CLI option). I have some ideas how one could change that in a backwards compatible way, but it should preferrably be a project of its own. (I am also not sure how this will affect the --select functionality.)

One option is to support two variants of these functions across the board, i.e. add new functions with a suffix in the function name, e.g. _lcl (for long chain label). This allows us to gradually move over the output modules one by one to the new form, and add tests for the new forms, without breaking anything. Then, when everything is migrated, we can remove the suffix, delete the old variants and publish a major release with the new interface.

I prefer working gradually like that, instead of creating a major PR with 1000s of lines of code. This also allows me to fix unrelated bugs along the way, if anything comes up, without having to sync a feature branch. Also, users who work directly from the master branch will be able to give us early notice if we break anything that isn't caught by the tests (not sure if anyone uses the code in that way though).

mittinatten commented 11 months ago

I have updated structure.c to use the long form everywhere internally, and just hide it at the interfaces. The next step is to duplicate most of the following functions to versions that support long-form chain-labels:

[ ] freesasa_structure_add_atom()
[ ] freesasa_structure_add_atom_wopt()
[x] freesasa_structure_add_cif_atom() -> freesasa_structure_add_cif_atom_lcl() (ceed1b0)
[x] freesasa_structure_get_chains()
[x] freesasa_structure_chain_labels() -> freesasa_structure_number_chains() and freesasa_structure_chain_label() (cf92868)
[x] freesasa_structure_atom_chain() > freesasa_structure_atom_chain_lcl() (34a3a34)
[x] freesasa_residue_chain() -> freesasa_residue_chain_lcl() (4ff453c)
[x] freesasa_chain_atoms() -> freesasa_chain_atoms_lcl() (a8a15cc)
[x] freesasa_chain_residues() -> freesasa_chain_residues_lcl() (a8a15cc)

The majority will be trivial changes, most of the work will probably be in adding tests.

mittinatten commented 9 months ago

Hi again Danny, Finally got around to looking at this again. There is still some work left before all parts of the API and CLI are consistent, but I have added the code necessary to get correct output for these auth_sym_ids in commit e6783aecf3cc61ed2219a6a7ebe76626cf8778e3. The only visible change for users so far is some spacing in the fixed width RSA and default/log formats (to make room for 2 more letters). The other output formats should be unchanged for inputs with 1-letter chains (and worked out of the box for the 3-letter variants).

I tested that it works for two of the assemblies you provided. Can you verify that it works as expected?

Simon

danny305 commented 8 months ago

Hi Simon,

Thanks for working on this. I have been swamped but will have bandwidth in the next two weeks to review your changes and provide you some feedback! To using this updated version in our machine learning pipelines.

danny305 commented 8 months ago

So we are still unable to write out cif files. It seems like the error is in the freesasa_node_atom_chain function, which queries the atom properties struct for the chain id. The atom node still saves the chain as a char rather than a char *.

Here is the line for the error: https://github.com/mittinatten/freesasa/blob/e6783aecf3cc61ed2219a6a7ebe76626cf8778e3/src/cif.cc#L606

It also needs to get updated to return more than a string of length 1.

danny305 commented 8 months ago

I think you should set up a pull request for this issue with the current commit you shared. I'll see if I can also take a hack at this when I get back from vacation the week after next.

jamesmloy commented 8 months ago

Hi Simon-- I was able to take a look at this today and made a PR with a fix. Let me know what you think.

mittinatten commented 8 months ago

Thanks for pointing this out, I have merged the fix.

mittinatten / freesasa

Unable to process assembly cif files that have 2 and 3 character auth_asym_ids #96