Support for CIF format - Githubissues

danny305 commented 4 years ago

We use freeSASA in our ML data engineering stack and we are currently moving our stack from the pdb file format to the new format (mmCIF/CIF). I am wondering if the freeSASA library developers plan on moving the library to the new format anytime soon?

Danny

mittinatten commented 4 years ago

Hi, I am the sole developer and I have a completely different day job nowadays, so my capacity for adding new features is somewhat limited (I have only done bug fixes the last 2 years).

I agree that mmCIF support would be a useful addition, and it has been requested before. I am definitely willing to review PRs and discuss strategies with any volunteers, but I'm not sure when I would be able to write much code myself, unfortunately. One possible, relatively simple approach would be to use a library that can parse mmCIF and then map the output to the lower level either C or Python APIs of FreeSASA. Perhaps this could also be integrated into the library, licenses permitting. (Can't find any license information here https://sw-tools.rcsb.org/apps/CIFPARSE-OBJ/index.html)

danny305 commented 4 years ago

I have forwarded your message to my team and would provide you a more thorough answer here in the next few days.

One more thing I would like to discuss with you. From what I understand from your library, it only provided the SASA value for protein atoms? Does is it also capable of providing the SASA value for glycans, ligands, metals, and any other type of non protein atoms in the pdb file? If not would you be willing to work with us on possibly adding these feature? This is something that is very important for us to expand our ML models that utilize SASA.

Danny

mittinatten commented 4 years ago

Great!

Regarding non-protein atoms: There is a flag to support HETATM entries, se the bottom of this doc page: http://freesasa.github.io/doxygen/CLI.html. But you should probably supply a configuration file with atomic radii for the extra atom types.

The default classifier recognizes some of the most common elements and has a table of VdW radii, but not for exotic elements.

For organic compounds you can use https://github.com/mittinatten/freesasa/blob/master/scripts/chemcomp2config.pl to generate classifier configurations, to be used with the ProtOr radii which take the chemical environment into account, i.e. is better than simple VdW values. I used this to generate the configurations for amino acids and nucleic acids in https://github.com/mittinatten/freesasa/blob/master/share/protor.config.

Simon

mittinatten commented 4 years ago

Here is the table of known elements https://github.com/mittinatten/freesasa/blob/master/src/classifier.c#L865

danny305 commented 4 years ago

Simon,

I am currently going through my qualifying exams and have to present late next week. After that is over, I would love to work with you and adding cif compatibility. Let's talk some more on how to achieve this then?

mittinatten commented 4 years ago

Great! Just let me know

ons 6 maj 2020 kl. 19:41 skrev Danny Diaz notifications@github.com:

Simon,

I am currently going through my qualifying exams and have to present late next week. After that is over, I would love to work with you and adding cif compatibility. Let's talk some more on how to achieve this then?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mittinatten/freesasa/issues/48#issuecomment-624790612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVV4HBUKYODOSCYOKWAXCTRQGOLHANCNFSM4MTL3TUA .

mittinatten commented 3 years ago

Hi again, I've had a look at this again, and I think it would be pretty straightforward to add a PDBML parser, not sure if that would cover your needs? Libxml is already an optional dependency, so it won't add extra build/install complexity (keeping that part simple has high priority). I might have a go at that.

Since I'm not active in the field anymore, I have no idea which formats are most actively used, though. PDBML looks very verbose to me but it has a one-to-one mapping to CIF as far as I understand. XML has the advantage that we can use a parser that is thoroughly tested and optimised.

A similar argument could be made for mmJSON. The library json-c is also already an optional dependency.

There are also a couple of CIF-parsers out there written in C++ that would probably be feasible to integrate (like the one linked above). I'm not sure if the licenses are compatible with the MIT license that FreeSASA uses though. CIF is simple enough that writing a parser from scratch could also be possible, but out of scope for me at least.

mittinatten commented 3 years ago

I got a PoC for CIF-input working without much effort using GEMMI. There will be some work to make this robust and support all options etc, but it seems feasible.

PR here: https://github.com/mittinatten/freesasa/pull/51

danny305 commented 3 years ago

Simon, sorry for disappearing on you! I got caught up with some other projects and forgot to respond to this.

I am making my way back to working with FreeSASA and just saw that you utilized GEMMI for the cif parser. This is great because we utilize GEMMI as well in our codebase. I see that you have a pull request for the feature but you have not merged it yet. Is there anything I can do on my end to help with this?

mittinatten commented 3 years ago

Hi Danny, no problem! As you can see from the frequency of commits I have not had much time to work on this either. There is a checklist in the PR with remaining tasks.

Remaining command line options, for handling MODEL entries: The current implementation is pretty messy for PDB files (this is probably the ugliest part of the code), unfortunately. So it might be better to try to leverage GEMMI directly for this instead. It might also be possible to simply map the files to the internal representation and reuse the logic for regular PDB files, as I did for the chain logic. I will have to look into it a bit. Let me know if you want to have a go at implementing these, and I can try to point you in the right direction.
If you want to go ahead and do some tests with a larger set of PDBs to check that the application doesn't crash with weird files, as we discussed earlier, that would also be very helpful.
The other remaining tasks are more of an investigative nature, and some would require making the library a C++ library instead, which could potentially cause problems downstream, so that should probably be saved for a version 3. Converting the CLI to C++ doesn't affect any dependents, so that's pretty uncontroversial.

danny305 commented 3 years ago

So I do not know how to code in C and am decent at C++. I can begin looking at this more seriously next week. I will probably lean toward leveraging gemmi bc I am much more familiar with gemmi than C. I will have more questions for you when it comes to point 1 next week.

danny305 commented 3 years ago

I am trying to get the feature/cif branch to compile and I get the error ` $ autoreconf -i && ./configure && make && make install.

...
...
Making all in src. make[2]: No rule to make target main.c', needed bymain.o'. Stop.
make[1]: [all-recursive] Error 1. make: *** [all] Error 2. `

When I run this same command in master it runs without a problem.

I am not familiar with the autotool suite and have recently been learning how to use cmake so please bear with me. I am comparing the differences in the configure.ac and makefile.am between the branches and don't spot what could be preventing make from seeing main.c. Could you help me get the feature branch compiling so I can start trying to help?

danny305 commented 3 years ago

What I ended up doing is getting the feature/cif branch building with cmake instead of with your autotools build pipeline.

My cmake build is no where near as exhaustive as your autotools build is but it allows me to make changes to the feature/cif branch and build successfully. I have confirmed that it reproduces the output when I run $ ./freesasa 3nir.pdb

Ill take a look at what is needed to be done to help you finish this pull request these next couple of days.

danny305 commented 3 years ago

So I have ran the protein 3nir as a .pdb and a .cif and I get different outputs. Here they are below:

` $ ./freesasa 3nir.pdb

FreeSASA 2.0.3

PARAMETERS algorithm : Lee & Richards probe-radius : 1.400 threads : 2 slices : 20

INPUT source : 3nir.pdb chains : A model : 1 atoms : 327

RESULTS (A^2) Total : 2988.90 Apolar : 1976.87 Polar : 1012.03 CHAIN A : 2988.90

$ ./freesasa --cif 3nir.cif.

FreeSASA 2.0.3

PARAMETERS algorithm : Lee & Richards probe-radius : 1.400 threads : 2 slices : 20

INPUT source : 3nir.cif chains : A model : 1 atoms : 114

RESULTS (A^2) Total : 2748.49 Apolar : 1721.48 Polar : 1027.01 CHAIN A : 2748.49

`

Not sure why this is happening but thought I would share this information as I start digging into your cif/pdb source code.

mittinatten commented 3 years ago

Great to see you have started!

Regarding build problems: running make clean should solve that (the compiler adds some cache files that get confused when you switch from main.c to main.cc). Moving to CMake might not be a bad idea, but should probably be a separate project. I used autotools because that's what I knew, and it is installed on most *nix-environments, so it helped keep the set of explicit dependencies small.

Let me know what you find out about 1nir, I can also have a look over the weekend. I guess the different number of atoms are an important clue.

danny305 commented 3 years ago

I swore I thought I tried make clean this morning and it did not work. Its whatever, I had a blast getting your code to compile with CMake. I have your feature/cif compiling with cmake and ninja on vscode for quick development.

As for your source code: I am going through your cif.cc file trying to see how you process the atoms and see that you are using '?' for currentAltId. I counted how many atoms you are skipping to add to the cif structure and it adds up to 338.

I am not 100% sure of this yet; I need to spend more time looking at the 2 files (cif and pdb) carefully and referencing the cif standard. But I think the altLoc of the atom is not represented by ? when there is just 1 location but with a ( . ). I also think you might be using the wrong column. I think its column 4 not column 6 for altloc (you will notice all the (.)). I am not sure of this though. This is just my initial impression from comparing 3nir.pdb with 3nir.cif side by side.

Ill look at it some more over the weekend.

Also, I have been refactoring some of your cif.cc code to not use raw pointers but unique_ptr instead.

danny305 commented 3 years ago

Yeah. It seems like you are using the wrong column for the alternative location of atoms. check it out:

https://gemmi.readthedocs.io/en/latest/mol.html#pdbx-mmcif-format

mittinatten commented 3 years ago

Perfect, very good to have someone who knows the format better than me look at this. But, I thought that by calling block.find("_atom_site.", atom_site_columns) I defined which columns I wanted to extract, and their order, i.e. which column is which is decided by what's in atom_site_columns:

static const auto atom_site_columns = std::vector<std::string>({
    "group_PDB",
    "auth_asym_id",
    "auth_seq_id",
    "pdbx_PDB_ins_code",
    "auth_comp_id",
    "auth_atom_id",
    "label_alt_id",
    "type_symbol",
    "Cartn_x",
    "Cartn_y",
    "Cartn_z",
    "pdbx_PDB_model_num",
});

But the ? vs . problem is still there.

We should add 3nir to the test suite to catch regressions.

I haven't written any C++ in over 10 years, so all input is appreciated (I wasn't aware of the addition of unique_ptr to the library for instance). As you can see main.cc is just C with minimal modifications to make it valid C++. Not sure if it would make sense to turn it into more idiomatic C++. But I guess that should probably also be separate task.

Once you have some code you want to share, can you post it as a PR to the feature/cif branch? Possibly in several smaller PRs, if that makes sense (if you want me to test something, etc).

danny305 commented 3 years ago

You are right about the columns you code does a great job of finding the right ones. The bug was indeed the ? vs . and I have patched it. This patch results in the same output for the 3nir.pdb and 3nir.cif inputs.

I am going to submit a PR addressing this bug here shortly.

Moving forward:

I have a question for your get_chain function in cif.cc. I do not see where you use this function. Its a static function so it should be in the same implementation file. Can you explain to me what your plans are for this function?

So I was thinking it would probably be easier to map the protein to a gemmi doc object for both cif and pdb input files and then you can (from gemmi) write out either a pdb or cif file with the sasa values. This is just a thought let me know what you think.

Also, for the cif output file, you can could create a freesasa block or you can append SASA column to the atom_site block. Which direction were you planning on going with this?

Please let me know what you think I should focus on next. I am new to your source code and feel like I am everywhere.

mittinatten commented 3 years ago

I think get_chain was just an idea for implementing one of the chain-related command line options, that I ended up not using but forgot to clean up. So it can be deleted.

I agree that outsourcing the PDB-handling to Gemmi makes a lot of sense, but as I wrote earlier that would require converting the library to C++ (not only the CLI), which is a breaking change for other projects using the library. So my preferred roadmap would be to change as little as possible for now, for a 2.1 release, possibly with a 2.1-beta first. And then look into rewriting the PDB-parsing for a 3.0.

I haven't really worked with cif files much, are there any best practices for adding data to the files? Adding it to the atom_site block would be most similar to what is already done for pdb files, but we should then add a new column, not just hijack one like I did for the pdbs. There might be a need to add a separate block for the summed up and sequence level data as well.

mittinatten commented 3 years ago

Regarding what to focus on next, one small thing to start with is adding 3nir to the test data and add a test here: https://github.com/mittinatten/freesasa/blob/feature/cif/tests/test-cli.in#L217 (I wasn't very experienced with integration testing when I wrote that script, so it should also probably be rewritten using a proper test tool, at some point, but it's been very useful for catching regressions for now.)

The next thing are the --separate-models and --separate-chains CLI options. I guess the get_chain function could be useful for --separate-chains, but not sure yet how things should be wired (that's probably why I added it in the first place).

As I mentioned earlier, we should find some representative set of PDB codes (moderately large, say in the 1000s), where we can run the program with both input types and verify that the results are the same as a one off test. I'm not sure if I still have the scripts or the PDB codes I used when I did the same thing against Naccess 5 years ago, but if not I can look into setting this up, if you aren't very eager to :) Here we should check that we get the same end result (probably enough to compare total, polar and apolar) and the same warnings, if there are any.

When these features and tests are in place, we could probably merge a first version to the dev branch, and possibly to master, before we move on to cif output. To keep things organised I suggest we open a separate issue for cif output.

danny305 commented 3 years ago

Thanks for the direction Simon. I like your battle plan of not rewriting the entire codebase and focus on implementing these features first.. For me, its very important we incorporate reading cif files but more importantly writing cif files (this is what will feed into my ML code base).

I think a new block in the CIF file would be the best way to go because it doesn't break established parsers that are accustomed to reading the atom_sites block. I can talk to my team this weekend and see what they prefer/think is best. This is also a question that pdb2pqr is currently figuring out so it might be best to see what they do and imitate.

I can fix my PR and remove all the cmake stuff and just have one for the alt-loc patch. Im flying back from LA to Austin,TX tomorrow so will not get a chance to probably look at this till Sunday.

As for a test set. I have a diverse test set of pdb and cif files laying around (in the 10s of thousands) that I can subsample for you.

mittinatten commented 3 years ago

Great, I will look into setting up orchestration for the tests, in the mean time then.

I had a quick look into what is necessary to set up --separate-chains and --separate-models. In get_structures in main.cc we have

  // lines 269 to 271
  if ((state->structure_options & FREESASA_SEPARATE_CHAINS) ||
        (state->structure_options & FREESASA_SEPARATE_MODELS)) {
    structures = freesasa_structure_array(input, n, state->classifier, state->structure_options);
    // ...

which could probably be rewritten to

   if ((state->structure_options & FREESASA_SEPARATE_CHAINS) ||
        (state->structure_options & FREESASA_SEPARATE_MODELS)) {
     if (state->cif) {
       structures = freesasa_cif_structure_array(input, n, state->classifier, state->structure_options);
     } else {
       structures = freesasa_structure_array(input, n, state->classifier, state->structure_options);
     }
     // ...

and roughly in cif.cc

freesasa_structure**
freesasa_cif_structure_array(FILE *cif_file, int *n, const freesasa_classifier *classifier, int options) {
   const auto doc = gemmi::cif::read_cstream(input, 8192, "cif-input");
   if (options & FREESASA_SEPARATE_CHAINS) {
     if (options & FREESASA_SEPARATE_MODELS) {
       return get_array_of_all_chains_in_all_models(doc, *n, classifier);
     }
     return get_array_of_all_chains_in_first_model(doc, *n, classifier);
   } else if (options & FREESASA_SEPARATE_MODELS) {
     return get_array_of_all_models(doc, *n, classifier);
   } else {
     // should never happen, maybe an assertion error here
   }
}

The three new functions get_array_of_... (which probably can have better names), should then use gemmi to extract chains and models.

danny305 commented 3 years ago

The cli part in c/cpp is pretty intense; Im used to argparse in Python. I started reading about getopt and how it works to in order to digest what is going on in main.cc. Therefore, I will focus on writing the gemmi functions as I learn how the cli works in main.cc.

When it comes to the cif/pdb dataset. How do I get you these files?

mittinatten commented 3 years ago

Yeah, getopt doesn't exactly give the best developer ergonomics :). It's main advantage is that it's part of the GNU standard c lib (so everywhere except windows, basically). As far as I can tell, the only update needed in main.cc is the 3-4 lines I added above, option parsing for cif is already handled, so you can probably ignore that part for now.

For the test dataset, I think it's simplest if we store a list of pdb codes in the repository together with the test runner, for reproducibility. Storing the actual files in the code base would make the repo unnecessarily large. I am working on a test script, which will download the files needed, so everything can bootstrapped from the list. I'll use a smaller set while developing. Once it's up and running we can add your set to the same directory, unless there's a reason you don't want to share it with the Internet?

danny305 commented 3 years ago

I was looking at getopt it to see how to add --format=cif for writing cif files. But this is for later/separate issue so I got time to learn how that would be added to the cli and executed by the freesasa lib.

So I can't share my entire dataset of pdb codes but I can share a subset of like 1-3 thousand. I'll add it here in the next few days as a txt file.

mittinatten commented 3 years ago

Sounds good! Moving to C++ 14 forced me to update the build environment for the CI tests which in turn unearthed some other issues with the build configuration. The Travis setup hasn't been changed for a few years, so these changes were probably overdue anyway. So I'm a bit behind on the test runner, but should be able to have something later this week or next weekend.

danny305 commented 3 years ago

Yeah I know nothing about continuous integration and don't know what Travis really is. I am excited to learn about this entire aspect of software engineering!

mittinatten commented 3 years ago

CI gives a lot of confidence when making changes, given that you have a good test suite. The tests here are pretty awkward in a few places, and definitely don't follow best practices everywhere, but coverage is good and they have helped me uncover lots of regressions when I've made seemingly inconsequential changes.

mittinatten commented 3 years ago

I have merged your last PR and everything looks really good.

I ran a comparison between pdb and cif files for the set of structures you added. I haven't looked through all the data yet, but I fixed one error related to atoms with apostrophes in the name, as in nucleic acids:

ATOM   4084 O  "O5'" . DG  H 2 2   ? -16.078 -44.026 53.556 1.00 44.90 ? 2   DG  H "O5'" 1

The fix simply strips the quotation marks before adding the atoms to the structures: https://github.com/mittinatten/freesasa/pull/51/commits/cd8b44f6906c821f62068a30e077c110934515e5

Because these strings are passed as c_str() pointers, they have to belong in a scope outside the call to the freesasa_structure_add_cif_atom so I moved freesasa_structure_from_pred back into the loop. If I do the string manipulation in freesasa_structure_from_pred I have to allocate the modified string on the heap and then remember to free it outside of that function after it's been used. Maybe you know of some C++ wizardry that allows us to keep the logic in freesasa_structure_from_pred, without needing to explicitly free it?

danny305 commented 3 years ago

Good catch! DNA atoms are really important for our ML research.

I can submit a PR with some c++ wizardry 🙃

mittinatten commented 3 years ago

After fixing the above, we only get different results for a handful of pdb codes (see below). I only checked the first file, and the differences are of the order of 0.01-0.1%, so I'm not sure it's an actual error. One possible explanation could be differences in the resolution of the coordinates. I think a misclassification would cause a larger difference. I will have a closer look after the weekend, feel free to investigate if you have the time.

 [(PdbCode "4ye7", Error CalculationDifference);
 (PdbCode "3bkr", Error CalculationDifference);
 (PdbCode "5wrj", Error CalculationDifference);
 (PdbCode "2zsc", Error CalculationDifference);
 (PdbCode "3zdb", Error CalculationDifference);
 (PdbCode "1mrg", Error CalculationDifference);
 (PdbCode "1xoc", Error CalculationDifference);
 (PdbCode "3gyy", Error CalculationDifference);
 (PdbCode "3gnn", Error CalculationDifference);
 (PdbCode "5ir4", Error CalculationDifference);
 (PdbCode "3zvk", Error CalculationDifference);
 (PdbCode "5lby", Error CalculationDifference);
 (PdbCode "1xod", Error CalculationDifference);
 (PdbCode "6ayh", Error CalculationDifference);
 (PdbCode "4gbf", Error CalculationDifference);
 (PdbCode "4l9p", Error CalculationDifference);
 (PdbCode "1xi3", Error CalculationDifference);
 (PdbCode "3ro3", Error CalculationDifference);
 (PdbCode "4gdx", Error CalculationDifference);
 (PdbCode "5igi", Error CalculationDifference);
 (PdbCode "6bcm", Error CalculationDifference);
 (PdbCode "4peu", Error CalculationDifference);
 (PdbCode "3g5t", Error CalculationDifference);
 (PdbCode "2anv", Error CalculationDifference);
 (PdbCode "2i82", Error CalculationDifference)]

mittinatten commented 3 years ago

I committed the test runner to the repo. It works fine, but there's still potential to make the code prettier I think. What it does do nicely is utilize the really nice async capabilities of F# to do the tests in parallel.

mittinatten commented 3 years ago

The first atom of 4ye7.cif:

ATOM   1    N  N   . GLY A 1 3   ? 30.32151 15.29007 32.06752 1.000 77.69905 ? -1  GLY A N   1

The first atom of 4ye7.pdb:

ATOM      1  N   GLY A  -1      30.322  15.290  32.068  1.00 77.70           N

So I would say we can be pretty sure the tiny difference in the result here is due to a difference in coordinate resolution. I'll check the others too.

mittinatten commented 3 years ago

For 3bkr the number of atoms is different. They have a few altIds. Arg 42 has altIds B and C, but not A. So seems we can't assume there is always an A.

mittinatten commented 3 years ago

5wrj has a similar problem for the F chain as 3bkr. Seems this was a good input set :)

I've reached my limit for today, will continue down the list tomorrow. I also want to run tests with hetatms and hydrogens included, and some different settings for warnings/errors. But I think we're getting close!

danny305 commented 3 years ago

Interesting. I can see how to check for additional altIds if A is missing.

mittinatten commented 3 years ago

I fixed the altId code (just went back to the old logic, but with '.' instead of '?'). Now we're down to just a few differences. I will investigate during the weekend.

 [(PdbCode "4ye7", CalculationDifference);
 (PdbCode "3zdb", CalculationDifference);
 (PdbCode "3gnn", CalculationDifference);
 (PdbCode "3zvk", CalculationDifference);
 (PdbCode "6ayh", CalculationDifference);
 (PdbCode "4peu", CalculationDifference);
 (PdbCode "2i82", CalculationDifference)]

mittinatten commented 3 years ago

3gnn and 2i82 had an issue with guessing the radius of atoms in unknown residues. This was due to assumptions about whitespace that were invalid for CIF files. I have fixed that in the latets commit. The remaining files only differ in precision, so these are not actual errors.

This means that only documentation is missing before we can merge this.

mittinatten commented 3 years ago

I have updated docs now. So, unless you have any final thoughts, I think we can merge this to the dev branch, and then merge that to master pretty soon. I think I won't create a release for this until we have the output option. So I won't publish the new docs to the web page just yet.

danny305 commented 3 years ago

Ill pull down your branch and check it out this weekend. This is exciting!

mittinatten commented 3 years ago

I discovered one difference when using --hetatm, turns out there was a bug in the code that detects wether something is a hydrogen, it was a bit too simple. It only affected PDB files, so the CIF feature helped find it.

The reason this is a function at all is that some files don't have any symbol registered (probably only old files, and maybe some have been repaired). You have to fallback to guessing the element from the atom name in those cases. I had messed it up and did some atom name logic before the symbol, but it only ever caused an error for a few exotic elements: those that have 'H' or 'D' as their second character, such as Cd and Th ('D' because the code considers Deuterium the same as Hydrogen).

There are still quite a few discrepancies for the --hetatm option and also some of the other options, in the set you provided, so I'll try to fix them before we merge.

mittinatten commented 3 years ago

The only 2 remaining problematic files for --hetatm are 4uhc and 2vq3. Both have alternative conformations, so there might still be some edge case there that are handled differently when reading the two file formats.

There are no errors for --hydrogen.

Still a large some discrepancies for --separate-chains and --separate-models, but might be due to differences in ordering.

mittinatten commented 3 years ago

We can ignore 4uhc and 2vq3, the differences in the results are due to minor differences in the files, which I think it is not important to handle. Basically both the PDB versions have some oddities that have been fixed in the CIF version (wrong ordering of alternative conformations for the first one and spurious extra chain label in the second).

mittinatten commented 3 years ago

For separate-chains I found a bug in the CIF-implementation. An example is 5dx9, it has all its ATOMs in chain A and all its HETATMs in chain B. If we use --separate-chains with the PDB file chain B is filtered out before we separate the chains, so it is reported to only have one chain. For the CIF file we get the chains from the whole structure. Not sure what's the best way to solve this. I added a failing test to the suite to demonstrate the error.

danny305 commented 3 years ago

Nice catch! I'll take a look at this when I get a chance. I am preparing for 2 presentations this Friday so I have been kinda preoccupied.

danny305 commented 3 years ago

So I just checked out 5dx9. So not all the HETATMs are in chain B. The hetatms for the oligosaccharide (2-sugars in this case) are chain B and the remaining hetatms (Mg ion, beta-mercaptoethanol, and water) are a part of chain A. When I run it with --separate-chains --hetatm I get different freesasa values for the hetatms that are a part of chain A only (hetatms that are part of chain B match the pdb output). Im assuming this is the error you are talking about: the freesasa values for the hetatm of chain A do not match between the cif and pdb output.

I do not know what to do about this. Which ones are right? the cif ones or the pdb ones? How would this error occur? Aren't the freeSASA calculations the same for both? I thought the only difference was how the data is read in to build the freesasa_structure; once you have a freesasa_structure the calculation should be the same. At least that's my mental model of the situation.

mittinatten commented 3 years ago

Right, so the problem appears when you exclude hetatms, i.e. run without --hetatm. Then there is one chain that has 0 atoms. When reading from the .pdb-file the code reports only one chain, and when reading from the .cif-file we get two chains, but one is empty. It's not completely obvious what is the most intuitive behavior here, but I think it makes sense to stick with what's there from before to avoid breaking the interface. One simple solution would be to filter out freesasa_structures with 0 atoms when we separate chains. (I think that doesn't cause trouble elsehwere)

danny305 commented 3 years ago

Alright I implemented the removal of chains lacking polymer atoms. They will be filtered out in the process of creating a structure array when the --separate-chains options is provided. Im about to push. Let me know if this does indeed not cause trouble elsewhere.

mittinatten commented 3 years ago

🎉 https://github.com/mittinatten/freesasa/releases/tag/2.1.0-beta

mittinatten / freesasa

Support for CIF format #48

FreeSASA 2.0.3

FreeSASA 2.0.3