Open samirelanduk opened 6 years ago
In the past I have successfully managed to get a speedup from processing PDB files in parallel using C++. The way to go, in my opinion, is as follows:
2a should be faster than 2b but you need to be careful with deadlocks and thread safety, which can be a pain to debug! In any case step 2 is where I think you could gain from parallel processing.
I am happy to help with this!
my two cents: the PDB and mmCIF parsers could be made 1-2 orders of magnitude faster, although not in pure Python. Then the parallel processing would not be needed. You may have a look at https://github.com/project-gemmi/mmcif-benchmark
Hi - thanks for the benchmark link. I hadn't seen this before and it will be very useful.
atomium 0.12, currently under development and hopefully out in the next few days, does have large speed increases, though still in pure Python (see this tweet). Moving to compiled code is a medium-term goal for this library.
The multiprocessing library could speed up parts of the PDB parsing process - especially those parts that are just processing thousands of records.
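As a rough illustration of that idea, here is a minimal sketch of farming out ATOM/HETATM record parsing to worker processes with `multiprocessing.Pool`. It is not atomium's API: `parse_atom_line` and `parse_atom_lines` are hypothetical names, the field slices follow the standard fixed-width PDB column layout, and the `chunksize` default is an arbitrary guess. Whether this actually beats a serial loop depends on the file size, since starting processes and pickling records has its own cost.

```python
from multiprocessing import Pool


def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM/HETATM record into a dict (hypothetical helper)."""
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


def parse_atom_lines(lines, processes=1, chunksize=5000):
    """Parse all ATOM/HETATM lines, optionally across several processes.

    With processes <= 1 this falls back to a plain serial loop, which is
    often faster for small files.
    """
    atom_lines = [l for l in lines if l.startswith(("ATOM", "HETATM"))]
    if processes <= 1:
        return [parse_atom_line(l) for l in atom_lines]
    # Pool.map preserves input order; chunksize batches lines per task
    # to reduce inter-process communication overhead.
    with Pool(processes) as pool:
        return pool.map(parse_atom_line, atom_lines, chunksize=chunksize)


if __name__ == "__main__":
    # Build a synthetic record so the column positions are explicit.
    template = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}"
    line = template.format("ATOM", 1, " N", "MET", "A", 1, 27.34, 24.43, 2.614)
    records = parse_atom_lines([line] * 20000, processes=2)
    print(len(records), records[0]["name"], records[0]["x"])
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter. Timing the serial and parallel paths on a real file is the only reliable way to tell whether the extra processes pay off.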