rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.

Out of memory when indexing a large fragment file #6

Closed ValeryPolyakov closed 5 years ago

ValeryPolyakov commented 5 years ago

When I try to execute the mmpdb index command on a large fragments file (>5M structures), Python runs out of memory even on a very large Linux node with more than 700 GB of RAM.

Can anything be done to process such big databases?

Thanks.

KramerChristian commented 5 years ago

Hi Valery,

With the code as it currently stands, I do not think that can be done. You would probably have to rewrite parts of the indexing code to make it more memory efficient, use smaller environment radii, use smaller fragments, etc. If you are interested in developing the code so that it can handle datasets as large as yours, I can give you some more ideas of what can be done. Andrew most probably also has ideas of how to do it, and he may be available as a consultant to work on the code.

Best regards, Christian

ValeryPolyakov commented 5 years ago

Hi Christian,

Thanks for your reply. I need to see whether I have the bandwidth to do something like this. Do you have a means of appending structures to an existing database?

Valery

adalke commented 5 years ago

As I recall, that code is pretty well optimized in terms of both performance and memory. Any changes may require a special C extension. Without access to the actual data, to see what is taking up the space, it's hard to judge.

You might consider changing some of the fragmentation parameters to reduce the number of fragments you generate, and/or some of the indexing parameters to minimize the index size. At the very least you could help pin down some scaling factors.
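
As a rough illustration of the kind of parameter tightening meant here (a sketch only — option names are taken from the fragment and index --help output and may differ between versions, file names are placeholders, and mmpdb is assumed to be on the PATH):

# fewer fragments: single cuts only and a smaller molecule size limit
mmpdb fragment compounds.smi --num-cuts 1 --max-heavies 70 -o compounds.fragments

# smaller index: limit the number of heavy atoms allowed in the variable part
mmpdb index compounds.fragments --max-variable-heavies 8 -o compounds.mmpdb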

There is no way to append structures to the database. In some sense that would mean keeping the entire index in the database, in case a new pair can be made. It's probably not that hard to add new matches to an existing pair, if all you are interested in is increasing the statistics of previously known pairs.

For what it's worth, the largest Amazon EC2 node has 3,904 GiB of memory, or about 5× what you are currently using. It might be worthwhile to test there, especially if you have already worked out your scaling factors, so you can estimate whether that is even a reasonable attempt.

KramerChristian commented 5 years ago

I agree with Andrew; we deliberately designed it without any function to update the database. We would rather recalculate the entire database from scratch with the new data. If you want to use datasets as large as the one you mentioned, you could also design a "lazy" MMP algorithm that only indexes and calculates statistics for a given query. Queries would probably be slower, and you would not be able to exchange statistics without revealing the underlying molecular structures, but it should work for databases of the size you have.

ValeryPolyakov commented 5 years ago

Thanks everyone. Let me think about it.

ValeryPolyakov commented 5 years ago

Hi Christian,

In what file do I find your indexing code?

Thanks,

Valery

KramerChristian commented 5 years ago

Hi Valery,

The largest part of it is in the file indexing.py. However, to really understand it, I think you have to go through some other files as well, starting from commandline.py and do_index.py. As a little warning upfront, it will be rather complex...

Bests, Christian

ValeryPolyakov commented 5 years ago

Thanks. Let me at least take a cursory look at it.

KramerChristian commented 5 years ago

Hi Valery,

I just pushed a new version of mmpdb to GitHub that allows reducing the database size, enabling larger datasets. To do that, I basically added three options:

--max-radius during indexing: The default is 5, but you can reduce it to lower values, down to 0. This significantly reduces the DB size.

--min-heavies-per-const-frag: Double- and triple-cuts can generate odd fragmentations where one constant part consists of only 1 or 2 atoms. If you use this new option during fragmentation, you can remove those fragmentations. Note that you will typically still find the pairs, since they are also linked by a single-cut transformation (if your max-heavies option allows that).

--smallest-transformation-only: If you use this option during indexing, mmpdb will discard transformations that can be reduced to smaller transformations. For example, p-F-phenyl >> p-Cl-phenyl will not be added to the database, since it can be reduced to F >> Cl.
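
As a rough sketch of how these options might be combined in practice (file names are placeholders; check the --help output of your mmpdb version for the exact spelling):

# fragmentation: drop double/triple cuts that leave a constant part of only 1-2 atoms
mmpdb fragment compounds.smi --min-heavies-per-const-frag 3 -o compounds.fragments

# indexing: no environment beyond radius 0, and keep only the smallest transformations
mmpdb index compounds.fragments --max-radius 0 --smallest-transformation-only -o compounds.mmpdb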

In case you are still interested in mmpdb on larger datasets, I would be happy about your feedback on these new options.

Best regards, Christian

ValeryPolyakov commented 5 years ago

Hi Christian,

Thanks for telling me. I will try it ASAP. I really need this option. I will e-mail you if I have any questions.

ValeryPolyakov commented 5 years ago

Hi Christian,

I am still testing the new options, and it seems to me that they partially help. I say partially because the indexing command can now read all of the fragments, whereas before it choked at about 10% of them, but I am still getting an error during the indexing phase. The error says that it is an SQL error and that there is not enough space on the disk, which is not correct; I have plenty of disk space. I am using the following new options:

--max-radius 0 and --smallest-transformation-only

I also have a couple of questions:

  1. What are the acceptable values (the default, and what you recommend for size reduction) for --min-heavies-per-const-frag?
  2. How do you treat qualified data in your database? Currently we just remove it, since you never know ahead of time which compounds it will be paired with. Can you suggest a better solution?

Thanks a lot,

Valery

KramerChristian commented 5 years ago

Hi Valery,

can you add the error message, and ideally give me some more information about the dataset size that you want to index (number of compounds) and your system RAM size? I would like to better understand and reproduce the error you get. I guess that the error is related to running out of RAM, rather than running out of disk space.

Another test you could run is to write the output to a .csv file instead of a .mmpdb DB. If it is a SQL error that is RAM related, this should work because creating the .csv output uses much less RAM.
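
For example, something along these lines (a sketch; the file names are placeholders and the CSV writer is assumed to be selected from the .csv extension):

mmpdb index compounds.fragments -o pairs.csv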

About the other questions: 1.) The default for --min-heavies-per-const-frag is 0. If you want to use this, you have to provide a number. To me, 3 made sense: this removes double- and triple-cuts where a single halogen or a methoxy is split off.

2.) Qualified data. At this point in time, I would also remove it. If you keep it and ignore the sign, the calculated statistics no longer make much sense. To handle qualified data, one would have to add different types of statistics that are not based on normal distributions but rather just evaluate how often a property changes up or down for a given transformation. This is currently not in the code, but it would be a nice feature to add. I cannot promise, however, that I can do that in the near future.

Bests, Christian

ValeryPolyakov commented 5 years ago

Thanks, Christian. I will try to capture the error message. By the way, I do the processing on a 1.5 TB node, but it is 5.5M compounds.

ValeryPolyakov commented 5 years ago

Hi Christian,

Here is the error: "sqlite3.OperationalError: database or disk full"

Also, I see a problem when I use the following command line for fragmentation; I did not observe it when processing without --min-heavies-per-const-frag 3:

python ../mmpdb fragment AllMagmaComp_08_2018.smi -o MAGMA1.fragment --min-heavies-per-const-frag 3

I am getting an error message regardless of what compound is in the first line (I masked the SMILES with Xs):

Preparing record 5441014
[18:42:48] Conflicting single bond directions around double bond at index 6.
[18:42:48] BondStereo set to STEREONONE and single bond directions set to NONE.
[18:42:48] Conflicting single bond directions around double bond at index 34.
[18:42:48] BondStereo set to STEREONONE and single bond directions set to NONE.
Failure: file 'AllMagmaComp_08_2018.smi', line 1, record #1: first line starts 'XXXXXXXXXXXXXXXXXXXXXX CG ...'
Traceback (most recent call last):
  File "../mmpdb", line 11, in <module>
    commandline.main()
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 1003, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 178, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 549, in fragment_command
    writer.write_records(records)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/fragment_io.py", line 404, in write_records
    for rec in fragment_records:
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 464, in make_fragment_records
    fragments)
UnboundLocalError: local variable 'fragments' referenced before assignment

adalke commented 5 years ago

This is an odd error message. The text

Failure: file 'AllMagmaComp_08_2018.smi', line 1, record #1: first line starts 'XXXXXXXXXXXXXXXXXXXXXX CG ...'

comes from do_fragment.py

            except Exception:
                # Something unexpected happened.
                # Give some idea of what failed.
                reporter.update("")
                print("Failure:", where, file=sys.stderr)
                raise

It should re-raise the exception (because of the 'raise' on the last line of that snippet), which would report the entire error message.

Instead, in your message you write that it reports:

 File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 464, in make_fragment_records fragments) UnboundLocalError: local variable 'fragments' referenced before assignment

This comes from the code immediately after the 'except Exception' block.

It's as if the 'raise' line was commented out or otherwise removed.

The full code should look like (starting around line 454):

            except Exception:
                # Something unexpected happened.
                # Give some idea of what failed.
                reporter.update("")
                print("Failure:", where, file=sys.stderr)
                raise

            yield fragment_types.FragmentRecord(
                id, input_smiles,
                record.num_normalized_heavies, record.normalized_smiles,
                fragments)

Has it changed any in your code?

KramerChristian commented 5 years ago

Hi Valery,

to help me understand the errors you are getting, can you tell me whether this is one error or two different errors? I am asking because I first understood that you get an error during indexing, but now it seems you are also getting an error during fragmentation.

Does the sqlite error message come during indexing? It would be surprising if it came during fragmentation, since that does not involve sqlite. At the moment, I can think of two reasons: 1.) The SQLite database becomes too big, as discussed at https://stackoverflow.com/questions/16685016/sqlite3-operationalerror-database-or-disk-is-full-on-lustre. This error is thrown if the SQLite DB becomes too big, independent of the free disk space. Can you check the size of the SQLite DB when the code stopped working? There seems to be a workaround, as indicated in the post. However, it would need to be added at several places in the code, so I would first like to know whether this really is the issue here. (Even if it is, I wonder whether you will run into other memory-related issues when working with a dataset of 5.1 million compounds - we shall see.)

2.) You run out of RAM: your system RAM sets an upper limit on the dataset size during indexing. While indexing into the .mmpdb database, dictionaries for the rules and rule environments are generated; if those grow too big, Python runs out of RAM. Where exactly this limit lies depends on your dataset and the settings/filters you use for indexing. I would recommend you play with the settings and find those limits. Writing the pairs to a .csv file, however, should work, since that does not build the dictionaries needed for the .mmpdb database. Can you test whether you can write the pairs to a .csv file? This would in any case be a good test to run, because it will give you an estimate of the number of pairs you will generate (and the DB size you will need).
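
Regarding the first possibility, a quick way to see the size limits SQLite is working with (a sketch, assuming the partially built .mmpdb file is still on disk and that the sqlite3 shell is installed) is to open it and query the page pragmas:

sqlite3 compounds.mmpdb
sqlite> PRAGMA page_size;       -- bytes per page
sqlite> PRAGMA page_count;      -- pages currently in the file
sqlite> PRAGMA max_page_count;  -- page_size * max_page_count = maximum file size SQLite will allow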

The second error you mention I believe occurs during fragmentation. Is this true? I fragmented one of my test files with the settings you used, but that runs through smoothly. Can you create a case that fails and that you can share?

Bests, Christian

ValeryPolyakov commented 5 years ago

Hi Andrew,

You are absolutely right. I had to comment out that raise statement. It was causing an exception on one molecule during the first pass. Unfortunately, I cannot disclose this molecule, but it is really ugly and should be skipped.

ValeryPolyakov commented 5 years ago

Hi Christian,

The error comes up during indexing, after all fragments are read. The message says that it starts indexing, and it fails after an hour or two. The database becomes big; I cannot tell the exact size, because it is resized at some point. I will try to reproduce this again and maybe I can capture the moment.

It does not look like I have a RAM issue. The node has 1.5 TB and I am reserving 1.2 TB.

Valery

adalke commented 5 years ago

If a structure causes a problem at that point then it likely indicates a bug in RDKit. The input structure is parsed and cleaned up at this point, to give a new canonical SMILES.

I don't need to see the SMILES. What's missing is the full traceback, which you eliminated by removing the "raise". That would pinpoint the failing function. Would you re-run with the unmodified code and report the traceback?

ValeryPolyakov commented 5 years ago

Hi Andrew,

Here is the error message; I removed the actual SMILES.

Failure: file 'AllMagmaComp_08_2018.smi', line 1, record #1: first line starts 'XXXXXXXXXXXXXXXXXXXX CG ...'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/prog/anaconda/3.5-4.0.0/envs/3.6-cadd-devel/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 377, in _as_list
    return list(method(normalized_mol, fragment_filter, num_normalized_heavies))
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/fragment_algorithm.py", line 721, in fragment_mol
    for x in _fragment_mol(mol, fragment_filter, num_heavies):
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/fragment_algorithm.py", line 757, in _fragment_mol
    key = fragmentation.get_unique_key()  # XXX + "012" + YYY
AttributeError: 'NoneType' object has no attribute 'get_unique_key'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "../mmpdb", line 11, in <module>
    commandline.main()
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 1003, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 178, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 549, in fragment_command
    writer.write_records(records)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/fragment_io.py", line 404, in write_records
    for rec in fragment_records:
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_fragment.py", line 447, in make_fragment_records
    fragments = result.get()
  File "/usr/prog/anaconda/3.5-4.0.0/envs/3.6-cadd-devel/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute 'get_unique_key'

adalke commented 5 years ago

Okay, I think I know the problem. Christian added some code on 14 Jan 2019 which included a "yield None" at https://github.com/rdkit/mmpdb/commit/901674ed3bdaf9031ea6a6210ed22e02267ce33c#diff-c040b7b07a919f39b9e763cbacf45a1eR582 . It was not needed, and it would cause the problem you have.

I think that's the time when you updated your copy of mmpdb.

I caught that in code review. Christian fixed it on 5 Feb 2019.

So the solution to this specific problem might be as simple as updating to the most recent code, which contains that fix.

You can check whether you have the most recent version by looking for "yield None" in mmpdblib/fragment_algorithm.py. If it isn't present, then your copy is from before 14 Jan. If it is present but commented out, then your copy is from after 5 Feb and is the most recent version. If it is present and not commented out, then you have the version with the bug.
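
For example, from the top of the mmpdb checkout (a hedged example, assuming a Unix-like shell with grep available):

grep -n "yield None" mmpdblib/fragment_algorithm.py

No match means a pre-14 Jan copy; a match on a commented-out line means the fixed version; an uncommented match means the version with the bug.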

ValeryPolyakov commented 5 years ago

Andrew and Christian,

I was able to successfully run fragmentation using the following command:

python ../mmpdb fragment AllMagmaComp_08_2018.smi -o MAGMA.fragment --min-heavies-per-const-frag 3

It indeed reduces the size of the database by a factor of 3.

However, the indexing failed on the new fragments file. Here is the error message:

[11:29:40] BondStereo set to STEREONONE and single bond directions set to NONE.
Building index ...
Failed to execute the following SQL:
CREATE INDEX rule_environment_rule_id on rule_environment (rule_id);
Traceback (most recent call last):
  File "../mmpdb", line 11, in <module>
    commandline.main()
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 1003, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/commandline.py", line 342, in index_command
    do_index.index_command(parser, args)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/do_index.py", line 205, in index_command
    pair_writer.end(reporter)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/index_algorithm.py", line 1182, in end
    self.backend.end(reporter)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/index_writers.py", line 215, in end
    schema.create_index(self.conn)
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/schema.py", line 133, in create_index
    _execute_sql(c, get_create_index_sql())
  File "/home/polyava1/PythonCode/mmpdb22/mmpdblib/schema.py", line 119, in _execute_sql
    c.execute(statement)
sqlite3.OperationalError: database or disk is full

bodowd commented 5 years ago

Hello,

I am running into the same error.

My dataset has ~300,000 molecules. I can run all the way to the "mmpdb index" command before I get the "sqlite3.OperationalError: database or disk is full" error.

I ran index to a ".csv" file and that worked. When I try to run it to ".mmpdb" is when the error appears.

I also ran the mmpdb index successfully, without error, on a 20k subset of my data.

KramerChristian commented 5 years ago

Hi Valery and bodowd,

I am not quite sure what exactly causes the sqlite3 OperationalError. From what I read on it so far, there can be 2 reasons: 1.) The SQLite DB has exhausted all pages available to it, as described here: https://stackoverflow.com/questions/16685016/sqlite3-operationalerror-database-or-disk-is-full-on-lustre

or 2.) The temporary directory that SQLite uses is full, as described here: https://recursiveramblings.wordpress.com/2014/01/23/working-around-sqlite3-operationalerror-database-or-disk-is-full/

To figure out what causes the issue, I need your help. Issue 2 seems to be easier to track down, so let's start with that one. According to this page (https://sqlite.org/tempfiles.html), the temporary directory for SQLite databases can be set using the environment variable SQLITE_TMPDIR, if it is not defined within SQLite.

Could you start sqlite in the shell (>sqlite3) and type

"pragma temp_store_directory;"

Unless you use an old version of SQLite, this should give you no result, since the pragma is deprecated. If you get no result, could you exit sqlite and next try setting the SQLITE_TMPDIR variable to the directory you are writing your SQLite DB into? Under Linux using bash, with Valery's working directory being "/home/polyava1/mmpdb_test/", that would be

export SQLITE_TMPDIR="home/polyava1/mmpdb_test/"

before you run mmpdb in the same shell.
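
Put together, the sequence would look something like this (a sketch; the directory and file names are placeholders, and a bash-like shell with mmpdb on the PATH is assumed):

export SQLITE_TMPDIR="/path/with/enough/space"
mmpdb index compounds.fragments -o compounds.mmpdb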

Could you give that a try and let me know whether it solves the issue? We may run into other size-related issues with the 5 million compound database, but 300K molecules should work.

Bests, Christian

bodowd commented 5 years ago

Hi Christian,

setting the SQLITE_TMPDIR as you instructed worked. Index runs smoothly now for me.

Thank you

ValeryPolyakov commented 5 years ago

It will take a little longer for me, but I am doing it now.

KramerChristian commented 5 years ago

Hi Valery,

does setting the SQLITE_TMPDIR variable solve the problem for you? If yes, I would like to close this issue.

Bests, Christian

ValeryPolyakov commented 5 years ago

Hi Christian,

I still have not had a chance to test it. The large nodes on our cluster are not available for some reason; I am in the queue to get one of them.

Valery

ValeryPolyakov commented 5 years ago

Hi Christian,

I was finally able to test your suggestion for setting SQLITE_TMPDIR. It works even for a very large database!

I really appreciate your help.

Valery

KramerChristian commented 5 years ago

Thank you for your feedback, Valery. Since setting the SQLITE_TMPDIR environment variable to a folder with enough space appears to solve the problem, I will close this issue.

ValeryPolyakov commented 5 years ago

Cool. Thanks.
