ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Logging to non-relational database #519

Open jlibovicky opened 7 years ago

jlibovicky commented 7 years ago

I was thinking about the way we log our experiments, and I think we should be able to mine much more information from the training process so that we can understand it better. However, that would require logging much more than we currently do (changes in attention during training, visualizations of attention over images, structured output), and many of these cannot easily be printed to the console.

What I suggest is logging into a non-relational database and displaying the logs using either a console client (which would show the log in its current form) or a web client, which could do some clever search.

Once logging is done this way, we can take advantage of a shared file system and run validation and model assessments (embedding evaluation, ...) in a different process on a different machine, writing into the same TF event file and the same database.

What do you think?

tomasmcz commented 7 years ago

Why a non-relational database?

jlibovicky commented 7 years ago

We do not know in advance what the structure and size of what we log will be (plain text, vector graphics, bitmap images, maybe HTML snippets), and I am not sure whether we can make a reasonable estimate of the record sizes while designing a database schema. Another advantage is that in the end we can have dumps in JSON/whatever format distributed in the experiment directories, which the database can easily access (similarly to TF event files). The only SQL database I have ever worked with is PostgreSQL, and I don't think this would be easily doable with it. With a schema like (e.g. for validation): (time stamp, vector of some metrics results, serialized outputs of runner), we would not make much use of the SQL structure.
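
To make the idea of schemaless records dumped into the experiment directory more concrete, here is a minimal sketch using only the Python standard library. It is not existing Neural Monkey code: the function names `log_record` and `read_records`, the file name `log.jsonl`, and the field names are all hypothetical, and the one-JSON-object-per-line layout is just one possible dump format.

```python
import json
import time


def log_record(log_path, kind, **fields):
    """Append one schemaless record as a single JSON line and return it.

    `kind` and the keyword fields are hypothetical; nothing here fixes
    which properties a record must have.
    """
    record = {"time": time.time(), "kind": kind}
    record.update(fields)
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record


def read_records(log_path):
    """Load every record from the JSON-lines dump, oldest first."""
    with open(log_path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]
```

A validation step could then call something like `log_record("log.jsonl", "validation", metrics={...}, outputs=[...])`, while another record in the same file carries an SVG string or an HTML snippet instead. No schema change is needed when a new kind of record appears.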

However, I am not a database expert, so I might be wrong, maybe you know better. Would you prefer an SQL database? Do you think logging using a database is a good idea in general?

jindrahelcl commented 7 years ago

I think non-relational databases are a better choice for this kind of thing, mainly because of the structure argument: you don't want to have the structure fixed over time. Instead, you want to be able to add/remove properties that you log (columns/tables in the relational DB, if you wish) without any fuss.

The only operations you would want to perform on these databases are adding new documents (rows in relational DB lingo) and some kind of full-text search.

I think it is a good idea. Do you know any tools that would make implementing the logging side easier?
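
Those two operations can be sketched in a few lines of Python. This is only a toy in-memory illustration, not a proposal for the actual backend: the class name `DocumentLog` and its methods are hypothetical, and the "full-text search" is a naive case-insensitive substring match over each document's JSON serialization (so it also matches key names).

```python
import json


class DocumentLog:
    """Toy append-only document store with naive text search."""

    def __init__(self):
        self.documents = []

    def add(self, document):
        """Append a JSON-serializable dict and return its index."""
        self.documents.append(document)
        return len(self.documents) - 1

    def search(self, query):
        """Return documents whose JSON text contains `query`,
        compared case-insensitively."""
        needle = query.lower()
        return [
            doc for doc in self.documents
            if needle in json.dumps(doc, ensure_ascii=False).lower()
        ]
```

Documents with different sets of properties can sit side by side, which is the point of not fixing the structure. A real document database would replace the linear scan with a proper text index.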

Also, do we create one central database for our experiments, or will each user have their own database? It could be useful to search for some behavioral patterns in others' experiments, too. In my wild imagination, I'm thinking of a git-like system with remotes and stuff... I should stop now... :-)