jasmineRepo / JAS-mine-core

JAS-mine maintains and develops the JAS simulation platform, a discrete-event tool-kit for agent-based and dynamic microsimulation modelling. This repository contains the core libraries. See www.jas-mine.net for more details.
2 stars 5 forks source link

Alignment changes #19

Closed vkhodygo closed 2 years ago

vkhodygo commented 2 years ago

This is a cumulative update, I work on the alignment at the moment and I need some feedback before I start doing anything else. @pbronka I saw comments in the code regarding legacy methods, but there is no clear indication of this anywhere. Do you know which exactly alignment procedures are not used anymore?

As I move certain common code blocks to independent interfaces and classes, less code is easier to work with (that includes writing tests and documentation).

This is a WIP PR, it is to be updated after your initial review (please, note, that there are other commits that I pushed to this branch some time ago).

pbronka commented 2 years ago

Thank you for opening the pull request, it sure is a lot of changes! I will need some time to have a look – please let me know if there are any crucial parts that I should comment on and I will prioritise them. In terms of alignment, in LABSim we mostly use ResamplingAlignment and a custom version of iterative proportional fitting algorithm (for “SBAM” matching) that could probably be replaced with a version based on LogitScaling. We also use IterativeRandomMatching and IterativeSimpleMatching for matching purposes. These are the most important for LABSim and it may be worth prioritising them but I would not remove other methods for alignment / matching that are already in JAS-mine.

vkhodygo commented 2 years ago

HI @pbronka

Thanks for your feedback.

it sure is a lot of changes

Most of them are of cosmetic nature:

I will need some time to have a look

No worries, take your time.

any crucial parts

None in general.

I need to know only one thing at the moment: which alignment classes here are not in use anymore (legacy) and could be removed without any serious consequences: https://github.com/jasmineRepo/JAS-mine-core/blob/fa07b21137cbdd71af841354d2bdf0aa83bd9b3c/microsim-core/src/main/java/microsim/alignment/multiple/AbstractMultiProbabilityAlignment.java#L12

As far as I understand LogitScaling can be considered a superior algorithm, at least when it comes to mean adjustment, however, my knowledge is kind of superficial.

The logit binary scaling is removed since there is a proper version of the same numerical scheme for multiple outcomes. This simplifies testing and further support.

pbronka commented 2 years ago

HI @pbronka

Thanks for your feedback.

it sure is a lot of changes

Most of them are of cosmetic nature:

  • removing dead and outdated chunks;
  • refactoring code - that includes splitting and merging it - to make tests and future performance updates (if any) easier

I will need some time to have a look

No worries, take your time.

any crucial parts

None in general.

I need to know only one thing at the moment: which alignment classes here are not in use anymore (legacy) and could be removed without any serious consequences:

https://github.com/jasmineRepo/JAS-mine-core/blob/fa07b21137cbdd71af841354d2bdf0aa83bd9b3c/microsim-core/src/main/java/microsim/alignment/multiple/AbstractMultiProbabilityAlignment.java#L12

As far as I understand LogitScaling can be considered a superior algorithm, at least when it comes to mean adjustment, however, my knowledge is kind of superficial.

The logit binary scaling is removed since there is a proper version of the same numerical scheme for multiple outcomes. This simplifies testing and further support.

Logit Scaling Binary Alignment and its weighted version were used in LABSim but are currently commented out and resampling alignment is used. If they were to be brought back this will need to be updated to the (Multiple) Logit Scaling then. It could be good to test that it actually works with a binary outcome. There are other models that might be using it, but they should be able to still access it in the current version using the current release of JAS-mine.

I looked through the commits and it looks good to me, thank you for all the work on this. Should we try to merge this in and test if it works with LABSim?

vkhodygo commented 2 years ago

Hi @pbronka

Thanks for your feedback.

It could be good to test that it actually works with a binary outcome.

I can add the example used in the original paper to the tests to show that it works fine with binary systems. It must though, the difference between implementations was in variables only. The binary one used two doubles to store variables, whereas its multiple counterpart employed arrays for that.

Should we try to merge this in and test if it works with LABSim?

It's a bit too early, this version uses newer libraries, there are some API differences as well. I'm working on file export at the moment, the database module has to be rewritten from scratch I'm afraid. The package we use for this has been changed drastically, they explicitly state there is no way to migrate certain parts of the existing code.

I remember you had some database performance issues before. This is a good opportunity to fix this, could you remind me what the problem was originally?

pbronka commented 2 years ago

Hi @pbronka

Thanks for your feedback.

It could be good to test that it actually works with a binary outcome.

I can add the example used in the original paper to the tests to show that it works fine with binary systems. It must though, the difference between implementations was in variables only. The binary one used two doubles to store variables, whereas its multiple counterpart employed arrays for that.

I see. If it's not a problem, it does sound like a nice test case / usage example to me.

Should we try to merge this in and test if it works with LABSim?

It's a bit too early, this version uses newer libraries, there are some API differences as well. I'm working on file export at the moment, the database module has to be rewritten from scratch I'm afraid. The package we use for this has been changed drastically, they explicitly state there is no way to migrate certain parts of the existing code.

I remember you had some database performance issues before. This is a good opportunity to fix this, could you remind me what the problem was originally?

Could you tell me more about what you are working on in terms of the database? That sounds very interesting as the performance was not very good indeed. When the database is created in LABSim, this is done by SQLdataParser class, which parses CSV files from InitialPopulations and EUROMOD donor population to create database tables: Household, Benefit unit, Person for each possible starting year, and DonorHousehold and DonorPerson. (At the moment 23 tables). Changing connection settings for the database (in LABSimStart)(disabling log, increasing cache: LOG=0;CACHE_SIZE=2097152;AUTO_SERVER=TRUE) helped with performance, but rebuilding all database tables still takes upwards of 5 minutes, which seems a bit excessive. However, I'm not super familiar with SQL so it's possible that a) this is normal, b) this parser is inefficient and this is a LABSim and not JAS-mine issue.

vkhodygo commented 2 years ago

I work on the DatabaseUtils class and CSV writer as well. It looks like they're not related to the code you mentioned, but rather used elsewhere. I've checked it, it looks like a giant SQL query is being generated which is highly likely sub-optimal.

vkhodygo commented 2 years ago

I think my initial assumption was more or less correct. Column creation (altering) takes too much time. I thought about using ENUMs for this, but apparently this is not much better in terms of performance. As alternative, we should try reference tables and see how it goes.

vkhodygo commented 2 years ago

@pbronka

Could you share a basic data sample that you use to create a database? I plan to do some profiling, but I need some initial data to work with.

pbronka commented 2 years ago

I think my initial assumption was more or less correct. Column creation (altering) takes too much time. I thought about using ENUMs for this, but apparently this is not much better in terms of performance. As alternative, we should try reference tables and see how it goes.

Is this about the input database (the aforementioned SQLdataParser) or saving outputs?

@pbronka

Could you share a basic data sample that you use to create a database? I plan to do some profiling, but I need some initial data to work with.

This will be model dependent, I believe - assuming you are happy to work with LABSim, you can use SQLdataParser to create database tables from the CSV files that are in the private repo on development_UK-multirun_1.02/input branch. If you want a working copy of the database, you can unzip [input.h2.zip] and use that.

PS. Let me know how to profile performance of the database interactions. I had a go with a profiler, but found trying to profile SQL queries generated inside of LABSim / JAS-mine hard to work with. When I connect to the database without LABSim / JAS-mine running, using IntelliJ database manager, most operations are a matter of milliseconds. Which makes me think it's due to the old driver / connection setup.

vkhodygo commented 2 years ago

Is this about the input database (the aforementioned SQLdataParser) or saving outputs?

Input, but now I realize I might've mixed a few problems together. Anyway, I see that input SQL queries can be improved, even if to save some memory, a bit of time or make them more manageable.

The DatabaseUtils class must be rewritten from scratch as the API changes are too severe. I suppose this is the one responsible for data saving, correct me if I'm wrong.

model dependent

I don't really need that much, I've extracted one of the queries, so I need a heavy sample CSV to feed in to find the bottleneck.

pbronka commented 2 years ago

model dependent

I don't really need that much, I've extracted one of the queries, so I need a heavy sample CSV to feed in to find the bottleneck.

I could probably upload some CSV outputs for you, around 5GB. Will e-mail you the link.

vkhodygo commented 2 years ago

Thanks, I'll try it.

P.S. No worries, I'll keep it private as well.

vkhodygo commented 2 years ago

Hi @pbronka

I'd like to hear your thoughts regarding b0e9e74. Other commits do not change anything, just remove things that we don't really use/need.

Apologies, this PR turns into a small mess: I planned to introduce alignment changes only, but I had to wait for your review and I went astray.

P.S. Test coverage says it's somewhere around 70% at the moment. Most critical parts are covered, however, I can't properly test sections with file/directory permissions at the moment. Please also note that I replaced all exceptions with log messages. It's better to have your data unsaved and try to do this again than to stop the whole program.

pbronka commented 2 years ago

Hi @pbronka

I'd like to hear your thoughts regarding b0e9e74. Other commits do not change anything, just remove things that we don't really use/need.

Apologies, this PR turns into a small mess: I planned to introduce alignment changes only, but I had to wait for your review and I went astray.

P.S. Test coverage says it's somewhere around 70% at the moment. Most critical parts are covered, however, I can't properly test sections with file/directory permissions at the moment. Please also note that I replaced all exceptions with log messages. It's better to have your data unsaved and try to do this again than to stop the whole program.

Thanks @vkhodygo. Looks good to me, although I'm not sure about replacing all exceptions with log messages and continuing with the simulation. This is because I think the usual use case will be to generate data for further analysis and, especially if the simulation takes a long time to run, I would rather know immediately that a critical error occurred than wait until the end. So I would be curious to hear your thoughts on why it is always better not to stop the program.

vkhodygo commented 2 years ago

@pbronka That was probably my misunderstanding, but saving the data happens after the simulation is over, right? If there is something that prevents you from doing so, you can try and resolve this issue and reattempt saving again without loosing any data and running the simulation again. This works only in the case of data manipulation, early simulation termination is more beneficial here as you said.

pbronka commented 2 years ago

@pbronka That was probably my misunderstanding, but saving the data happens after the simulation is over, right?

That will be model specific and adjustable, but in LABSim data is saved at the end of every time period (year) (by the Collector).

By the way - this is done through the https://github.com/jasmineRepo/JAS-mine-core/blob/dev/5.0.0/microsim-core/src/main/java/microsim/data/DataExport.java class, so if ExportCSV is replaced with ExportToCSV that will have to be adjusted too.

vkhodygo commented 2 years ago

This is the latest building version and I think we should merge it rather than keeping in this state. Note that this is a WIP, I left some todo and fixme comments that will be moved to separate issues later. I also haven't changed back log events to exceptions yet. @pbronka if you think this is a good Idea, could you please open a separate issue for this to track it? Or you could go through the code later and mark the spots which are to be changed.

pbronka commented 2 years ago

Hi @vkhodygo thanks for this. Sounds good to me and we can go ahead and merge this PR. The only thing I'd like to point out is that I don't think we will be able to use it without further changes to the GUI due to class names being updated (I opened an issue to keep track of this here https://github.com/jasmineRepo/JAS-mine-gui/issues/5).