imixs / imixs-workflow

The open source technology for business process management
http://www.imixs.org/
GNU General Public License v3.0
351 stars 67 forks source link

Performance improvements #697

Open rsoika opened 4 years ago

rsoika commented 4 years ago

The deepCopy method from the ItemCollection is the most expensive method concerning overall performance. We can not optimize or improve this method further, but we can avoid a expensive deepCopy in many cases by replaceing

new ItemCollection(_task)

with

ItemCollection.createByReference(_task.getAllItems());

Candiates for this performance improvements are:

The method createByReference

    public static ItemCollection createByReference(final Map<String, List<Object>> map) {
        ItemCollection reference = new ItemCollection();
        if (map != null) {
            reference.hash = map;
        }
        return reference;
    }

should be replaced with something like:

        Map<Object, Object> mapCopy =  map
        .entrySet()
        .stream()
        .collect(Collectors.toMap(Map.Entry::getKey, valueMapper -> new ArrayList<>(valueMapper.getValue())));
bfsc commented 4 years ago

Hi @rsoika

I have faced similiar performance problems related to deep clones, specially when using a serialization approach, which is the most expensive way to do it.

I published a research paper in a renowed academic conference (ACM/IEEE International Symposium of Empirical Software Engineering and Measurament) presenting a new way to deal with object clones in java. I called it Lazy Clone. It clones the graph of objects on demand, instead of cloning it all as deep clone strategies do. This strategy has improved the performance of an entire real world system in more than 80%.

Following is the pdf of the pre-print of the research paper entitled as "Improving performance and maintainability of object cloning with lazy clones: An empirical evaluation": https://bit.ly/3aVm7lh

One of the know problems of the lazy clone approach I proposed occurs when the original object is mutable. So, it is a tradeoff one has to take into consideration when adopting a lazy clone approach.

I would be happy to help you with this, and maybe do a pull request. Please let me know if you consider such kind of contribution relevant to this project.

Cheers, Bruno Cartaxo

rsoika commented 4 years ago

Hi Bruno,

yes of course this sounds interesting.

The reason why I came up to the deepCopy approach was initially the JPA part of the Imixs-Workflow engine.

We are dealing with 'gerneric value objects' called ItemCollection. An ItemCollection contains a Java HashMap storing only serializable objects. The HashMap may contain primitives (copied by value) but also things like byte arrays or even more complex serializable structures.

Normally you would expect that a JPA call manager.detach(myObject) would be fine. But the workflow engine has to deal with situations, where a business object is part of a complex processing life-cycle. Within one transaction the object is loaded, saved, loaded saved, loaded .... In this situation only a deep copy ensures that the object holds no reference to unsubmitted data Issue 230.

Ok - this was the initial part of the problem. I don't think that the serializion of the HashMap in this special situation of long running JPA transaction is a problem. We need a full copy so I do not expect a performance boost with a lazy loading approach. And to be honest, this one is a very tricky part within the workflow engine where I don't want to make any big changes.

But the reason for this issue (#697) is that the DeepCopy method has quietly sneaked into various other parts of the program, in which a LazyLoading would actually be an elegant and certainly high-performance solution.

So my idea was in deed to provide some new kind of a copy method. Your LazyLoading seems what is needed here!.

As far as I understand your approach you are working with reflection to detect getter methods starting with 'get' or 'is'. The ItemCollection itself has a lot of such methods. So I fear this would generate a lot of work. But maybe it is possible to just observe the getItemValue method calling the method 'hash.get(..)' method inside?

The current implementation of my deepCopy looks like this:

    private Object deepCopyOfMap(Map<String, List<Object>> map) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            // serialize and pass the object
            oos.writeObject(map);
            oos.flush();
            ByteArrayInputStream bais = new ByteArrayInputStream(bos.toByteArray());
            ObjectInputStream ois = new ObjectInputStream(bais);
            return ois.readObject();
        } catch (IOException e) {
            logger.warning("Unable to clone values of ItemCollection - " + e);
            return null;
        } catch (ClassNotFoundException e) {
            logger.warning("Unable to clone values of ItemCollection - " + e);
            return null;
        }
    }

What I did not understand so far is: would you lazyLoad the ItemCollection or just the HashMap inside the ItemCollection?

Do you think a 'lazyCopyOfMap' would be possible?

bfsc commented 4 years ago

to be honest, this one is a very tricky part within the workflow engine where I don't want to make any big changes.

fair enough, specially if ou don't have performance problems in that situation.

the DeepCopy method has quietly sneaked into various other parts of the program, in which a LazyLoading would actually be an elegant and certainly high-performance solution.

that's it! that's what happened in the project i did a performance consulting job. the deep clone method was initially used to solve a small and isolated problem, but after 5 years developing that software it was spread all over the system provoking numerous performance issues.

The current implementation of my deepCopy looks like this:

i see. this is the most common way to make a deep clone using a serialization approach. convinient and stratighforward. unfortunatelly it is very expensive. on the article i sent you i analyzed five different implementations of clones. three deep and two lazy clone implementations . you can try another deep clone implementation that is easy to do, but according to my bechmarks it is just slightly faster that the serialization implementation. the reflection implementation. at the time i wrote the article i used the following library https://github.com/kostaskougios/cloning

What I did not understand so far is: would you lazyLoad the ItemCollection or just the HashMap inside the ItemCollection?

it is up to the developer. with a lazy clone implementation you can choose wich object you would like to lazily clone. but once you make a lazy clone of it, all the fields of that object that are neither primitive nor immutable will be lazily cloned, just when someone try to directly access that field (for a aspect oriented implementation) or when trying to access that field through a get method (dynamic proxy implementation).

Do you think a 'lazyCopyOfMap' would be possible?

with the lazy clone implementation with dynamic proxy (which is not as fast as the aspect one) it is possible. however, with the lazy clone implementation with aspect it would be a mere deep clone due to some limitations of my firt implementation.

anyway... i will try to devote some time to implement a lazy clone library with the two implementations (with dynamic proxies and aspects) since i can see there are many projects facing problems related to low maintainability and performance due to deep clones. when i have it done i come back here and share with you so we can test in a non-critical part of this project. what do you think?

best, bruno cartaxo

rsoika commented 4 years ago

Thanks for these insights. Yes it sounds interesting. I have now the following vision for the new implementation of the ItemCollection Class:

Only in the clone() method performs a full clone using serialization. In the critical part of the Imixs-Workflow engine I can call explicitly the clone() method to get a full deep copy, which is fine.

But the constructor method:

ItemCollection(Map<String, List<Object>> map)

will implement the faster lazy lading approach as you mentioned. This will will be a clear concept for developers. The most code fragments are simply using the constructor with the map parameter. So I expect a performance boost over all.

But still I wonder what happens in the following scenario:

1) A client is using the lazy loading method,

    ItemCollection myNewObject = new ItemCollection(someHashMap);

all primitives are copied by value but things like byte arrays or embedded lists are ignored.

2) This may be fine for the client because he just want to set some new values:

    myNewObject.setItemValue("someFieldName","... and some String...");

3) Now finally the client want to save this object by calling the EJB/JPA method:

    // call EJB method....
    documentService.save(myNewObject);

The save method of the DocumentService EJB is currently doing the following:

    // perform a deep copy of the given object (still not full loaded because of lazy loading)
    ItemCollection clone = (ItemCollection) document.clone(); 
    // copy the map into a JPA entity to be persisted by the JPA mechanism
        jpaObject.setData(clone.getAllItems()); 

I expect that here some extra work is needed, as the hashMap need to be fully loaded before I can clone it? Otherwise not all fields of my hashmap will be persisted which is not what the client is expecting.

What do you think about this scenario? - btw, I invited you to join the project.

bfsc commented 4 years ago

hey @rsoika sorry for the late reply, last days were quite busy.

I expect that here some extra work is needed, as the hashMap need to be fully loaded before I can clone it? Otherwise not all fields of my hashmap will be persisted which is not what the client is expecting.

that's a good question! if your clone method access the fields of your lazily clone object the lazy clone would work as expected by cloning each field on demand. but since your clone implementation is serialization based we have to check how it would behave.

as I told you, I'm trying to find time to work on a library to implement the lazy clone approach an make it availbale to everyone who want to use it. but it may take sometime until i have it fully implemented due to lake of time right now. so, as soon as a have it done i come back here for sure and share it with you.

btw... thank you for invite me to joi the project. it is a pleasure and i hope i can contribute as soon as i have that implementation.