peter-lawrey / HugeCollections-OLD

Huge Collections for Java using efficient off heap storage

Chronicle Map not thread-safe in certain circumstances #43

Closed andycrutchlow closed 10 years ago

andycrutchlow commented 10 years ago

When concurrent threads read/write DIFFERENT map entries through the standard DataValueClass-generated getters/setters (which marshal to the corresponding UNSAFE methods), and those map entries physically reside in the SAME SEGMENT, there is a possibility of data corruption. The problem seems to have appeared since version 3.0.2 of collections, i.e. I cannot reliably reproduce it on that version, but I can reproduce it fairly consistently on the latest 3.2.1 collections and 6.4.6 lang, on two completely different platforms: Ubuntu 12.04 with an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ × 2, and Mac OS X 10.7.5 with a 2.2 GHz Core i7, using the same test program below. NOTE: ignore the JUnit test result; it is the logging that indicates the actual problem.

package com.nimrod.test;

import static org.junit.Assert.*;

import java.io.File;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

import net.openhft.collections.SharedHashMap;
import net.openhft.collections.SharedHashMapBuilder;
import net.openhft.lang.model.DataValueClasses;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestSharedHashMap {
    final static Logger logger = LoggerFactory.getLogger(TestSharedHashMap.class);
    SharedHashMap<String, TestDataValue> shm;
    TestDataValue data1 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data2 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data3 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data4 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data5 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data6 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data7 = DataValueClasses.newDirectReference(TestDataValue.class);
    TestDataValue data8 = DataValueClasses.newDirectReference(TestDataValue.class);

ThreadPoolExecutor threads = (ThreadPoolExecutor)Executors.newFixedThreadPool(2);

final CountDownLatch endGate = new CountDownLatch(2);

@Before
public void setUp() throws Exception {
    try {
        shm = new SharedHashMapBuilder()
        .generatedValueType(true)
        .entries(2048)
        .entrySize(256)
        .file(new File("/run/shm/TestOpenHFT")).kClass(String.class).vClass(TestDataValue.class).create();

        //For older versions
        // shm = new SharedHashMapBuilder()
        //     .generatedValueType(true)
        //     //.entries(2048)
        //     .entrySize(1024)
        //     .create(new File("/run/shm/TestOpenHFT"), String.class, TestDataValue.class);

        //Map entries are loaded into various segments in mappeddatastore
        shm.acquireUsing("1111.2222.A0", data1);

        shm.acquireUsing("1111.2222.B0", data2);

        shm.acquireUsing("1111.2222.A1", data3);

        shm.acquireUsing("1111.2222.B1", data4);

        shm.acquireUsing("1111.2222.A2", data5);

        shm.acquireUsing("1111.2222.B2", data6);

        //data7 typically points to data in the same segment as data1, i.e. segmentnum=0
        //To see segments I added the line
        //LOG.info("key="+key+" segmentnum="+segmentNum);
        //in the method V lookupUsing(K key, V value, boolean create) in class VanillaSharedHashMap

        shm.acquireUsing("1111.2222.A3", data7);

        shm.acquireUsing("1111.2222.B3", data8);
    } catch(Exception e) {
        endGate.countDown();
        endGate.countDown();
        e.printStackTrace();
    }
}

@After
public void tearDown() {
    try {
        endGate.await();
        shm.close();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

@Test
public void test() {
    //Simple case just to prove all is OK
    data1.setLongString("AAAAAAAAAAAAAAAAAAAA");
    assertEquals("AAAAAAAAAAAAAAAAAAAA", data1.getLongString());
    data7.setLongString("BBBBBBBBBBBBBBBBBBBB");
    assertEquals("BBBBBBBBBBBBBBBBBBBB", data7.getLongString());

    logger.info("Start threads to test concurrent read/write in same segment .. this should usually fail");
    //..change data7 to another e.g. data2 and should work
    threads.execute(new TestTask(data1,"AAAAAAAAAAAAAAAAAAAA"));
    threads.execute(new TestTask(data7,"BBBBBBBBBBBBBBBBBBBB"));
}

class TestTask implements Runnable {
    TestDataValue data; 
    String msg;
    TestTask(TestDataValue data, String msg) {
        this.data = data;
        this.msg = msg;
    }
    public void run() {
        for(int i=0;i<100;i++) {
            data.setLongString(msg);
            logger.info("test "+i+" "+data.getLongString());
            try {
                assertEquals(msg, data.getLongString());
            } catch (AssertionError e) {
                logger.error("AssertionError",e);
                break;
            }
        }
        endGate.countDown();
    }

}

}

peter-lawrey commented 10 years ago

Once you have a reference to an off-heap entry, it works much the same as an on-heap object. The entry may have come from a thread-safe collection, but that doesn't make the entry itself inherently thread safe. To make your code thread safe you need to add entry/record-level locking; see IntValue or JavaBeanInterface for examples. Once you have added a lock field, you need to busyLock() it before accessing the entry and finally unlock() it.
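
To make the suggested pattern concrete, here is a minimal sketch of record-level locking, assuming the generated value class derives busy-lock/unlock methods from an extra int field. The names below (record, busyLockRecord, unlockRecord) are assumptions based on that convention; see the IntValue and JavaBeanInterface examples in the lang module for the actual generated signatures.

    import net.openhft.lang.model.constraints.MaxSize;

    // Hypothetical value interface with an extra int "record" field used purely for locking.
    // busyLockRecord()/unlockRecord() names are assumed from the field-name convention.
    public interface LockableTestDataValue {
        void setLongString(@MaxSize(256) String longString);
        String getLongString();

        int getRecord();
        void setRecord(int record);
        void busyLockRecord() throws InterruptedException;
        void unlockRecord();
    }

    // Usage: hold the record lock around every read/write of the same entry.
    void writeSafely(LockableTestDataValue value, String msg) throws InterruptedException {
        value.busyLockRecord();
        try {
            value.setLongString(msg);
        } finally {
            value.unlockRecord();
        }
    }

The try/finally guarantees the record is unlocked even if the setter throws.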

andycrutchlow commented 10 years ago

If the two threads were operating on the same off-heap entry, potentially concurrently, then absolutely, locking would be needed. But the case I describe above has two threads accessing/mutating two completely different off-heap entries at two different physical memory locations. Thread 1 has acquired the entry referenced by key "1111.2222.A0" and is using data value instance data1, which points to it. Meanwhile, thread 2 has acquired the entry referenced by key "1111.2222.A3" and is using data value instance data7, which points to it. If I were to use busyLock on either or both, I would be locking two completely different entries. The issue seems to be around segments: by virtue of key hashing, data1 and data7 point to shared-memory locations in the same segment. All the other keys in the above test program result in entries in different segments, and if the test is run with any of them other than data7 it works as expected. To demonstrate, change the line threads.execute(new TestTask(data7,"BBBBBBBBBBBBBBBBBBBB")); to use data2 or data3 etc., but NOT data1, as that would require locks.

peter-lawrey commented 10 years ago

In that case, it sounds like a bug, or at least a usability issue. Entries in the same segment are consecutive in memory, and it is possible for one to overwrite another if you write more data than you should. The protection around this is not ideal. I will investigate this today.

andycrutchlow commented 10 years ago

I'm pretty sure I am not writing more data than I should; the string field is defined as max 256. I should have included the interface:

    package com.nimrod.test;

    import net.openhft.lang.model.constraints.MaxSize;

    public interface TestDataValue {
        void setLongString(@MaxSize(256) String longString);
        String getLongString();
    }

Also, as I mentioned, if I use an earlier version of collections and java-lang, e.g. 3.0d and 6.3.3 respectively, it works as expected, so that suggests something has changed.

RobAustin commented 10 years ago

We have raised the corresponding JIRA: https://higherfrequencytrading.atlassian.net/browse/HCOLL-131

danielshaya commented 10 years ago

Hi Andy

Thanks for your feedback; we are very pleased you are using SharedHashMap and grateful that you are contributing to our efforts to improve the quality of OpenHFT software.

It would be enormously helpful to us if you could spend a few minutes answering the following questions:

We understand that people might have privacy concerns and that you might not want to answer all the questions; please don't worry if that's the case. The more information we can gather the better, but feel free to write one-liners if you want!

  1. What's the name of your company?
  2. How many employees are there at the company?
  3. Please describe the project in which OpenHFT software is being used and how it has helped.
  4. Is it in evaluation or production?
  5. Would you consider paying for a support model (http://openhft.net/support/), which could include priority support, training, and bespoke software changes?
  6. How can we improve your experience with OpenHFT software?

Thank you

Daniel

andycrutchlow commented 10 years ago

Happy to provide more info, but not on an issue thread; contact me directly through my email instead.

danielshaya commented 10 years ago

Of course - please send your preferred contact address to daniel.shaya@higherfrequencytrading.com.

Regards

peter-lawrey commented 10 years ago

There was a thread-safety issue around both readUTF and writeUTF, as you suspected. I have added a test here:

https://github.com/OpenHFT/HugeCollections/tree/master/collections/src/test/java/net/openhft/collections/dvgthreadbug

You can see I have included the generated code for the interface. You can dump this generated code with

-Ddvg.dumpcode=true
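
For what it's worth, a small sketch of enabling this from test code. The assumption that setting the property programmatically works (rather than only on the JVM command line) is mine; passing -Ddvg.dumpcode=true to the java command is the route described above.

    // Assumption: dvg.dumpcode must be set before DataValueClasses generates the first class,
    // so set it in a static initialiser (or simply pass -Ddvg.dumpcode=true to the JVM).
    static {
        System.setProperty("dvg.dumpcode", "true");
    }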

Can you confirm that the latest snapshot fixes this issue?

andycrutchlow commented 10 years ago

Got the latest snapshot and yes, I can confirm the fix has addressed the issue. Thanks for the fast turnaround.

peter-lawrey commented 10 years ago

Having a good unit test really helps.

peter-lawrey commented 10 years ago

Note: I added to the test a check that one million get/sets per thread produce less than 2 MB of garbage.
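
A hedged sketch of one way such a check could be written, using the HotSpot-specific com.sun.management.ThreadMXBean to measure per-thread allocation around the loop; this illustrates the idea and is not necessarily how the project's test does it.

    // Hypothetical helper, not the project's actual test: measures bytes allocated by the
    // current thread across a million set/get cycles using the HotSpot-specific
    // com.sun.management.ThreadMXBean, then asserts the total stays under 2 MB.
    void checkGarbage(TestDataValue data) {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) java.lang.management.ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(id);
        for (int i = 0; i < 1000000; i++) {
            data.setLongString("AAAAAAAAAAAAAAAAAAAA");
            data.getLongString();
        }
        long allocated = bean.getThreadAllocatedBytes(id) - before;
        assertTrue("allocated " + allocated + " bytes", allocated < 2 * 1024 * 1024);
    }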

andycrutchlow commented 10 years ago

Nice.. and I see you improved a few other things in the test code as well.

peter-lawrey commented 10 years ago

Note, if you dump the code and include it in your source, it is easier to debug.