peter-lawrey / Java-Chronicle

Java Indexed Record Chronicle
1.22k stars 193 forks source link

Subsequent writes append null characters to previous record #12

Closed jknehr closed 11 years ago

jknehr commented 12 years ago

I have not had a significant amount of time to look further into why this is happening, but if you run the following example I quickly put together together, you should see that one of the read messages will have extra bytes padding the end of it. This doesn't happen until after a subsequent write.

So, for example, if the below class tells you that the data at index 7 does not match what was expected, it only became that way after data at index 8 was written. If you were to write the first 8 byte arrays (instead of the 10 I am showing below), you will see that there are no errors. It isn't until the 9th one is written that it appends the null bytes to the 8th record.

It seems this pattern repeats every 8 records.

I'll see if I can provide more information when I get some more time.

Thanks.

package vanilla.java.chronicle.impl;

import org.junit.Test;
import vanilla.java.chronicle.Excerpt;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

public class WritePlacementTest {

    private IndexedChronicle chronicle;
    private Random random;
    private ArrayList<byte[]> msgs;

    public WritePlacementTest() throws IOException {
        chronicle = new IndexedChronicle(System.getProperty("java.io.tmpdir"), 12);
        chronicle.useUnsafe(false);
        chronicle.clear();

        random = new Random();
        random.setSeed(System.currentTimeMillis());

        msgs = new ArrayList<byte[]>();
    } // end constructor

    @Test
    public void WriteReadTest() throws Exception {

        int i;

        // make random byte arrays
        for(i = 0; i < 10; i++) {
            byte[] b= new byte[500];
            random.nextBytes(b);
            msgs.add(b);
        }

        Excerpt<IndexedChronicle> excerpt = chronicle.createExcerpt();

        // write msgs
        for(i = 0; i < msgs.size(); i++) {
            excerpt.startExcerpt(msgs.get(i).length);
            excerpt.write(msgs.get(i));
            excerpt.finish();
        }

        // read back msgs
        i = 0;
       while(excerpt.index(i) && i < msgs.size()) {
           byte[] read = new byte[excerpt.remaining()];

           excerpt.readFully(read);

           if(read.length != msgs.get(i).length) {
               System.out.printf("The lengths do not match! Index = %d\n", i);
               System.out.printf("\t[EXPECTED] %s\n", byteArrayToHexString(msgs.get(i)));
               System.out.printf("\t[READ]     %s\n", byteArrayToHexString(read));
           }

           i++;
       }
    }

    public static String byteArrayToHexString(byte[] b) {
        StringBuffer sb = new StringBuffer(b.length * 2);
        for (int i = 0; i < b.length; i++) {
            int v = b[i] & 0xff;
            if (v < 16) {
                sb.append('0');
            }
            sb.append(Integer.toHexString(v));
        }
        return sb.toString().toUpperCase();
    }

} // end class WritePlacementTest
peter-lawrey commented 12 years ago

An excellent demonstration of the problem.

This is more an undocumented "feature" than a bug. When you set the data size bits, you specify how much big (as a power of 2) the block or file of data should be.

So that your data can be continous (and all in one memory mapping) there must be at least the capacity of data you ask for free or the next block/file will be used. (It doesn't know how much you will ask for until you write the next record which is why it changes when you ask) This results in the last entry in each block/file being padded with zeros which you can usually ignore. Actually it skips the data rather than writing to it. If you can't ignore the padded zeros, you will need to record the length you really wanted as well. (The library could do this for you but it would be inefficent to do this if you didn't need it)

In your case the block/file size is 1 << 12 or 4096 bytes which will fit 8 * 500 records, but when you try to add a 9th 500 byte record, the previous record is extended to the end of the block/file so the record can start with the next block/file.

You can make the problem "go away" but setting the file size to be 30 bits (assuming you don't want to write more than 1 GB) This won't be as inefficient as it sounds as it only populates the regions of the file you touch. It will appears to be 1 GB, but won't actually use that much space.

Obvious there is a limitation that you can't have a single record which is larger than the block/file size. The maximum block/file size is 1 GB which is also the maximum record size. (This is due to a limitation in Java as the largest power of 2 memory mapped block size is 1 GB)

jknehr commented 12 years ago

Great explanation, thank you very much for the prompt reply!

I went with your proposed solution of storing the length since I expect journal sizes greater than 1GB.

For others reading, it goes like this...

Write operation:

excerpt.startExcerpt(msgs.get(i).length + (Integer.SIZE/8));
excerpt.writeInt(msgs.get(i).length);
excerpt.write(msgs.get(i));
excerpt.finish();

And then the read operation would look something like this:

excerpt.index(i);
int length = excerpt.readInt();
byte[] read = new byte[length];
excerpt.readFully(read);

Thanks again

peter-lawrey commented 11 years ago

For now, this is still the best work around.