iryndin / jdbf

Java utility to read/write DBF files

MemoReader.read() method reading from wrong location #37

Open · wescleveland56 opened this issue 7 years ago

wescleveland56 commented 7 years ago

I have a Visual FoxPro dbf with an accompanying fpt that results in the process being halted, with no error message, during MemoReader.read() processing as part of the DbfRecord.getMemoAsString() call. After some debugging, I discovered that the InputStream.skip() call below was only skipping 8192 bytes instead of the number of bytes requested (8192 happens to be BufferedInputStream's default buffer size, which likely explains the exact figure).

memoInputStream.skip(memoHeader.getBlockSize()*offsetInBlocks);

According to the documentation at https://docs.oracle.com/javase/8/docs/api/java/io/InputStream.html#skip-long-, the InputStream.skip() method may skip fewer bytes than requested:

Skips over and discards n bytes of data from this input stream. The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned. If n is negative, the skip method for class InputStream always returns 0, and no bytes are skipped. Subclasses may handle the negative value differently.

My workaround was to add a new method in IOUtils:

public static long inputStreamSkip(InputStream stream, long bytesToSkip) throws IOException {
    long bytesSkipped = stream.skip(bytesToSkip);
    long totalBytesSkipped = bytesSkipped;
    // Keep calling skip() until the requested count is reached;
    // give up if skip() returns 0, so the loop cannot run forever.
    while (totalBytesSkipped < bytesToSkip && bytesSkipped > 0) {
        bytesSkipped = stream.skip(bytesToSkip - totalBytesSkipped);
        totalBytesSkipped += bytesSkipped;
    }
    return totalBytesSkipped;
}

and then replace the above MemoReader.read() line with:

IOUtils.inputStreamSkip( memoInputStream, memoHeader.getBlockSize()*offsetInBlocks );

I have also experienced similar behavior with the DbfReader.seek() method but didn't know why until now. I suspect that every InputStream.skip() call in the codebase could suffer from this same issue.

jferard commented 7 years ago

Hello @wescleveland56. For a similar issue with InputStream.read and a similar fix, see this PR: https://github.com/iryndin/jdbf/pull/36. I'm still waiting... However, I think your method may introduce a subtle bug: bytesSkipped == 0 doesn't mean that we can't skip any more bytes, so you have no guarantee that bytesToSkip bytes were actually skipped after the method call, and you may still read the wrong data. To fix this, maybe MemoReader should read the whole memo file once and store its data in a Memo object. Or MemoReader could use a RandomAccessFile. I don't know, but I'm pretty sure it would be better if we had something like this (no matter what is behind it):

interface Memo {
    MemoRecord get(String fieldName); // uses memoInputStream data and metadata from dbfInputStream 
}
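For illustration only, a minimal in-memory implementation of that interface might look like this sketch (InMemoryMemo is a hypothetical class name, and the fieldName-to-record map is assumed to be built while parsing the memo file once):

import java.util.HashMap;
import java.util.Map;

// Sketch: parse the memo file once up front, then serve lookups from memory.
public class InMemoryMemo implements Memo {
    private final Map<String, MemoRecord> recordsByField;

    public InMemoryMemo(Map<String, MemoRecord> recordsByField) {
        this.recordsByField = new HashMap<>(recordsByField);
    }

    @Override
    public MemoRecord get(String fieldName) {
        return recordsByField.get(fieldName);
    }
}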
wescleveland56 commented 7 years ago

Hi @jferard. Without the bytesSkipped > 0 condition you risk an endless loop if skip() keeps returning 0. Perhaps the answer is to throw an exception after the while loop if totalBytesSkipped != bytesToSkip, or to let the caller check the returned totalBytesSkipped and, if it's not what they expect, deal with it in their own way.
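A strict variant along those lines might look like this sketch (skipFully and SkipUtils are hypothetical names, not existing jdbf methods; the read() fallback distinguishes real end-of-stream from a skip() that merely declined to skip):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public final class SkipUtils {
    // Sketch: skip exactly bytesToSkip bytes or fail loudly, instead of returning a short count.
    public static void skipFully(InputStream stream, long bytesToSkip) throws IOException {
        long remaining = bytesToSkip;
        while (remaining > 0) {
            long skipped = stream.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() returned 0, which does not necessarily mean EOF:
            // read a single byte to find out, and count it if it succeeds.
            if (stream.read() == -1) {
                throw new EOFException(remaining + " bytes left to skip at end of stream");
            }
            remaining--;
        }
    }
}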

In my opinion, having MemoReader read the whole memo file once is not a viable solution, due to the possibility of limited RAM and potentially many large memo files being open at the same time.

jferard commented 7 years ago

OK @wescleveland56, it's true that there is a risk of an endless loop if you remove the bytesSkipped > 0 condition. But if you keep it, there is a risk of reading the wrong data.

It's clear that the current poor man's seek method is not good, because:

  1. it may fail to reach the right position (see above);
  2. InputStream.reset may throw an IOException if the mark is invalid (more than 1024 KB read);
  3. it tries to keep the memo file in memory: the buffer used by the mark/reset mechanism may grow to 1024 KB per file.

I'm not familiar with memo file sizes. Let's assume that we don't want to store memo files in memory. In that case, I think there is no reason to emulate a random access file on top of a sequential access file. Everything we need is already there: https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html. For speed, one may add a cache.
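For illustration, a seek-based block read could be as simple as this sketch (the class and its constructor arguments are assumptions, not jdbf's actual API):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: random access to memo blocks; blockSize comes from the memo header.
public class RandomAccessMemoReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final int blockSize;

    public RandomAccessMemoReader(File fptFile, int blockSize) throws IOException {
        this.file = new RandomAccessFile(fptFile, "r");
        this.blockSize = blockSize;
    }

    // seek() positions exactly, unlike InputStream.skip(), which may skip fewer bytes.
    public byte[] readBlock(long offsetInBlocks, int length) throws IOException {
        file.seek(blockSize * offsetInBlocks);
        byte[] data = new byte[length];
        file.readFully(data); // throws EOFException instead of silently reading less
        return data;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}

A small cache of recently read blocks could be layered on top if profiling shows the seeks are a bottleneck.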

wescleveland56 commented 7 years ago

@jferard I agree true random access would be the better approach. BTW, thanks to you and the rest of the jdbf development team for all your work on jdbf. It's the best Java DBase/FoxBase interface I've found.

jferard commented 7 years ago

Unfortunately, I'm not a member of the team. I agree that jdbf is a very good library, and I have some ideas to improve it, but @iryndin is the only master on board and he hasn't responded since January 29...