CsvRow returns incorrect starting offset for multibyte input

junit commented 2 years ago

Describe the bug: Random access by offset is incorrect.

Test file: Item.csv

Runtime log:

CsvRow[originalLineNumber=1, startingOffset=0, fields=[JHXMMLCARMY0926DYG0111RL, 139707794, Women’s V Neck Nightshirt Cotton Casual Sleepwear Short Sleeve Nightgown S-XXL, ACTIVE, PUBLISHED, , Clothing, 16.32, USD, 16.32, 0.0, , 2038356, VALUE, 0.48, "LB", , Seller Fulfilled, , 2A5KQQ6BAE5S, 05432968344899, , http://www.walmart.com/ip/Women-s-V-Neck-Nightshirt-Cotton-Casual-Sleepwear-Short-Sleeve-Nightgown-S-XXL/139707794, https://i5.walmartimages.com/asr/f277aaf6-4bf0-4635-be9b-9ecf8826bbfa.c175ab2fc00cdfa902ee3408d6d4c586.jpeg, UNNAV, ["UNNAV"], Carlendan, 10/29/2021, 12/31/2049, 10/29/2021, 10/29/2021, 0, , Y, , , , ], comment=false]
CsvRow[originalLineNumber=1, startingOffset=0, fields=[, ], comment=false]

To Reproduce: JUnit test to reproduce the behavior:

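    // Imports needed by this snippet (FastCSV 2.1.0 API)
    import static java.nio.charset.StandardCharsets.UTF_8;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.RandomAccessFile;
    import java.io.UncheckedIOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    import de.siegmar.fastcsv.reader.CloseableIterator;
    import de.siegmar.fastcsv.reader.CsvReader;
    import de.siegmar.fastcsv.reader.CsvRow;

    private static void randomAccessFile() {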
        try {

            final Path path = Paths.get(System.getProperty("user.dir") + "/data/Item.csv");

            // collect row offsets (could also be done in larger chunks)
            final List<Long> offsets;
            try (CsvReader csvReader = CsvReader.builder().build(path, UTF_8)) {
                offsets = csvReader.stream()
                        .map(CsvRow::getStartingOffset)
                        .collect(Collectors.toList());
            }

            // random access read with offset seeking
            try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
                 FileInputStream fin = new FileInputStream(raf.getFD());
                 InputStreamReader isr = new InputStreamReader(fin, UTF_8);
                 CsvReader reader = CsvReader.builder().build(isr);
                 CloseableIterator<CsvRow> iterator = reader.iterator()) {

                // seek to file offset of row 5
                raf.seek(offsets.get(5));
                reader.resetBuffer();
                System.out.println(iterator.next());

                // seek to file offset of row 8
                raf.seek(offsets.get(8));
                reader.resetBuffer();
                System.out.println(iterator.next());
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

Additional context: Java distribution and version to be used (output of java -version).

osiegmar commented 2 years ago

Thanks for your report. That's a tough one.

When I developed the support for random access file operations in 2.1.0, I didn't consider that there's a fundamental problem:

The method de.siegmar.fastcsv.reader.CsvRow#getStartingOffset returns a character offset, whereas java.io.RandomAccessFile#seek seeks to a byte offset.

When the file uses only single-byte characters (basically ASCII), this isn't a problem. The file you're referring to contains a character that UTF-8 encodes as three bytes, 0xE28099 (U+2019, written \u2019 in Java; the apostrophe in "Women’s"), at byte offset 3510. As a consequence, FastCSV returns a character start offset of 4014 for the next (seventh) row, but the byte offset is 4016.
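The mismatch can be reproduced without FastCSV at all. The following is a minimal sketch, assuming a small made-up CSV string (not the reporter's Item.csv); the class and method names are hypothetical and only serve the illustration. It compares the character offset a Reader-based parser would count with the byte offset a seek needs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of FastCSV.
final class CharVsByteOffset {

    /** Number of UTF-16 chars before the marker (what a Reader-based parser counts). */
    static int charOffsetOf(String text, String marker) {
        return text.indexOf(marker);
    }

    /** Number of UTF-8 bytes before the marker (what RandomAccessFile#seek expects). */
    static int byteOffsetOf(String text, String marker) {
        int charOffset = text.indexOf(marker);
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Made-up sample data; the first data row contains U+2019 (three bytes in UTF-8).
        String data = "id,title\n1,Women\u2019s Nightshirt\n2,Socks\n";
        String secondRow = "2,Socks";

        System.out.println("char offset: " + charOffsetOf(data, secondRow));
        System.out.println("byte offset: " + byteOffsetOf(data, secondRow));
    }
}
```

The two values differ by two, the same gap as between 4014 and 4016 above: U+2019 is one char in Java but three bytes in UTF-8.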

As it is not possible to seek to a character offset, CsvRow would need to be changed in order to return a byte offset.

I'll think about this.
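For illustration only, one conceivable user-side workaround is to translate a character offset into a UTF-8 byte offset by re-reading the file from the start. The sketch below is not how FastCSV works or plans to work; the class and method names are hypothetical, it assumes well-formed UTF-8 input, and it costs a full scan up to the requested offset.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of FastCSV.
final class Utf8OffsetSketch {

    /** Counts the UTF-8 bytes that encode the first charOffset chars of the reader. */
    static long utf8ByteOffset(Reader reader, long charOffset) {
        long bytes = 0;
        try {
            for (long i = 0; i < charOffset; i++) {
                int c = reader.read();
                if (c < 0) {
                    throw new IllegalArgumentException("charOffset is past the end of the input");
                }
                if (c < 0x80) {
                    bytes += 1;                 // ASCII
                } else if (c < 0x800) {
                    bytes += 2;                 // e.g. Latin-1 supplement
                } else if (Character.isHighSurrogate((char) c)) {
                    bytes += 4;                 // whole surrogate pair is counted here
                } else if (!Character.isLowSurrogate((char) c)) {
                    bytes += 3;                 // rest of the BMP, e.g. U+2019
                }                               // low surrogate: already counted with its pair
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data, not the reporter's file.
        String data = "id,title\n1,Women\u2019s Nightshirt\n2,Socks\n";
        Path file = Files.createTempFile("offset-sketch", ".csv");
        try {
            Files.write(file, data.getBytes(StandardCharsets.UTF_8));
            long charOffset = data.indexOf("2,Socks");

            long byteOffset;
            try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                byteOffset = utf8ByteOffset(reader, charOffset);
            }

            // Seek with the byte offset and decode the remainder of the file.
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                raf.seek(byteOffset);
                byte[] rest = new byte[(int) (raf.length() - byteOffset)];
                raf.readFully(rest);
                System.out.print(new String(rest, StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because the conversion rescans the file, it defeats much of the purpose of random access for large inputs, which is one reason a byte offset tracked by the reader itself would be the cleaner solution.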

osiegmar commented 2 years ago

The feature added with #57 will be removed from the next release as the current approach is not suitable (as this issue revealed). It seems that quite a lot of work is needed to implement it in a working way. I will open a new ticket for that.

osiegmar commented 2 years ago

The removal was just released as part of 2.2.0.

osiegmar / FastCSV

CsvRow returns incorrect starting offset for multibyte input #59