osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License

CsvReader<CsvRecord> records are empty when iterated a second time in a separate method #106

Closed · dbulahov closed this issue 7 months ago

dbulahov commented 7 months ago

Please see example https://github.com/osiegmar/FastCSV/pull/105

osiegmar commented 7 months ago

Your example demonstrates the following:

try (CsvReader<CsvRecord> csv = CsvReader.builder().ofCsvRecord(file)) {
    // first loop
    for (CsvRecord r : csv) {
        System.out.println(r);
    }

    // second loop
    for (CsvRecord r : csv) {
        System.out.println(r);
    }
}

The used for-each loop is a short version of the following:

try (CsvReader<CsvRecord> csv = CsvReader.builder().ofCsvRecord(file)) {
    Iterator<CsvRecord> iterator = csv.iterator();

    // first loop
    while (iterator.hasNext()) {
        System.out.println(iterator.next());
    }

    // second loop
    while (iterator.hasNext()) {
        System.out.println(iterator.next());
    }
}

The first loop displays all records, but the second loop is empty. This is expected, and FastCSV evidently requires additional documentation in this regard.

The call to CsvReader.builder().ofCsvRecord(file) returns an Iterable of CsvRecord (CsvReader). With each iteration of the loop, you read (consume) one record from the file. After all records are read, there are no more records.

If you truly want to store all records of the file for repeated access, you'd have to collect them:

try (CsvReader<CsvRecord> csv = CsvReader.builder().ofCsvRecord(file)) {
    final List<CsvRecord> records = csv.stream().toList();

    // first loop
    for (CsvRecord r : records) {
        System.out.println(r);
    }

    // second loop
    for (CsvRecord r : records) {
        System.out.println(r);
    }
}

Keep in mind that with this approach, all records of the file have to be kept in memory, while the streaming approach of FastCSV allows you to read huge CSV files with only a little memory consumption.
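
If you don't want to keep everything in memory but still need a second pass, an alternative (just a sketch, reusing only the builder calls shown above) is to open a fresh CsvReader for each pass and pay the cost of parsing the file twice:

private static void readTwice(Path file) throws IOException {
    // first pass: opens, reads and closes the file
    try (CsvReader<CsvRecord> csv = CsvReader.builder().ofCsvRecord(file)) {
        for (CsvRecord r : csv) {
            System.out.println(r);
        }
    }

    // second pass: a new reader starts again at the beginning of the file
    try (CsvReader<CsvRecord> csv = CsvReader.builder().ofCsvRecord(file)) {
        for (CsvRecord r : csv) {
            System.out.println(r);
        }
    }
}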

stbischof commented 7 months ago

If I remember correctly, the current use case is handling CSV with typed columns:

col1,col2
String,int
Foo,1
Bar,987
...

There you need information from the header and from line 2, and then want to run fast over all the following lines. What would be the fastest way to handle this?

osiegmar commented 7 months ago

The fastest way would be a custom callback handler – see this example. But I wouldn't recommend that way unless you absolutely need the last bit of performance.

A very fast and maintainable way would be something like this:

private static void readFile(Path file) throws IOException {
    // Make sure we have the same number of fields in each record
    CsvReader.CsvReaderBuilder builder = CsvReader.builder()
        .ignoreDifferentFieldCount(false);

    try (CloseableIterator<CsvRecord> it = builder.ofCsvRecord(file).iterator()) {
        if (!it.hasNext()) {
            throw new IllegalStateException("No header found");
        }
        List<String> header = it.next().getFields();

        if (!it.hasNext()) {
            throw new IllegalStateException("No data types found");
        }
        List<String> dataTypes = it.next().getFields();

        while (it.hasNext()) {
            YourCustomRecord yourCustomRecord = mapRecord(header, dataTypes, it.next());

            // do something with your record
            System.out.println(yourCustomRecord);
        }
    }
}

private static YourCustomRecord mapRecord(List<String> header, List<String> dataTypes, CsvRecord record) {
    // Map the record to your needs with the given header and data types
    return new YourCustomRecord(record.getField(0), Integer.parseInt(record.getField(1)));
}

private record YourCustomRecord(String s, int i) {
}

You may also use NamedCsvRecords (.ofNamedCsvRecord(file)) if you want to access the fields by name. The efficiency of your mapRecord implementation will be crucial to the performance of the whole process. But that highly depends on your exact needs.
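
For completeness, a rough sketch of that by-name variant (assuming NamedCsvRecord exposes getField(String); with .ofNamedCsvRecord(file) the header row is consumed by the reader itself, so the data type row arrives as the first record):

private static void readFileByName(Path file) throws IOException {
    CsvReader.CsvReaderBuilder builder = CsvReader.builder()
        .ignoreDifferentFieldCount(false);

    try (CloseableIterator<NamedCsvRecord> it = builder.ofNamedCsvRecord(file).iterator()) {
        if (!it.hasNext()) {
            throw new IllegalStateException("No data types found");
        }
        // the header was already consumed; this first record holds the data types (unused in this sketch)
        NamedCsvRecord dataTypes = it.next();

        while (it.hasNext()) {
            NamedCsvRecord r = it.next();
            // access the fields by header name instead of index
            System.out.println(new YourCustomRecord(
                r.getField("col1"), Integer.parseInt(r.getField("col2"))));
        }
    }
}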

In the future, with Stream Gatherers (JEP 461, a Preview Feature of Java 22) we'll also be able to implement this using Java streams. The following code uses build 22-ea+31-2314 of OpenJDK:

private static void readFile(Path file) throws IOException {
    // Make sure we have the same number of fields in each record
    CsvReader.CsvReaderBuilder builder = CsvReader.builder()
        .ignoreDifferentFieldCount(false);

    try (Stream<CsvRecord> stream = builder.ofCsvRecord(file).stream()) {
        stream
            .gather(Gatherer.ofSequential(new YourCustomIntegrator()))
            .forEach(yourCustomRecord -> {
                // do something with your record
                System.out.println(yourCustomRecord);
            });
    }
}

private static class YourCustomIntegrator implements Gatherer.Integrator<Void, CsvRecord, YourCustomRecord> {

    private List<String> header;
    private List<String> dataTypes;

    @Override
    public boolean integrate(Void state, CsvRecord element,
                             Gatherer.Downstream<? super YourCustomRecord> downstream) {
        if (dataTypes != null) {
            return downstream.push(mapRecord(header, dataTypes, element));
        }

        if (header == null) {
            header = element.getFields();
        } else {
            dataTypes = element.getFields();
        }
        return true;
    }

}

dbulahov commented 7 months ago

Thank you very much for your help