osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License

Question about parsing random strings #57

Closed – wise-coders closed this issue 2 years ago

wise-coders commented 2 years ago

I want to use FastCSV to implement a large-file CSV editor using JavaFX. I plan to parse only individual rows, at render time. This means I need the CSV parser to parse one String line at a time. Is it possible to have something like this?

CSVParser parser = new CSVParser();
parser.parse(line98);
parser.reset();
parser.parse(line99);

I mean I parse a line, then maybe another one, and so on. To get optimal memory usage, I would instantiate the CSVParser only once.

osiegmar commented 2 years ago

Reading single lines can be simply done by:

Iterator<CsvRow> csv = CsvReader.builder().build(reader).iterator();

CsvRow row1 = csv.next();
CsvRow row2 = csv.next();

Skipping lines could be done by:

for (int i = 0; i < 30 && csv.hasNext(); i++) {
    csv.next();
}

(remember: the only way to know when a CSV record ends is by actually parsing it)

However, FastCSV does not offer a way to reset the iterator (and the internal buffer) in order to allow "scrolling" back to previous lines. I guess this is what you are looking for, isn't it?

wise-coders commented 2 years ago

Thank you for your answer. Unfortunately, my situation is more complex. I store the CSV lines in an ArrayList, which is passed to a JavaFX TableView. The TableView uses TableCells, which get their content by parsing one line. The user may scroll backward as well as forward. Because of memory optimizations, I parse only the visible lines. Say I have 5 million lines and only 20 are visible, so I always parse just those. Therefore I need a way to parse input from a String and not a Reader. If possible, I would create one single parser instance and then pass it individual Strings to be parsed. Would this be possible?

osiegmar commented 2 years ago

Because of memory optimizations, I parse only the visible lines.

As CSV can contain line breaks within fields, how do you know where you need to start parsing "a line"? Before you parse CSV data you don't know where a line starts and ends. And if you have already parsed the data, why would you want to read from a String again (which is not very efficient from a memory perspective)?

I think I'm still missing a piece of that puzzle :)

Therefore I need a way to parse input from a String and not a Reader

To parse data from a String you could use the java.io.StringReader. But creating new CsvReader instances for every String is indeed very inefficient.
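
For example (just a sketch):

CsvRow row = CsvReader.builder()
    .build(new StringReader("foo,\"bar,baz\",qux"))
    .iterator()
    .next();
List<String> fields = row.getFields(); // [foo, bar,baz, qux]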

Currently FastCSV does not seem to suit your needs. I can't promise anything but I'd like to get a better understanding of your needs.

Do you first parse the entire file to get the total number of records? Do you memorize the line starting offsets in order to scroll back correctly (like "line 12345 starts at file offset 8372912")?

wise-coders commented 2 years ago

I thought that in CSV each line is a record, and if some data contains a new line, it is escaped (\n, \r, ...). Probably this is wrong? Please let me know. My intention is to create as few objects as possible out of a CSV file. Consider a file with 5 million records and a total size of 500 MB. I would like to load this into Java as 5 million Strings (1 String per record), which adds 8 bytes × 5 million = 50 MB more memory, which is reasonable. If I stored each individual field, and each record has, say, 40 columns, it would mean 50 × 40 MB = 2 GB, only because I create a String for each field, which is too much. I want to avoid this. So the optimum in my case is to somehow create a list of records, each record as one String. When a record is displayed, it gets parsed again and split into fields. This would be the optimal memory usage.

osiegmar commented 2 years ago

Thanks for sharing this.

In CSV, a record can span multiple lines simply by enclosing line breaks in double quotes. See section 2.6 of RFC 4180. In order to know where a record starts and ends, one has to read and parse the entire record. Line breaks (and other characters) have different meanings depending on where they appear.
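
For illustration (a minimal sketch): the quoted field in the second record below contains a line break, so that record spans two physical lines but is parsed as a single record with two fields.

String csv = "id,comment\n1,\"first line\nsecond line\"\n";
for (CsvRow row : CsvReader.builder().build(new StringReader(csv))) {
    System.out.println(row.getFieldCount() + " fields: " + row.getFields());
}
// both records have 2 fields; the second field of the second record contains the line break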

Talking about (memory) efficiency, a String is probably the worst choice for parsing a fair amount of data. This is because of the internal memory structure of a String (which is an immutable wrapper around a char/byte array). If you have a String that contains all the data of one record and the record spans 40 columns, you end up holding all the data in memory twice (one large String plus 40 column Strings). There is nothing particularly wrong with the 40 column Strings, but the large one is superfluous.

BTW, your calculation "8 bytes × 5 million = 50 MB" is wrong. A String object has a much higher memory footprint than 8 bytes (roughly 4-8 times higher). See this posting for example. Anyway, that should be the concern of the garbage collector. I wouldn't be afraid of a few million short-lived objects in your JavaFX application, but I would try to avoid large and redundant Strings!
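
A very rough back-of-the-envelope illustration (the per-String overhead below is an assumed average, it varies by JVM; the 8 bytes only cover the reference held in the list):

long records = 5_000_000L;
long overheadPerString = 48;              // assumed: object header + fields + backing array header
long total = records * overheadPerString; // ≈ 240 MB of overhead before any character data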

Having said all that, I think you need to know the total number of records and the data offset at which each record starts.

If the user is at scroll position 12345, you look up in your offset list where the CSV data for that position starts. Then you read the record from the file (by seeking to that offset (random access) and reading that chunk of data) and materialize only the Strings that are needed to display the columns. If the file has far more than 5 million records, you might optimize this by storing the data offsets in chunks (record 1 starts at X, the 100th record starts at Y, the 200th ...).

Reading the record at scroll position 12345 with FastCSV could hypothetically look like this:

List<Integer> dataOffsets = ...; // <-- 1st problem
int scrollPosition = 12345;

RandomAccessFile raf = new RandomAccessFile("file.csv", "r");
raf.seek(dataOffsets.get(scrollPosition));
Reader reader = new FileReader(raf.getFD());

CsvRow row = CsvReader.builder().build(reader).iterator().next(); // <-- 2nd problem

As I said before, FastCSV currently does not provide a way to know the data offsets (only the line number). Also, creating a new CsvReader instance for every scroll is pretty inefficient (but probably negligible, as scroll events don't happen thousands of times).
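
For reference, the line number is already exposed on each row (a small sketch, assuming a Reader named reader):

CsvRow first = CsvReader.builder().build(reader).iterator().next();
long line = first.getOriginalLineNumber(); // the line on which the record started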

osiegmar commented 2 years ago

You could give branch feature/data-offset a try. It offers de.siegmar.fastcsv.reader.CsvRow#getStartingOffset to obtain the data offset. But there's still no way to re-use the CsvReader instance.

wise-coders commented 2 years ago

Thank you very much for your reply. I find it very interesting.

You are right about the doubled memory used by the Strings, but this happens only for the visible records, which is not much. The nice thing in JavaFX is that the visible cells get reused, so the duplicates will soon be released and garbage collected.

I already did an implementation of the editor; I can share a few screenshots. For now, it uses the Apache CSV parser and treats each line of text as a String, loaded into an ArrayList. I found a 600 MB CSV file and opened it in the CSV editor: [screenshot]

The memory usage is about 1.2 GB.

[screenshot]

It is not perfect, but from my searches I couldn't find any other large-file CSV editors. From the developer's perspective, it is very convenient to work with a list of records: for each operation I parse the record, and if there are modifications I compose the record back. I agree that I cannot use FastCSV for this as it is, but I can give it a try and write a parser just for Strings. I saw the method RowHandler.consume() in your code, so I probably have to write something similar. I will need to:

  1. Split the file into records. Probably I can use the FastCSV
  2. Have a method public String[] CSVParser.parse( String record )
  3. Write back the record using String CSVWriter.write( String[] fields )

One possible approach may be to work with a byte[][], where each byte[] is a record, and for each operation transform the record from byte[] to String. Maybe this is the most memory-efficient option.
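
Something like this hypothetical helper is what I have in mind (just a sketch; the names are illustrative):

static String[] parseRecord(byte[] record) {
    // still builds a CsvReader per record, which is the inefficiency discussed above
    Reader r = new InputStreamReader(new ByteArrayInputStream(record), StandardCharsets.UTF_8);
    Iterator<CsvRow> it = CsvReader.builder().build(r).iterator();
    return it.hasNext() ? it.next().getFields().toArray(new String[0]) : new String[0];
}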

The idea you have, to operate directly on the file, is nice. But consider that the user wants to add or drop a column: then I have to make a copy of the file, apply the changes there, and overwrite the initial file when the user clicks 'Save'. For memory usage this would be the best option, I agree. I will think about this; thank you also for the feature/data-offset branch.

wise-coders commented 2 years ago

I changed the code, and using List<byte[]> the memory usage drops to about 860 MB.

osiegmar commented 2 years ago

I now also implemented a resetBuffer method in order to fully support this use case. See blackbox.reader.CsvReaderTest#seekingOffset for reference.

Basically it's like this:

Path testFile = tmpDir.resolve("test.csv");

// fetch the rows offset
List<Long> offsets = CsvReader.builder().build(testFile, UTF_8).stream()
    .map(CsvRow::getStartingOffset)
    .collect(Collectors.toList());

// parse the file via random access
try (RandomAccessFile raf = new RandomAccessFile(testFile.toFile(), "r");
     Reader fileReader = new InputStreamReader(new FileInputStream(raf.getFD()), UTF_8);
     CsvReader csvReader = CsvReader.builder().build(fileReader)) {

    Iterator<CsvRow> it = csvReader.iterator();

    int scrollPosition = 1000;

    raf.seek(offsets.get(scrollPosition));

    // fetch 10 rows at this position
    for (int i = 0; i < 10; i++) {
        CsvRow row = it.next();
    }

    // scroll event
    scrollPosition = 0;

    csvReader.resetBuffer();
    raf.seek(offsets.get(scrollPosition));

    // fetch 10 rows at this position
    for (int i = 0; i < 10; i++) {
        CsvRow row = it.next();
    }
}

This now allows you to open CSV files of virtually unlimited size within your JavaFX application while keeping memory consumption small.

Fortunately this was possible without sacrificing any performance of FastCSV.

dprutean commented 2 years ago

Great, thank you very much. I will add this to my code during the next few days.

dprutean commented 2 years ago

I searched for the changes in HEAD, but I couldn't find them. Have you pushed the changes?

osiegmar commented 2 years ago

The commits are referenced in this issue. You'll find both in the feature/data-offset branch.

dprutean commented 2 years ago

Great, thank you very much. I got it working. For now, I am parsing the file, splitting it into records, and saving them as a List<byte[]>. I think the approach of accessing the file directly will be more difficult once it comes to editing records and filtering, so for now I will implement it in memory.

Here are a few requests:

  1. Is it possible to add a CSVFormat with some pre-defined formats, like in the Apache parser?
  2. I use some code to parse the input file, then collect the records and write them back as byte[]. For this I need a Writer, and to reduce the number of created objects I wrote my own StringWriter (same as the Java one, but with a clear() method). It would be great if, given a CsvRow and a CsvWriter, it were possible to simply write the record to a String.
public class MyStringWriter extends Writer {

    private final StringBuffer buf = new StringBuffer();
    // ... write(), flush() and close() delegate to buf ...

    public void clear() {
        buf.setLength(0); // delete(0, buf.length() - 1) would leave the last character behind
    }
}
  3. I use this code to load the file into the List<byte[]>. The Java memory usage after running this jumps to 1.5 GB and after a while goes back to 0.8 GB. I think this code still creates many temporary objects. Could you please check this? Maybe I should use streaming, but I don't know how to do it here.
final BufferedReader rd = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
final CsvReader reader = CsvReader.builder()
        .fieldSeparator(',')
        .quoteCharacter('"')
        .commentStrategy(CommentStrategy.SKIP)
        .commentCharacter('#')
        .skipEmptyRows(true)
        .errorOnDifferentFieldCount(false)
        .build(rd);
final MyStringWriter sw = new MyStringWriter();
final CsvWriter writer = CsvWriter.builder().fieldSeparator(',').quoteCharacter('"').build(sw);
boolean isHeader = true;
for (Iterator<CsvRow> itr = reader.iterator(); itr.hasNext(); ) {
    CsvRow row = itr.next();
    if (isHeader) {
        columnNames.addAll(row.getFields());
        isHeader = false;
    } else {
        writer.writeRow(row.getFields());
        lines.add(sw.toString().getBytes());
        sw.clear();
    }
}

To display the data I use the code below. Here I use a cache with a maximum of 100 rows (fewer than 100 rows are visible).

private final ObservableList<byte[]> lines = FXCollections.observableArrayList();
private final LinkedHashMap<Integer,String[]> cache = new LinkedHashMap<>();

    class InternalReader extends Reader {

        private int lineNo = 0;
        private int pos = 0; // read position within the current line

        public void setLine(int lineNo) {
            this.lineNo = lineNo;
            this.pos = 0;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            // off and len refer to positions in cbuf; our position in the source line is tracked in pos
            final char[] charSrc = new String(lines.get(lineNo)).toCharArray();
            if (pos >= charSrc.length) {
                return -1; // end of this line reached
            }
            final int got = Math.min(len, charSrc.length - pos);
            System.arraycopy(charSrc, pos, cbuf, off, got);
            pos += got;
            return got;
        }

        @Override
        public void close() throws IOException {
        }
    }

    final InternalReader internalReader = new InternalReader();
    final CsvReader renderReader = CsvReader.builder()
            .fieldSeparator(',')
            .quoteCharacter('"')
            .commentStrategy(CommentStrategy.SKIP)
            .commentCharacter('#')
            .skipEmptyRows(true)
            .errorOnDifferentFieldCount(false).build( internalReader );

    private String[] getLine( int lineIdx ){
        if ( cache.containsKey( lineIdx )) {
            return cache.get( lineIdx );
        }
        internalReader.setLine( lineIdx );
        String[] result = null;
        try {
            if ( lineIdx > -1 && lineIdx < lines.size() ) {
                Iterator<CsvRow> itr = renderReader.iterator();
                if ( itr.hasNext() ){
                    CsvRow row = itr.next();
                    result = new String[ row.getFieldCount()];
                    row.getFields().toArray( result );
                }
                cache.put( lineIdx, result );
                renderReader.resetBuffer();
            }
        } catch ( Exception ex ){
            ex.printStackTrace();
        }
        if ( cache.size() > 100 ){
            Integer keyToRemove = cache.keySet().iterator().next();
            cache.remove( keyToRemove );
        }
        return result;
    }

osiegmar commented 2 years ago

For now, I am parsing the file, splitting it into records, and saving them as a List<byte[]>.

I think that (some of) your problems are related to exactly this. From your code above I don't see any need for it; in fact, I don't see a reason for doing this at all.

I think the approach of accessing the file directly will be more difficult once it comes to editing records and filtering.

I highly doubt that but without code, it's impossible to judge.

Is it possible to add a CSVFormat with some pre-defined formats

The pre-defined format is the format that is described in RFC 4180. Other (non-standard) formats can be "pre-defined" by the user – like:

CsvReader.CsvReaderBuilder format1 = CsvReader.builder()
    .skipEmptyRows(false);

CsvReader.CsvReaderBuilder format2 = CsvReader.builder()
    .commentStrategy(CommentStrategy.SKIP);

I don't want to implement other software- or vendor-specific formats, especially as such a format definition is only a first step anyway. Whether headers are in use, for example, cannot be pre-defined. And I don't want to keep up with every possible change of those external implementations.

It would be great if, given a CsvRow and a CsvWriter, it were possible to simply write the record to a String

Please keep in mind the primary goals of this library. Writing single records to a String is contrary to that. Also note that StringBuffer is pretty slow and all you want to do is to transform the String into a byte array anyway! This sounds like a job for a ByteArrayOutputStream:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
Writer myWriter = new OutputStreamWriter(bos);
CsvWriter writer = CsvWriter.builder().build(myWriter);

// begin loop

writer.writeRow(...);
byte[] myRow = bos.toByteArray(); // or even .toString() if this would be required
bos.reset();

// end loop

I use this code to load the file into the List<byte[]>. The Java memory usage after running this jumps to 1.5 GB and after a while goes back to 0.8 GB. I think this code still creates many temporary objects. Could you please check this?

We addressed one major source of temporary objects above – the Strings. By using the ByteArrayOutputStream you will see an improvement.

Please also note that your use of BufferedReader in order to read the file is sub-optimal as described in the JavaDoc on de.siegmar.fastcsv.reader.CsvReader.CsvReaderBuilder#build(java.io.Reader):

This library uses built-in buffering, so you do not need to pass in a buffered Reader implementation such as {@link java.io.BufferedReader}. Performance may be even likely better if you do not. Use {@link #build(Path, Charset)} for optimal performance when reading files.

If you have to (or want to) pass in a Reader, you can simply pass in the InputStreamReader directly, without the BufferedReader.
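
For optimal performance when reading from a file, letting FastCSV handle the I/O itself could look like this (a sketch; "file.csv" is just a placeholder):

try (CsvReader csv = CsvReader.builder().build(Paths.get("file.csv"), StandardCharsets.UTF_8)) {
    for (CsvRow row : csv) {
        // process row
    }
} catch (IOException e) {
    // handle the I/O error
}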

Another optimization (not yet possible in FastCSV) would be a method in CsvWriter that allows writing a CsvRow (and NamedCsvRow) directly, without calling getFields() first. I will consider this.

To display the data I use the code below. Here I use a cache with a maximum of 100 rows (fewer than 100 rows are visible)

Again, some extra code to handle byte[] and Strings. Why not simply cache CsvRow or String[]? As it is only 100 records, the memory footprint should be negligible compared to the code you have to maintain.

Regarding the cache – you may replace it with a simple LRU cache implementation based on LinkedHashMap (see java.util.LinkedHashMap#removeEldestEntry) or with Google's Guava CacheBuilder.
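
A minimal LRU sketch based on removeEldestEntry (the limit of 100 matches your code above):

private final Map<Integer, String[]> cache =
    new LinkedHashMap<Integer, String[]>(128, 0.75f, true) { // access order => LRU
        @Override
        protected boolean removeEldestEntry(final Map.Entry<Integer, String[]> eldest) {
            return size() > 100;
        }
    };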

dprutean commented 2 years ago

Thank you for your suggestions. I fixed my code to use streams as you suggested. The cache already uses String[]. The memory usage still does not drop. May I request three improvements?

  1. Provide an iterator over RowHandler instead of CsvRow. A CsvRow is instantiated for every row, which costs memory. It is fine to keep iterating over the same object, as I copy the values from it into my own code. I think everybody will do it like this; nobody will keep the CsvRow around in their code.
  2. In RowHandler, change the row from a String[] to a List. That will again reduce the number of allocated objects.
osiegmar commented 2 years ago

The changes are now contained in the develop branch and will be (almost certainly) part of the next release.

Those other changes you're asking for today are something I'm not going to implement because of earlier design decisions. See also #51.

dprutean commented 2 years ago

Just to let you know about the latest tests I did: I added one more iterator method that returns the RowHandler, and in the RowHandler I use a List instead of String[]. With this and some improvements on my side, Java uses about 850 MB when opening a file with 5 million records (600 MB), and opening 10 million records (1.2 GB) uses 1.7 GB, which is reasonable. The interface scrolls without lagging. I am writing this in the hope of convincing you to implement two iterators in your code: one mutable (which generates fewer objects) and one immutable.

If you wish, I can make the editor code public. I also got better memory usage when I declare an initial size for the ByteArrayOutputStream and the ArrayList.
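
For example (the sizes are just illustrative):

ByteArrayOutputStream bos = new ByteArrayOutputStream(1024);  // roughly one record
List<byte[]> lines = new ArrayList<>(5_000_000);              // expected number of records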

osiegmar commented 2 years ago

There's no need to convince me that exposing internal API (the RowHandler) has a positive effect on performance characteristics. A future version of FastCSV might use a more functional API design to support a similar use case. I just don't want to offer a version of FastCSV that exposes internal API, as that might cause support issues, lacks a concise API design, and might even break the API contract (at least in the short term).

If you wish, I can make the editor code public.

That would definitely help doing further optimizations.