dateparser is a smart, high-performance date parser library. It supports hundreds of different formats, covering nearly every format in common use. It is also a showcase for the "retree" algorithm.
MIT License
(How to?) Improve performance when parsing many strings in the same format #17
I was wondering if there is an option to improve the performance even further when parsing many strings that are all in the same format.
My use case is parsing timestamps from a CSV file that has millions of rows, where every timestamp is in the same format.
It would be ideal if I could just say to the parser: "remember that format you detected for the previous string. I'm pretty sure this string is in the same format, so try that first when parsing this string".
To illustrate this, my situation is similar to this benchmark:
package com.github.sisyphsu.dateparser.benchmark;

import com.github.sisyphsu.dateparser.DateParser;
import org.openjdk.jmh.annotations.*;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 2, time = 2)
@BenchmarkMode(Mode.AverageTime)
@Fork(2)
@Measurement(iterations = 3, time = 3)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class MultiSameBenchmark {

    private static String[] TEXTS;

    static {
        // Fixed seed so every run parses the same inputs.
        Random random = new Random(123456789L);
        TEXTS = new String[10000000];
        // Every string shares one format; only the digits vary.
        for (int i = 0; i < TEXTS.length; i++) {
            TEXTS[i] = String.format("2020-0%d-1%d 00:%d%d:00 UTC",
                    random.nextInt(8) + 1,   // month 01..08
                    random.nextInt(8) + 1,   // day 11..18
                    random.nextInt(5),       // minute, tens digit 0..4
                    random.nextInt(9));      // minute, ones digit 0..8
        }
    }

    @Benchmark
    public void parser() {
        // One parser instance is reused for all strings.
        DateParser parser = DateParser.newBuilder().build();
        for (String text : TEXTS) {
            parser.parseDate(text);
        }
    }
}
Is there already such an option on the parser that I overlooked?
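If no such option exists, one caller-side workaround might look like the sketch below. It is not part of dateparser: `FastPathParser` is a hypothetical helper that tries a single known `java.time` pattern first and only hands the string to a general-purpose parser when that pattern does not match. The pattern, the class name, and the assumption that all timestamps are in UTC are mine.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.function.Function;

/** Hypothetical helper: try one fixed pattern first, then a general fallback parser. */
final class FastPathParser {
    private final DateTimeFormatter fastFormat;
    private final Function<String, Instant> fallback;

    FastPathParser(String pattern, Function<String, Instant> fallback) {
        // DateTimeFormatter is immutable and thread-safe, so it is built once and reused.
        this.fastFormat = DateTimeFormatter.ofPattern(pattern);
        this.fallback = fallback;
    }

    Instant parse(String text) {
        try {
            // Fast path: the string matches the expected format (assumed to be UTC).
            return LocalDateTime.parse(text, fastFormat).toInstant(ZoneOffset.UTC);
        } catch (DateTimeParseException e) {
            // Slow path: delegate to the general-purpose parser.
            return fallback.apply(text);
        }
    }
}
```

For the strings in the benchmark, the pattern would be `"uuuu-MM-dd HH:mm:ss 'UTC'"`, and the fallback could be something like `text -> parser.parseDate(text).toInstant()`, assuming `parseDate` returns a `java.util.Date`. The exception-based fallback is only cheap when nearly every row matches the fast pattern, which is the situation described above.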