**Closed** — seseso closed this issue 4 years ago
Good catch! Can you share the change you made or contribute it with PR?
What I ended up doing was to pass a com.univocity.parsers.common.AbstractParser to my mapper's constructor and use it in the processRecord method. I would love to create a PR for it but my code is specific to my requirements where I don't use POJOs but java.util.Map instead. Here's an example of my mapper:
```java
import com.univocity.parsers.common.AbstractParser;

import java.util.LinkedHashMap;
import java.util.Map;

import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;
import org.jeasy.batch.core.record.StringRecord;

public class UnivocityRecordMapper implements RecordMapper<StringRecord, Record<Map<String, String>>> {

    private final AbstractParser<?> parser;

    public UnivocityRecordMapper(AbstractParser<?> parser) {
        this.parser = parser;
    }

    @Override
    public Record<Map<String, String>> processRecord(StringRecord record) throws Exception {
        Map<String, String> rValue = new LinkedHashMap<>();
        String payload = record.getPayload();
        com.univocity.parsers.common.record.Record rec = parser.parseRecord(payload);
        rec.fillFieldMap(rValue);
        return new GenericRecord<>(record.getHeader(), rValue);
    }
}
```
After some benchmarks, it seems the performance issue is not related to the fact that the parser is not being shared. I tried to extract it from the processRecord method and the performance is the same. Here is the modified version:
```java
import com.univocity.parsers.common.AbstractParser;
import com.univocity.parsers.common.CommonParserSettings;
import com.univocity.parsers.common.processor.BeanListProcessor;

import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;

import java.io.StringReader;

/**
 * A record mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records to
 * domain objects.
 *
 * @param <T> The target domain object type.
 * @param <S> The settings type that is used to configure the parser.
 * @author Anthony Bruno (anthony.bruno196@gmail.com)
 */
abstract class AbstractUnivocityRecordMapper<T, S extends CommonParserSettings<?>> implements RecordMapper<String, T> {

    S settings;
    private final AbstractParser<S> parser;
    private final BeanListProcessor<T> beanListProcessor;

    /**
     * Creates a new mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records
     * to domain objects.
     *
     * @param recordClass the target type
     * @param settings    the settings that are used to configure the parser
     */
    AbstractUnivocityRecordMapper(Class<T> recordClass, S settings) {
        this.settings = settings;
        this.beanListProcessor = new BeanListProcessor<>(recordClass, 1);
        this.settings.setProcessor(beanListProcessor);
        this.parser = getParser();
    }

    @Override
    public Record<T> processRecord(Record<String> record) {
        String payload = record.getPayload();
        parser.parse(new StringReader(payload));
        T result = beanListProcessor.getBeans().get(0);
        return new GenericRecord<>(record.getHeader(), result);
    }

    protected abstract AbstractParser<S> getParser();
}
```
I even extracted the BeanListProcessor from that method, but that didn't help either. (Even though this did not improve the performance, I still introduced the change as I think it's a good refactoring, see 926396bee45f50ff0b665d97646a873900f63735.)
The real issue is that your version converts the parsed record to a Map (i.e., no reflection is used), while the current version converts the record to a Java bean (with reflection behind the scenes).
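The gap is consistent with reflection overhead alone, which can be shown without either library. Below is a self-contained sketch (the class and method names are made up for illustration, and absolute timings vary by JVM) contrasting a plain `Map.put` per field with a reflective setter call per record:

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReflectionCostDemo {

    // Hypothetical stand-in for a mapped bean (not the TestBean from this thread)
    public static class Bean {
        private String name;
        public void setName(String name) { this.name = name; }
        public String getName() { return name; }
    }

    // Map-based conversion: one plain put per field, no reflection involved
    public static long timeMapPuts(int iterations) {
        Map<String, String> map = new LinkedHashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            map.put("name", "foo");
        }
        return System.nanoTime() - start;
    }

    // Bean-based conversion: look up and invoke the setter reflectively,
    // roughly what happens behind the scenes for every record
    public static long timeReflectiveSets(int iterations) throws Exception {
        Bean bean = new Bean();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Method setter = Bean.class.getMethod("setName", String.class);
            setter.invoke(bean, "foo");
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        int n = 1_000_000;
        System.out.println("map puts:        " + timeMapPuts(n) / 1_000_000 + " ms");
        System.out.println("reflective sets: " + timeReflectiveSets(n) / 1_000_000 + " ms");
    }
}
```

Caching the `Method` outside the loop narrows the gap considerably, so reflection cost alone may not explain the full difference measured below, but the direction is the same.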
I tried to take Easy Batch out of the picture and compare these two ways of converting data with univocity's APIs only:
```java
@Test
public void testUnivocityPerformance() {
    // /!\ This is quick and dirty, should be a JMH-based throughput micro-bench but I didn't have time to do it :-(
    String payload = "foo,bar,15,true";

    // parse and convert to map (1021ms)
    CsvParser parser = new CsvParser(new CsvParserSettings());
    Map<String, String> rValue = new LinkedHashMap<>();
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000; i++) {
        com.univocity.parsers.common.record.Record record = parser.parseRecord(payload);
        record.fillFieldMap(rValue);
    }
    System.out.println((System.currentTimeMillis() - startTime) + "ms");

    // parse and convert to bean (527917ms)
    CsvParserSettings settings = new CsvParserSettings();
    BeanListProcessor<TestBean> beanListProcessor = new BeanListProcessor<>(TestBean.class);
    settings.setProcessor(beanListProcessor);
    parser = new CsvParser(settings);
    startTime = System.currentTimeMillis();
    TestBean testBean;
    for (int i = 0; i < 1000000; i++) {
        parser.parse(new StringReader(payload));
        testBean = beanListProcessor.getBeans().get(0);
    }
    System.out.println((System.currentTimeMillis() - startTime) + "ms");
}
```
TestBean is the same as the one here. The difference is huge: ~1s vs ~8min.
> I would love to create a PR for it but my code is specific to my requirements where I don't use POJOs but java.util.Map instead
Indeed, that's not compatible with the current code and we could not merge it as is. Thank you for raising the issue anyway!
Due to the instantiation of the univocity parser for each record, the performance is very poor (for 1 million lines, it took me more than 3 minutes, versus less than 2 seconds with a "shared" instance of the parser).
To overcome this, I would suggest moving the creation of the parser out of the processRecord method.
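That suggestion can be sketched independently of the real Easy Batch and univocity types (`FakeParser` and `Mapper` below are illustrative stand-ins, not library classes): build the parser once in the mapper's constructor and reuse it for every record, instead of constructing one inside processRecord.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedParserSketch {

    // Stand-in for an expensive-to-construct parser (illustrative, not the univocity class)
    public static class FakeParser {
        public static final AtomicInteger CONSTRUCTED = new AtomicInteger();
        public FakeParser() { CONSTRUCTED.incrementAndGet(); }
        public String[] parse(String line) { return line.split(","); }
    }

    // Anti-pattern: a fresh parser for every record
    public static String[] mapPerRecord(String record) {
        return new FakeParser().parse(record);
    }

    // Suggested pattern: the parser is created once and shared across all records
    public static class Mapper {
        private final FakeParser parser = new FakeParser();
        public String[] map(String record) { return parser.parse(record); }
    }

    public static void main(String[] args) {
        int before = FakeParser.CONSTRUCTED.get();
        Mapper mapper = new Mapper();
        for (int i = 0; i < 1_000_000; i++) {
            mapper.map("foo,bar,15,true");
        }
        // A single parser served one million records
        System.out.println("parsers constructed: " + (FakeParser.CONSTRUCTED.get() - before));
    }
}
```

With the per-record variant, the construction counter grows with the record count; with the shared variant it stays at one, which is the behavior the constructor-injected mapper above achieves.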