uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

How to make InputValueSwitch context-aware when missing type information in rows #478

Open poikilotherm opened 3 years ago

poikilotherm commented 3 years ago

Hi @jbax

Thank you for this very nice library! I do have a (simple? stupid?) question about using your library in an edge case...

I have a TSV file like this (much shortened for readability): [image: grafik]

As you can see, the file format (which has grown over time and most likely cannot be changed) contains three different schemas: MetadataBlock, DatasetFieldTypes and ControlledVocabularyValue. The actual entries carry no type identifier; each one is interpreted in the context of the section header line above it.
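
To make the layout concrete, here is a minimal plain-Java sketch of the grouping I have in mind. It does not use uniVocity at all. The class `SectionSketch`, the helper `groupBySection` and the sample rows are made up for illustration, and the sketch assumes that section header lines are exactly those whose first column starts with `#`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SectionSketch {

    /**
     * Hypothetical helper: assigns every data row to the most recent
     * header line, i.e. the last line whose first column started with '#'.
     */
    static Map<String, List<String[]>> groupBySection(List<String> lines) {
        Map<String, List<String[]>> sections = new LinkedHashMap<>();
        List<String[]> current = null; // rows of the section we are currently in

        for (String line : lines) {
            // limit -1 keeps trailing empty cells
            String[] cells = line.split("\t", -1);
            if (cells[0].startsWith("#")) {
                // Header line: switch context to this section
                current = sections.computeIfAbsent(cells[0], k -> new ArrayList<>());
            } else if (current != null) {
                // Data row without a type identifier: belongs to the current section
                current.add(cells);
            }
            // Rows that appear before any header line are ignored in this sketch
        }
        return sections;
    }

    public static void main(String[] args) {
        // Made-up sample rows, NOT the real file contents
        List<String> lines = Arrays.asList(
                "#metadataBlock\tname\tdisplayName",
                "\tcitation\tCitation Metadata",
                "#datasetField\tname\ttitle",
                "\ttitle\tTitle",
                "\tauthor\tAuthor",
                "#controlledVocabulary\tDatasetField\tValue",
                "\tsubject\tChemistry");

        groupBySection(lines).forEach((section, rows) ->
                System.out.println(section + " -> " + rows.size() + " row(s)"));
    }
}
```

This is the behaviour I would like to get from the switch: the processor chosen on a header line should stay active for all following rows until the next header line.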

Now I would like to use your library for data binding, but I cannot figure out how to make InputValueSwitch context-aware. I'd be glad for any help on this. Thank you! :pray:

(Do I need to implement my own switch, e.g. a ContextSwitch extends AbstractProcessorSwitch, since overriding switchRowProcessor() from AbstractInputValueSwitch isn't possible? Or is there a simpler solution?)

Here's the Java code I have so far:

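import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.common.processor.InputValueSwitch;
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// MetadataBlock, DatasetFieldType and ControlledVocabularyValue are the application's own model classes (imports omitted).

/**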
 * This class will parse a given TSV file for Metadata Blocks, Dataset Fields and Controlled Vocabularies.
 * You may fetch the different parts and use them in tests or to update the database of a real instance.
 */
public class TsvMetadataBlockParser {

    MetadataBlock metadataBlock;
    final List<DatasetFieldType> datasetFields = new ArrayList<>();
    final List<ControlledVocabularyValue> controlledVocabularyValues = new ArrayList<>();

    final TsvParser parser;
    final BeanListProcessor<MetadataBlock> metadataBlockProcessor = new BeanListProcessor<>(MetadataBlock.class);
    final BeanListProcessor<DatasetFieldType> datasetFieldProcessor = new BeanListProcessor<>(DatasetFieldType.class);
    final BeanListProcessor<ControlledVocabularyValue> controlledVocabularyProcessor = new BeanListProcessor<>(ControlledVocabularyValue.class);

    /**
     * Input switch keyed on the first column (index 0).
     * That column contains #metadataBlock, #datasetField or #controlledVocabulary on the lines that switch context.
     */
    final InputValueSwitch inputSwitch = new InputValueSwitch(0);

    public TsvMetadataBlockParser() {
        // Configure InputSwitch
        this.inputSwitch.addSwitchForValue("#metadataBlock", metadataBlockProcessor);
        this.inputSwitch.addSwitchForValue("#datasetField", datasetFieldProcessor);
        this.inputSwitch.addSwitchForValue("#controlledVocabulary", controlledVocabularyProcessor);
        this.inputSwitch.setDefaultSwitch(metadataBlockProcessor); // <- necessary (parsing fails without it), but also the cause of the headaches...

        TsvParserSettings settings = new TsvParserSettings();
        settings.setProcessor(inputSwitch);
        settings.setHeaderExtractionEnabled(true);
        // TODO: add error handler via settings.setProcessorErrorHandler()

        this.parser = new TsvParser(settings);
    }

    public void readTsv(File tsvFile) {
        // Do the parsing...
        parser.parse(tsvFile, StandardCharsets.UTF_8);

        System.out.println(datasetFieldProcessor.getHeaders()); // -> null
        System.out.println(datasetFieldProcessor.getBeans()); // -> empty list
    }
}