moshe / elasticsearch_loader

A tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
MIT License
399 stars 83 forks

Add option --parse-types to 'csv' command to discover CSV column data types #62

Closed shaan1337 closed 5 years ago

shaan1337 commented 5 years ago

Add a --parse-types option to the csv command that attempts to discover the data types of CSV columns. This lets dynamic mapping in Elasticsearch work for CSV files, instead of every field being indexed as a string.

The following test file was used: dataset.csv.zip
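For readers unfamiliar with the idea, here is a minimal sketch of what per-value type discovery could look like. It is not the code from this PR: the helper names `parse_value` and `parse_rows` are hypothetical, and the conversion order (int, then float, then boolean, else string) is an assumption chosen for illustration.

```python
import csv
import io


def parse_value(text):
    """Try int, then float, then bool; fall back to the original string.

    Hypothetical helper for illustration, not the PR's implementation.
    """
    if not isinstance(text, str):
        # csv.DictReader yields None for missing trailing fields.
        return text
    stripped = text.strip()
    for cast in (int, float):
        try:
            return cast(stripped)
        except ValueError:
            pass
    lowered = stripped.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return text


def parse_rows(lines):
    """Yield one dict per CSV row with values converted where possible."""
    for row in csv.DictReader(lines):
        yield {key: parse_value(value) for key, value in row.items()}


sample = io.StringIO("name,age,score,active\nalice,30,4.5,true\n")
rows = list(parse_rows(sample))
print(rows)
```

Once the values are real numbers and booleans rather than strings, Elasticsearch's dynamic mapping can assign numeric and boolean field types. Note that Python's `int` and `float` also accept inputs such as `1_000` or `nan`, so a production version would likely need stricter rules.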

moshe commented 5 years ago

Hey @shaan1337, it's always nice to know that people are using esl and are willing to help (: I didn't add this kind of parsing logic to esl because of several concerns, and I wanted to understand the motivation behind the PR:

  1. Elasticsearch can do it for you if you define mappings and pass them with --index-settings-file (see #53 #39)
  2. Type inference can be hard and complicated, and I prefer using a third-party package for it (like https://github.com/frictionlessdata/tabulator-py)

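As an illustration of point 1, the sketch below builds an explicit mapping and serializes it to JSON of the kind that could be saved to a file and passed with --index-settings-file. The field names, the chosen types, and the helper `build_index_body` are hypothetical. The layout shown is the typeless mapping format of Elasticsearch 7 and later; older versions nest `properties` under a document type. That the file is used as the index-creation body is an assumption here.

```python
import json

# Hypothetical column-to-type choices for an example CSV.
column_types = {"name": "keyword", "age": "integer", "score": "float"}


def build_index_body(types):
    """Build an index-creation body with one explicit mapping per field."""
    properties = {field: {"type": es_type} for field, es_type in types.items()}
    return {"mappings": {"properties": properties}}


body = build_index_body(column_types)
settings_json = json.dumps(body, indent=2)
print(settings_json)  # save this output to a file to use it as index settings
```

With a mapping like this in place, values that arrive as strings such as "30" are coerced by Elasticsearch into the declared numeric type, so the loader itself does not need to parse them.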
shaan1337 commented 5 years ago

Hey @moshe, thanks for the quick feedback! I had a use case where the fields could vary and the number of fields in the CSV file was really large (300+), so defining a mapping would be quite hard.

shaan1337 commented 5 years ago

Hello @moshe, I'll close the PR if there are no plans to integrate this option in the near future (it's also a relatively rare use case, I guess). Thank you!

moshe commented 5 years ago

Thank you, @shaan1337. I will ping you if my plans change.