jehugaleahsa / FlatFiles

Reads and writes CSV, fixed-length and other flat file formats with a focus on schema definition, configuration and speed.
The Unlicense

Support List/Array? Or custom parser? #26

Closed vejuhust closed 6 years ago

vejuhust commented 6 years ago

Given a .tsv file like this:

1   -0.859164070|-0.0467129|-0.196854265|-0.1378887|0.08451792
11  0.676197956|-0.00184831|0.215737675|0.121956684|0.932423124
21  0.81346543|-0.04809262|0.20222138|-0.9014757191|-0.004015758

Its ideal schema would be:

public class Foo
{
    public long Index { get; set; }
    public List<double> Vector { get; set; }
}

But a list or array is not supported as a property type, so I have to map the second column to a RawVector string and then write post-processing code that splits and parses it into the real vector.

public class Foo
{
    public long Index { get; set; }
    public string RawVector { get; set; }
    public List<double> Vector { get; set; }
}
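For illustration, the post-processing step described above might look roughly like the sketch below. FooPostProcessor and FillVector are hypothetical names, not part of FlatFiles, and the sketch assumes every element is a valid invariant-culture double separated by '|'.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public class Foo
{
    public long Index { get; set; }
    public string RawVector { get; set; }
    public List<double> Vector { get; set; }
}

// Hypothetical helper, not part of FlatFiles.
public static class FooPostProcessor
{
    // Splits the raw second column on '|' and parses each piece as a double.
    public static void FillVector(Foo foo)
    {
        foo.Vector = foo.RawVector
            .Split('|')
            .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
            .ToList();
    }
}
```

Calling FooPostProcessor.FillVector(foo) after each record is read leaves RawVector untouched and fills Vector from it.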

So my feature request is: do you have any plans to support lists or arrays, or to provide an interface that lets developers implement and integrate a custom type parser?

Thanks!

jehugaleahsa commented 6 years ago

Let me run through some scenarios with you:

Property is a List<T>, HashSet<T>

In this case, I think it would be nice to define a property like this:

mapper.IndexProperty(x => x.Vector, 0);
mapper.IndexProperty(x => x.Vector, 1);
mapper.IndexProperty(x => x.Vector, 2);

In that case, when creating a new Foo, I would have to know whether to set foo.Vector to a new List<double>() or assume there's already one there and just call Add on it. I guess technically, I could check for null, right?

Property is an IEnumerable<T>, ICollection<T> or IList<T>

In this case, the actual type of the collection isn't known, unless it is non-null. If I had to initialize Vector, I would have to choose what type of collection to use (probably a List<double>) since the values are probably ordered. Once I knew the type, I could just call Add again.

Property is an ISet<T>

In this case, I have to initialize the collection to a HashSet<T>, unless it is non-null.

Property is an IList<T>, but it is really a T[]

Even though arrays implement IList<T>, they are fixed-size, so calling Add throws a NotSupportedException. In that case, I would have to make sure the array is big enough to hold the number of values being parsed. If not, I may need to set the array property to a new T[] whose size is max(index) + 1.
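As a rough sketch of the array case, assuming the largest mapped index is already known: ArrayGrower and EnsureCapacity are hypothetical names, and Array.Resize is the standard .NET call that allocates a larger array and copies the existing elements.

```csharp
using System;

// Hypothetical helper, not part of FlatFiles.
public static class ArrayGrower
{
    // Makes sure `array` can be indexed at `maxIndex`, growing it if needed.
    public static void EnsureCapacity<T>(ref T[] array, int maxIndex)
    {
        if (maxIndex < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxIndex));
        }

        int required = maxIndex + 1;
        if (array == null || array.Length < required)
        {
            // Array.Resize creates a new array (also when `array` is null)
            // and copies any existing elements into it.
            Array.Resize(ref array, required);
        }
    }
}
```

Because the array is replaced rather than grown in place, a mapper would still have to write the new array back to the property.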

A general-purpose solution

I think it is a little counter-intuitive to call Add on a collection that may already contain values. If you say IndexProperty(x => x.Vector, 0), I think that says the value at index = 0 should be set to the parsed value. So, a more general solution is to always make sure the collection has enough elements to contain all the indexes. Mind you, there's no guarantee that the indexes passed to IndexProperty need to be contiguous; this should also be legal:

mapper.IndexProperty(x => x.Vector, 0);
mapper.IndexProperty(x => x.Vector, 1);
mapper.IndexProperty(x => x.Vector, 55432);
mapper.IndexProperty(x => x.Vector, 2);

Here's what I am thinking:

  1. If the property is null, initialize it to a new collection:
     1.a. For IEnumerable, ICollection or IList, set the value to List<object>.
     1.b. For IEnumerable<T>, ICollection<T> or IList<T>, set the value to List<T>.
     1.c. For ISet<T>, set the value to HashSet<T>.
     1.d. If the property is a concrete type that implements IList<T>, initialize it to that concrete type.
     1.e. If the property is a concrete type that implements ISet<T>, initialize it to that concrete type.
     1.f. Otherwise, it is not recognized, so throw an exception.
  2. If the property is not null:
     2.a. If the value is an instance of IList or IList<T>, throw an exception if (Count <= maxIndex).
     2.b. Otherwise, throw an exception since we cannot index into the collection.
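The type-selection rules in step 1 could be sketched with reflection roughly as follows. This is only an illustration of the rules as listed: CollectionInitializer and ChooseConcreteType are hypothetical names, step 2 is not covered, and arrays are deliberately rejected here because they need a known length rather than a default constructor.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of FlatFiles.
public static class CollectionInitializer
{
    // Picks the concrete type to instantiate for a null collection property.
    public static Type ChooseConcreteType(Type propertyType)
    {
        // 1.a: non-generic interfaces fall back to List<object>.
        if (propertyType == typeof(IEnumerable) ||
            propertyType == typeof(ICollection) ||
            propertyType == typeof(IList))
        {
            return typeof(List<object>);
        }

        if (propertyType.IsInterface && propertyType.IsGenericType)
        {
            Type definition = propertyType.GetGenericTypeDefinition();
            Type[] args = propertyType.GetGenericArguments();

            // 1.b: generic sequence interfaces become List<T>.
            if (definition == typeof(IEnumerable<>) ||
                definition == typeof(ICollection<>) ||
                definition == typeof(IList<>))
            {
                return typeof(List<>).MakeGenericType(args);
            }

            // 1.c: ISet<T> becomes HashSet<T>.
            if (definition == typeof(ISet<>))
            {
                return typeof(HashSet<>).MakeGenericType(args);
            }
        }

        // 1.d and 1.e: concrete types implementing IList<T> or ISet<T> are used as-is.
        bool isConcrete = !propertyType.IsInterface && !propertyType.IsAbstract && !propertyType.IsArray;
        if (isConcrete && propertyType.GetInterfaces().Any(i =>
                i.IsGenericType &&
                (i.GetGenericTypeDefinition() == typeof(IList<>) ||
                 i.GetGenericTypeDefinition() == typeof(ISet<>))))
        {
            return propertyType;
        }

        // 1.f: anything else is not recognized.
        throw new NotSupportedException("Unsupported collection type: " + propertyType);
    }
}
```

The returned type could then be instantiated with Activator.CreateInstance, assuming it has a public parameterless constructor.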

When initializing a new List as part of 1, the code will need to pre-populate the collection so it has a Count = maxIndex + 1, something like this:

x.Vector = new List<double>(Enumerable.Repeat(default(double), maxIndex + 1));
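To make the non-contiguous case concrete, here is a small sketch that derives maxIndex from the mapped indexes and pre-populates the list accordingly. SizedListFactory is a hypothetical name, and the sketch assumes at least one index has been mapped (Max throws on an empty sequence).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of FlatFiles.
public static class SizedListFactory
{
    // Creates a list whose Count is maxIndex + 1, filled with default(T),
    // so that every mapped index can be assigned directly.
    public static List<T> Create<T>(IEnumerable<int> mappedIndexes)
    {
        int maxIndex = mappedIndexes.Max();
        return new List<T>(Enumerable.Repeat(default(T), maxIndex + 1));
    }
}
```

With the indexes 0, 1, 55432 and 2 from the example above, SizedListFactory.Create<double>(new[] { 0, 1, 55432, 2 }) yields a list of 55433 zeros, so a single sparse index can be costly in memory.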

Maybe something entirely different for sets

Maybe instead of trying to reuse IndexProperty for sets and other ICollections that may not support indexing, we instead have a separate method that just calls Add. Something like mapper.ElementProperty(x => x.Items). This would go through a similar initialization as above, except there'd no longer be a need to check for a max length or reserve elements in a new list.
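The Add-only behaviour suggested for ElementProperty might reduce to something like this sketch. ElementAppender and Append are hypothetical names, and defaulting a null collection to List<T> is an assumption; a real implementation would presumably pick the type from the property as discussed earlier.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of FlatFiles.
public static class ElementAppender
{
    // Adds `value` to `collection`, creating a List<T> first if it is null.
    // No index bookkeeping is needed because nothing is addressed by position.
    public static ICollection<T> Append<T>(ICollection<T> collection, T value)
    {
        if (collection == null)
        {
            collection = new List<T>();
        }

        collection.Add(value);
        return collection;
    }
}
```

With a HashSet<T>, appending the same value twice leaves a single element, which is one way sets would behave differently from lists under this approach.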

I hope you can see there is a lot to think about when it comes to supporting your request. Then there's the performance aspect, which I don't even want to start thinking about.

jehugaleahsa commented 6 years ago

Another possibility is someone wanting to map non-primitive types into a collection. Say you want a collection of People with Id, Name and BirthDate properties. The first person's properties are at indexes 0, 1, and 2. Then 3, 4, and 5 belong to the second person, and so on. That's a whole other level of complexity.
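A hand-written version of that layout, outside any mapper, could look like the following sketch. Person, PeopleParser and the yyyy-MM-dd date format are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime BirthDate { get; set; }
}

// Hypothetical helper, not part of FlatFiles.
public static class PeopleParser
{
    private const int FieldsPerPerson = 3;

    // Reads consecutive groups of three fields (Id, Name, BirthDate) into people.
    public static List<Person> Parse(IReadOnlyList<string> fields)
    {
        if (fields.Count % FieldsPerPerson != 0)
        {
            throw new FormatException("Field count must be a multiple of " + FieldsPerPerson + ".");
        }

        var people = new List<Person>();
        for (int i = 0; i < fields.Count; i += FieldsPerPerson)
        {
            people.Add(new Person
            {
                Id = int.Parse(fields[i], CultureInfo.InvariantCulture),
                Name = fields[i + 1],
                // Assumed date format; a real file may use something else.
                BirthDate = DateTime.ParseExact(fields[i + 2], "yyyy-MM-dd", CultureInfo.InvariantCulture)
            });
        }

        return people;
    }
}
```

Even this simple version has to decide what to do with a trailing partial group, which hints at the extra complexity a general mapper would face.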

When you start talking indexes, you also should consider whether it makes sense to handle IDictionary<string, T> or IDictionary<int, T>, as well.

Don't get me wrong; I like these ideas and I'm thankful for the suggestion.

vejuhust commented 6 years ago

@jehugaleahsa Wow, thanks for your detailed explanation! You walked through every scenario for linear data structures, and it's hard to come up with an elegant solution. So I'll keep using my post-processing logic 😄 Apart from feature requests, it'd be great if you could write detailed documentation or even build a doc website for this library. It would help new users and spread the word about this library.

jehugaleahsa commented 6 years ago

@vejuhust It's been more than a year, but I finally had an idea about how to allow for more complex mappings. As of v3.0.0 (still in beta), you should be able to use the CustomMapping method to perform custom serialization/deserialization -- including adding to collections. See the example in the README.

vejuhust commented 6 years ago

Thanks for your effort! 👍 Btw, it's not >1 year, just 0.5 years actually. 🤣